[00:24:29] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:12] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101233
[00:38:12] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101233 (owner: TrainBranchBot)
[00:55:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101233 (owner: TrainBranchBot)
[01:08:11] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101234
[01:08:11] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101234 (owner: TrainBranchBot)
[01:28:24] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101234 (owner: TrainBranchBot)
[01:36:48] PROBLEM - MD RAID on aqs1014 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:36:49] ACKNOWLEDGEMENT - MD RAID on aqs1014 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T381742 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:36:56] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742 (ops-monitoring-bot) NEW
[01:37:31] FIRING: Primary outbound port 
utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[01:38:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[01:42:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[01:43:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:40:43] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:42] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100217 (https://phabricator.wikimedia.org/T33951) (owner: Tim Starling)
[02:58:22] (Merged) jenkins-bot: Prepare for migration of the Interwiki extension to core [mediawiki-config] - https://gerrit.wikimedia.org/r/1100217 (https://phabricator.wikimedia.org/T33951) (owner: Tim Starling)
[02:59:04] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1100217|Prepare for migration of the Interwiki extension to core (T33951)]]
[02:59:08] T33951: Merge Interwiki extension into MediaWiki core - 
https://phabricator.wikimedia.org/T33951
[03:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:43] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:10:36] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1100217|Prepare for migration of the Interwiki extension to core (T33951)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[03:10:39] T33951: Merge Interwiki extension into MediaWiki core - https://phabricator.wikimedia.org/T33951
[03:20:20] !log tstarling@deploy2002 tstarling: Continuing with sync
[03:30:21] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100217|Prepare for migration of the Interwiki extension to core (T33951)]] (duration: 31m 17s)
[03:30:25] T33951: Merge Interwiki extension into MediaWiki core - https://phabricator.wikimedia.org/T33951
[03:44:18] !log tstarling@deploy2002 Started deploy [restbase/deploy@6d0b97e]: no-op test deploy
[03:55:40] !log tstarling@deploy2002 Finished deploy [restbase/deploy@6d0b97e]: no-op test deploy (duration: 11m 22s)
[03:57:59] !log tstarling@deploy2002 Started deploy [restbase/deploy@27f4a8e]: add 3 wikis T380726
[03:58:03] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726
[04:08:45] !log tstarling@deploy2002 Finished deploy [restbase/deploy@27f4a8e]: add 3 wikis T380726 (duration: 10m 46s)
[04:08:49] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726
[04:20:37] !log tstarling@deploy2002 Started deploy [restbase/deploy@27f4a8e]: try 
again, seems like restbase2026 at least was skipped T380726
[04:20:40] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726
[04:24:29] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:29:37] !log tstarling@deploy2002 Finished deploy [restbase/deploy@27f4a8e]: try again, seems like restbase2026 at least was skipped T380726 (duration: 09m 00s)
[04:29:40] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726
[04:31:15] !log tstarling@deploy2002 Started deploy [restbase/deploy@0531d4e]: try again after removing decom servers T380790 T380726
[04:31:20] T380790: decommission restbase202[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T380790
[04:45:51] !log tstarling@deploy2002 Finished deploy [restbase/deploy@0531d4e]: try again after removing decom servers T380790 T380726 (duration: 14m 36s)
[04:45:56] T380790: decommission restbase202[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T380790
[04:45:57] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726
[05:23:42] !log tstarling@deploy2002 Started deploy [restbase/deploy@8184836]: also deploy to restbase2036-9 T380726 T377896
[05:23:48] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726
[05:23:48] T377896: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896
[05:39:49] !log tstarling@deploy2002 Finished deploy [restbase/deploy@8184836]: also deploy to restbase2036-9 T380726 T377896 (duration: 16m 06s)
[05:39:52] here I am getting old waiting for this deployment to finish for the 5th time, I wonder what is taking so long?
[05:39:54] T380726: Create Wikivoyage Indonesian - https://phabricator.wikimedia.org/T380726
[05:39:54] T377896: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896
[05:39:59] 1154875 | \_ /var/lib/scap/scap/bin/python3 /usr/bin/scap deploy-local -v --repo restbase/deploy -g default promote --refresh-config
[05:39:59] 1155000 | \_ sleep 52
[05:40:11] at least I know how to speed it up in future
[05:40:22] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[05:41:31] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[05:53:44] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[05:54:42] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[05:58:06] (PS1) Tim Starling: Enable canShellboxGetTempUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/1101239 (https://phabricator.wikimedia.org/T292322)
[06:28:50] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[06:29:48] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[06:49:45] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[06:50:40] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[07:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:05:43] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:18:51] !log homer 'cr*eqiad*' commit 'T377876'
[07:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:55] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876
[07:21:47] ops-eqiad, SRE, DC-Ops, serviceops: Comm Error: backplane 0 when reimaging wikikube-worker1057 - https://phabricator.wikimedia.org/T381676#10389323 (Jelto) The following commands have to be executed when the host is back (just noting it down so I don't forget it): ` cookbook sre.hosts.reimage --...
[07:34:32] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1056.eqiad.wmnet
[07:34:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1056.eqiad.wmnet
[07:35:24] ops-eqiad, SRE, collaboration-services, DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10389333 (Jelto)
[07:38:18] (CR) Muehlenhoff: [C:+2] Deprecate system::role for wikireplicas roles [puppet] - https://gerrit.wikimedia.org/r/1101068 (owner: Muehlenhoff)
[07:41:26] (CR) Muehlenhoff: [C:+2] maps: Remove support for osm2pgsql as OSM engine [puppet] - https://gerrit.wikimedia.org/r/1100784 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:44:11] (PS1) Jelto: Rename kubernetes[1039-1042] to wikikube-worker[1064-1067] [puppet] - https://gerrit.wikimedia.org/r/1101449 (https://phabricator.wikimedia.org/T377876)
[07:52:50] (CR) Muehlenhoff: [C:+2] osm_master: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1100788 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:55:41] (PS4) Anzx: jawiki: lift IP cap on 2024-12-17 and 2025-01-14 for Editation [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1101231 (https://phabricator.wikimedia.org/T381729)
[07:56:06] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101231 (https://phabricator.wikimedia.org/T381729) (owner: Anzx)
[07:56:53] (PS2) Anzx: idwikivoyage: add timezone, sitename and project namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1101185 (https://phabricator.wikimedia.org/T381080)
[07:57:04] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101185 (https://phabricator.wikimedia.org/T381080) (owner: Anzx)
[07:58:42] (PS1) Elukey: modules: add helper_1.1.4.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/1101450
[07:58:42] (PS1) Elukey: modules: remove tpl() usage in base:helper's resourcesDataChecksum [deployment-charts] - https://gerrit.wikimedia.org/r/1101451
[07:58:42] (PS1) Elukey: [WIP] charts: Add kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101452
[08:00:04] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T0800).
[08:00:05] mszabo and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:13] o/
[08:02:15] (CR) Brouberol: [C:+1] modules: add helper_1.1.4.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/1101450 (owner: Elukey)
[08:03:49] (CR) Brouberol: "Let's add a changelog entry?" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1101451 (owner: Elukey)
[08:05:05] (PS2) Muehlenhoff: maps: Allow disabling the installation of kartotherian [puppet] - https://gerrit.wikimedia.org/r/1100456
[08:07:50] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1100456 (owner: Muehlenhoff)
[08:08:13] (PS2) Elukey: modules: remove tpl() usage in base:helper's resourcesDataChecksum [deployment-charts] - https://gerrit.wikimedia.org/r/1101451
[08:08:13] (PS2) Elukey: [WIP] charts: Add kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101452
[08:08:25] (CR) Elukey: "Right added!" [deployment-charts] - https://gerrit.wikimedia.org/r/1101451 (owner: Elukey)
[08:15:12] (PS1) Muehlenhoff: Add a define to determine the postgresql version used for a Debian release [puppet] - https://gerrit.wikimedia.org/r/1101454
[08:15:19] (CR) Brouberol: [C:+1] "Nicely done!" [deployment-charts] - https://gerrit.wikimedia.org/r/1101451 (owner: Elukey)
[08:17:08] (CR) Elukey: [C:+2] modules: add helper_1.1.4.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/1101450 (owner: Elukey)
[08:17:10] o/
[08:17:13] (CR) Elukey: [C:+2] modules: remove tpl() usage in base:helper's resourcesDataChecksum [deployment-charts] - https://gerrit.wikimedia.org/r/1101451 (owner: Elukey)
[08:17:20] (PS3) Elukey: modules: remove tpl() usage in base:helper's resourcesDataChecksum [deployment-charts] - https://gerrit.wikimedia.org/r/1101451
[08:17:20] (CR) CI reject: [V:-1] modules: remove tpl() usage in base:helper's resourcesDataChecksum [deployment-charts] - https://gerrit.wikimedia.org/r/1101451 (owner: Elukey)
[08:17:34] (CR) Elukey: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1101451 (owner: Elukey)
[08:19:31] (Merged) jenkins-bot: modules: remove tpl() usage in base:helper's resourcesDataChecksum [deployment-charts] - 
https://gerrit.wikimedia.org/r/1101451 (owner: Elukey)
[08:23:05] (PS3) Muehlenhoff: maps: Allow disabling the installation of kartotherian [puppet] - https://gerrit.wikimedia.org/r/1100456 (https://phabricator.wikimedia.org/T381565)
[08:24:29] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:24:57] ops-eqiad, SRE, DC-Ops, Discovery-Search, Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10389360 (elukey) >>! In T378368#10386835, @elukey wrote: > I am reviewing the quote of these nodes to figure out what t...
[08:26:29] (PS1) JMeybohm: Enable pki external service in cfssl-issuer deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1101455
[08:28:17] (CR) JMeybohm: [C:+1] Rename kubernetes[1039-1042] to wikikube-worker[1064-1067] [puppet] - https://gerrit.wikimedia.org/r/1101449 (https://phabricator.wikimedia.org/T377876) (owner: Jelto)
[08:29:45] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1039-1042].eqiad.wmnet
[08:32:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1039-1042].eqiad.wmnet
[08:32:33] (PS2) JMeybohm: Enable pki external service in cfssl-issuer deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1101455
[08:32:33] (PS1) JMeybohm: cfssl-issuer: Add external_services to chart fixture [deployment-charts] - https://gerrit.wikimedia.org/r/1101456
[08:32:35] (CR) Jelto: [C:+2] Rename kubernetes[1039-1042] to wikikube-worker[1064-1067] [puppet] - https://gerrit.wikimedia.org/r/1101449 (https://phabricator.wikimedia.org/T377876) (owner: Jelto)
[08:33:08] (PS1) 
Elukey: sre.hosts.provision: add uefi only devices for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1101457 (https://phabricator.wikimedia.org/T378368)
[08:34:39] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:35:10] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:35:15] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1039 to wikikube-worker1064
[08:35:35] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[08:35:43] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:36:29] (PS2) Elukey: sre.hosts.provision: add uefi only devices for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1101457 (https://phabricator.wikimedia.org/T378368)
[08:36:44] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:38:27] (CR) CI reject: [V:-1] [WIP] charts: Add kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101452 (owner: Elukey)
[08:39:14] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1039 to wikikube-worker1064 - jelto@cumin1002"
[08:40:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1039 to wikikube-worker1064 - jelto@cumin1002"
[08:40:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:40:11] !log jelto@cumin1002 
START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1064
[08:40:54] (CR) Elukey: [C:+1] maps: Allow disabling the installation of kartotherian [puppet] - https://gerrit.wikimedia.org/r/1100456 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:41:59] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:42:00] (CR) Elukey: [C:+1] Add a define to determine the postgresql version used for a Debian release [puppet] - https://gerrit.wikimedia.org/r/1101454 (owner: Muehlenhoff)
[08:42:42] (CR) Muehlenhoff: [C:+2] maps: Allow disabling the installation of kartotherian [puppet] - https://gerrit.wikimedia.org/r/1100456 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:42:43] (CR) Jelto: [C:+1] "looks good to me" [deployment-charts] - https://gerrit.wikimedia.org/r/1101455 (owner: JMeybohm)
[08:42:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1064
[08:42:58] (CR) Jelto: [C:+1] "looks good to me" [deployment-charts] - https://gerrit.wikimedia.org/r/1101456 (owner: JMeybohm)
[08:43:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1039 to wikikube-worker1064
[08:45:37] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1040 to wikikube-worker1065
[08:45:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on kubernetes1040:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:46:01] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[08:49:37] (CR) JMeybohm: [C:+2] Enable pki external service in cfssl-issuer deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1101455 (owner: 
JMeybohm)
[08:49:40] (CR) JMeybohm: [C:+2] cfssl-issuer: Add external_services to chart fixture [deployment-charts] - https://gerrit.wikimedia.org/r/1101456 (owner: JMeybohm)
[08:49:51] (CR) JMeybohm: [V:+2 C:+2] "Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1099837 (owner: Wziko)
[08:50:14] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1040 to wikikube-worker1065 - jelto@cumin1002"
[08:50:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1040 to wikikube-worker1065 - jelto@cumin1002"
[08:50:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:50:49] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1065
[08:52:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1065
[08:53:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1040 to wikikube-worker1065
[08:53:26] (Merged) jenkins-bot: feat(cfssl-issuer): change default value for external_services in cfssl issuer helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/1099837 (owner: Wziko)
[08:53:41] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1041 to wikikube-worker1066
[08:53:43] (Merged) jenkins-bot: cfssl-issuer: Add external_services to chart fixture [deployment-charts] - https://gerrit.wikimedia.org/r/1101456 (owner: JMeybohm)
[08:53:43] (Merged) jenkins-bot: Enable pki external service in cfssl-issuer deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1101455 (owner: JMeybohm)
[08:54:01] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[08:57:47] !log jelto@cumin1002 START - 
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1041 to wikikube-worker1066 - jelto@cumin1002"
[08:58:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1041 to wikikube-worker1066 - jelto@cumin1002"
[08:58:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:58:09] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1066
[08:59:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1066
[08:59:26] (CR) Volans: "question inline" [cookbooks] - https://gerrit.wikimedia.org/r/1101457 (https://phabricator.wikimedia.org/T378368) (owner: Elukey)
[08:59:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1041 to wikikube-worker1066
[09:00:51] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1042 to wikikube-worker1067
[09:00:59] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wikikube-worker[2074-2075,2091,2124].codfw.wmnet with reason: reimage
[09:01:11] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:01:19] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wikikube-worker[2074-2075,2091,2124].codfw.wmnet with reason: reimage
[09:02:10] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2074-2075,2091,2124].codfw.wmnet
[09:04:25] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2074-2075,2091,2124].codfw.wmnet
[09:04:39] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming 
kubernetes1042 to wikikube-worker1067 - jelto@cumin1002"
[09:04:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1042 to wikikube-worker1067 - jelto@cumin1002"
[09:04:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:04:58] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1067
[09:05:12] (PS1) Muehlenhoff: maps::postgresql_common: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1101461
[09:05:23] (PS2) Muehlenhoff: maps::postgresql_common: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1101461
[09:05:47] SRE, serviceops, Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10389434 (JMeybohm)
[09:06:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1067
[09:07:02] (CR) Harroyo-wmf: [C:+1] dialog: Fix wrong title on Types of unacceptable behavior step [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1101069 (https://phabricator.wikimedia.org/T381529) (owner: Máté Szabó)
[09:07:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1042 to wikikube-worker1067
[09:07:40] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1064.eqiad.wmnet wikikube-worker1065.eqiad.wmnet wikikube-worker1066.eqiad.wmnet wikikube-worker1067.eqiad.wmnet on all recursors
[09:07:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1064.eqiad.wmnet wikikube-worker1065.eqiad.wmnet wikikube-worker1066.eqiad.wmnet wikikube-worker1067.eqiad.wmnet on all recursors
[09:08:14] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - 
AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:09:33] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1101461 (owner: Muehlenhoff)
[09:10:29] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1064.eqiad.wmnet with OS bookworm
[09:10:53] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1065.eqiad.wmnet with OS bookworm
[09:11:37] ACKNOWLEDGEMENT - MD RAID on wikikube-worker2091 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T381747 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:11:43] ops-codfw, SRE, DC-Ops: Degraded RAID on wikikube-worker2091 - https://phabricator.wikimedia.org/T381747 (ops-monitoring-bot) NEW
[09:12:07] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1066.eqiad.wmnet with OS bookworm
[09:12:35] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1067.eqiad.wmnet with OS bookworm
[09:12:36] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:13:05] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2074.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:13:43] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2075.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:14:00] (PS3) Muehlenhoff: maps::postgresql_common: Avoid Ferm-specific syntax [puppet] - 
https://gerrit.wikimedia.org/r/1101461
[09:14:15] (CR) Harroyo-wmf: [C:+1] dialog: Fix spacing between buttons in the dialog footer [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1101070 (https://phabricator.wikimedia.org/T381530) (owner: Máté Szabó)
[09:14:18] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2091.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:14:36] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 44, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:16:08] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2124.codfw.wmnet with OS bookworm
[09:16:19] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2124.codfw.wmnet with OS bookworm
[09:16:49] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1101461 (owner: Muehlenhoff)
[09:18:40] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2124.codfw.wmnet with OS bookworm
[09:18:45] (CR) Volans: [C:+2] style: a pass of black on all files [software/spicerack] - https://gerrit.wikimedia.org/r/1100772 (owner: Volans)
[09:21:16] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:21:36] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:21:57] (PS3) Elukey: charts: Add kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826)
[09:22:05] (PS2) Gergő Tisza: Fix protocol for .well-known/change-password Apache rule 
[puppet] - https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625)
[09:25:56] (PS4) Muehlenhoff: maps::postgresql_common: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1101461
[09:28:03] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1064.eqiad.wmnet with reason: host reimage
[09:28:41] (Merged) jenkins-bot: style: a pass of black on all files [software/spicerack] - https://gerrit.wikimedia.org/r/1100772 (owner: Volans)
[09:28:50] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1065.eqiad.wmnet with reason: host reimage
[09:29:53] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1066.eqiad.wmnet with reason: host reimage
[09:30:19] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1067.eqiad.wmnet with reason: host reimage
[09:31:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1064.eqiad.wmnet with reason: host reimage
[09:31:43] (CR) Muehlenhoff: [C:+2] Add a define to determine the postgresql version used for a Debian release [puppet] - https://gerrit.wikimedia.org/r/1101454 (owner: Muehlenhoff)
[09:34:05] (PS1) JMeybohm: move-vlan: Don't fail if there is nothing to do [cookbooks] - https://gerrit.wikimedia.org/r/1101464
[09:35:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1066.eqiad.wmnet with reason: host reimage
[09:35:27] (PS5) Anzx: idwikivoyage: add logo, wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1101459 (https://phabricator.wikimedia.org/T381080)
[09:36:52] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1101459 (https://phabricator.wikimedia.org/T381080) (owner: 10Anzx) [09:37:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101069 (https://phabricator.wikimedia.org/T381529) (owner: 10Máté Szabó) [09:37:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101070 (https://phabricator.wikimedia.org/T381530) (owner: 10Máté Szabó) [09:37:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100101 (owner: 10Máté Szabó) [09:38:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1067.eqiad.wmnet with reason: host reimage [09:38:24] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10389536 (10JMeybohm) [09:38:28] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2124.codfw.wmnet with reason: host reimage [09:39:05] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2091.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:39:12] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2074.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:39:18] !log jayme@cumin2002 END 
(PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2075.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:40:42] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2091.codfw.wmnet with OS bookworm [09:40:50] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2074.codfw.wmnet with OS bookworm [09:40:54] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2075.codfw.wmnet with OS bookworm [09:41:11] jouncebot: nowandnext [09:41:11] No deployments scheduled for the next 1 hour(s) and 18 minute(s) [09:41:11] In 1 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T1100) [09:42:19] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2124.codfw.wmnet with reason: host reimage [09:43:57] (03CR) 10Volans: move-vlan: Don't fail if there is nothing to do (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1101464 (owner: 10JMeybohm) [09:44:19] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:46:17] (03PS2) 10JMeybohm: move-vlan: Don't fail if there is nothing to do [cookbooks] - 10https://gerrit.wikimedia.org/r/1101464 [09:46:25] (03CR) 10Jelto: "two comments in-line" [cookbooks] - 10https://gerrit.wikimedia.org/r/1101464 (owner: 10JMeybohm) [09:46:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1065.eqiad.wmnet with reason: host reimage [09:47:21] (03CR) 10Elukey: 
sre.hosts.provision: add uefi only devices for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1101457 (https://phabricator.wikimedia.org/T378368) (owner: 10Elukey) [09:47:54] (03CR) 10Jelto: move-vlan: Don't fail if there is nothing to do (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1101464 (owner: 10JMeybohm) [09:48:41] (03CR) 10JMeybohm: move-vlan: Don't fail if there is nothing to do (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1101464 (owner: 10JMeybohm) [09:49:54] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1101457 (https://phabricator.wikimedia.org/T378368) (owner: 10Elukey) [09:49:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1064.eqiad.wmnet with OS bookworm [09:51:27] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1101464 (owner: 10JMeybohm) [09:52:06] (03CR) 10CI reject: [V:04-1] charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [09:53:11] (03CR) 10Jelto: [C:03+1] "lgtm now" [cookbooks] - 10https://gerrit.wikimedia.org/r/1101464 (owner: 10JMeybohm) [09:53:12] (03PS1) 10Muehlenhoff: maps/postgresql: Support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1101465 (https://phabricator.wikimedia.org/T381565) [09:54:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1066.eqiad.wmnet with OS bookworm [09:56:16] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [09:56:19] (03CR) 10JMeybohm: [C:03+2] move-vlan: Don't fail if there is nothing to do [cookbooks] - 10https://gerrit.wikimedia.org/r/1101464 (owner: 10JMeybohm) [09:56:23] (03CR) 10Filippo Giunchedi: [C:03+2] tests: validate deploy-tag values [alerts] - 
10https://gerrit.wikimedia.org/r/1101019 (owner: 10Filippo Giunchedi) [09:56:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1067.eqiad.wmnet with OS bookworm [09:59:40] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2074.codfw.wmnet with reason: host reimage [09:59:50] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2075.codfw.wmnet with reason: host reimage [10:00:06] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2091.codfw.wmnet with reason: host reimage [10:01:49] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 44, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:01:53] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10389656 (10JMeybohm) [10:02:04] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2124.codfw.wmnet with OS bookworm [10:02:47] (03Merged) 10jenkins-bot: move-vlan: Don't fail if there is nothing to do [cookbooks] - 10https://gerrit.wikimedia.org/r/1101464 (owner: 10JMeybohm) [10:03:26] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2074.codfw.wmnet with reason: host reimage [10:04:29] FIRING: [2x] SystemdUnitFailed: mediawiki_job_translationnotifications-metawiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:04:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1065.eqiad.wmnet with OS bookworm [10:06:23] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on wikikube-worker2075.codfw.wmnet with reason: host reimage [10:06:43] !log homer 'cr*eqiad*' commit 'T377876' [10:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:46] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [10:08:34] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10389669 (10MoritzMuehlenhoff) Looks fine, the server is running stable now and the error message disappeared from IPMI logs: ` 40 | 11/11/2024 | 05:35:48 PM UTC | Proces... [10:08:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10389670 (10MoritzMuehlenhoff) [10:08:42] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:09:26] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:09:32] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Swift [10:10:10] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2091.codfw.wmnet with reason: host reimage [10:10:16] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Swift [10:10:34] !log rebalance Ganeti cluster in codfw/A 
following server refresh T376594 [10:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:37] T376594: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594 [10:14:10] PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 23738MiB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops [10:15:14] mmhh I'll take a look at that ^ [10:20:03] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1064-1067].eqiad.wmnet [10:20:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1064-1067].eqiad.wmnet [10:21:17] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10389703 (10Jelto) [10:22:55] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10netops: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10389706 (10BTullis) This change looks fine to me, but would it be OK to wait until the New Year to implement it? I'm just a bit cautious... 
[10:23:06] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2074.codfw.wmnet with OS bookworm [10:23:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [10:24:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:25:31] 06SRE-OnFire, 10MW-on-K8s, 06serviceops, 13Patch-For-Review, 10Sustainability (Incident Followup): mwscript-k8s creates too many resources - https://phabricator.wikimedia.org/T376795#10389740 (10dcausse) The search platform team is working on migrating a set of tools from `mwscript` to `mwscript-k8s` (T3... [10:25:51] checking [10:25:56] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2075.codfw.wmnet with OS bookworm [10:28:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [10:29:09] (03PS1) 10Jelto: Rename kubernetes[1043-1046] to wikikube-worker[1068-1071] [puppet] - 10https://gerrit.wikimedia.org/r/1101473 (https://phabricator.wikimedia.org/T377876) [10:29:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:29:32] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:30:02] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2091.codfw.wmnet with OS bookworm [10:32:32] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes[1043-1046] to wikikube-worker[1068-1071] [puppet] - 10https://gerrit.wikimedia.org/r/1101473 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [10:32:58] (03PS1) 10Slyngshede: Prevent leak via window.opener [software/bitu] - 10https://gerrit.wikimedia.org/r/1101474 (https://phabricator.wikimedia.org/T381637) [10:33:18] 06SRE: The ops-maint-gcal.js script is missing support for some vendors - https://phabricator.wikimedia.org/T381680#10389761 (10Aklapper) [10:33:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100417 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [10:34:03] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1043-1046].eqiad.wmnet [10:34:10] RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops [10:35:46] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2074-2075,2091,2124].codfw.wmnet [10:35:49] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2074-2075,2091,2124].codfw.wmnet [10:36:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1043-1046].eqiad.wmnet [10:36:20] (03CR) 10Slyngshede: [C:03+2] Prevent leak via window.opener [software/bitu] - 10https://gerrit.wikimedia.org/r/1101474 
(https://phabricator.wikimedia.org/T381637) (owner: 10Slyngshede) [10:36:52] (03CR) 10Jelto: [C:03+2] Rename kubernetes[1043-1046] to wikikube-worker[1068-1071] [puppet] - 10https://gerrit.wikimedia.org/r/1101473 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [10:37:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1101474 (https://phabricator.wikimedia.org/T381637) (owner: 10Slyngshede) [10:37:57] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10389785 (10JMeybohm) [10:38:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101465 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:38:33] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wikikube-worker[2103-2106].codfw.wmnet with reason: reimage [10:38:37] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1043 to wikikube-worker1068 [10:38:53] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wikikube-worker[2103-2106].codfw.wmnet with reason: reimage [10:38:57] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:39:05] (03Merged) 10jenkins-bot: Prevent leak via window.opener [software/bitu] - 10https://gerrit.wikimedia.org/r/1101474 (https://phabricator.wikimedia.org/T381637) (owner: 10Slyngshede) [10:39:23] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2103-2106].codfw.wmnet [10:41:41] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2103-2106].codfw.wmnet [10:42:31] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1043 to wikikube-worker1068 - jelto@cumin1002" [10:42:52] !log 
jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1043 to wikikube-worker1068 - jelto@cumin1002" [10:42:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:42:52] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1068 [10:44:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1068 [10:44:21] (03PS2) 10Muehlenhoff: maps/postgresql: Support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1101465 (https://phabricator.wikimedia.org/T381565) [10:44:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1043 to wikikube-worker1068 [10:45:23] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1044 to wikikube-worker1069 [10:45:42] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:46:23] (03CR) 10Muehlenhoff: [C:03+2] apt::repository: Fix configuration of source-only repositories on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1100814 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [10:47:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101461 (owner: 10Muehlenhoff) [10:47:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101465 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:49:22] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1044 to wikikube-worker1069 - jelto@cumin1002" [10:49:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on kubernetes1045:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:49:51] (03PS4) 10Elukey: charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) [10:50:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1044 to wikikube-worker1069 - jelto@cumin1002" [10:50:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:50:01] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1069 [10:50:46] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:51:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1069 [10:51:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1044 to wikikube-worker1069 [10:52:10] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1045 to wikikube-worker1070 [10:52:30] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:54:12] (03PS1) 10FNegri: WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 [10:54:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es2024.codfw.wmnet with reason: cloning [10:54:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2024.codfw.wmnet with reason: cloning [10:55:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2024 to clone es2045', diff saved to https://phabricator.wikimedia.org/P71639 and previous config saved 
to /var/cache/conftool/dbconfig/20241209-105508-marostegui.json [10:55:24] ACKNOWLEDGEMENT - MD RAID on wikikube-worker2106 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 2, Failed: 0, Spare: 1 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T381765 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:55:33] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2106 - https://phabricator.wikimedia.org/T381765 (10ops-monitoring-bot) 03NEW [10:55:48] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 38, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:55:52] (03CR) 10CI reject: [V:04-1] WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 (owner: 10FNegri) [10:56:05] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1045 to wikikube-worker1070 - jelto@cumin1002" [10:56:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1045 to wikikube-worker1070 - jelto@cumin1002" [10:56:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:56:32] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1070 [10:56:38] (03PS5) 10Elukey: charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) [10:57:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1070 [10:58:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1045 to wikikube-worker1070 [10:58:38] (03PS1) 10Marostegui: mariadb: Move db1159 
to s5 [puppet] - 10https://gerrit.wikimedia.org/r/1101478 (https://phabricator.wikimedia.org/T381550) [10:59:03] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1046 to wikikube-worker1071 [10:59:05] (03PS2) 10FNegri: WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 [10:59:22] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:59:37] (03PS1) 10Filippo Giunchedi: sre: add multi-team to conntrack alert [alerts] - 10https://gerrit.wikimedia.org/r/1101480 [10:59:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1210 to clone db1159 T381550', diff saved to https://phabricator.wikimedia.org/P71640 and previous config saved to /var/cache/conftool/dbconfig/20241209-105941-marostegui.json [10:59:45] T381550: Move db1159 to s5 - https://phabricator.wikimedia.org/T381550 [10:59:53] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1159 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/1101478 (https://phabricator.wikimedia.org/T381550) (owner: 10Marostegui) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T1100) [11:00:20] (03CR) 10CI reject: [V:04-1] WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 (owner: 10FNegri) [11:00:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1159.eqiad.wmnet with reason: cloning [11:00:39] (03PS1) 10Aklapper: Phabricator: Add "video/webm" to files.viewable-mime-types [puppet] - 10https://gerrit.wikimedia.org/r/1101481 (https://phabricator.wikimedia.org/T309222) [11:00:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1159.eqiad.wmnet with reason: cloning [11:01:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: cloning [11:01:29] !log marostegui@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: cloning [11:01:44] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2103.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:01:51] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2104.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:01:57] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2105.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:02:03] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2106.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:03:01] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1046 to wikikube-worker1071 - jelto@cumin1002" [11:03:02] !log root@cumin1002 START - Cookbook sre.mysql.clone of db1210.eqiad.wmnet onto db1159.eqiad.wmnet [11:03:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1046 to wikikube-worker1071 - jelto@cumin1002" [11:03:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:03:22] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1071 [11:03:51] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: 
Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/ [11:03:51] monitoring%23BGP_status [11:04:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1071 [11:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1046 to wikikube-worker1071 [11:05:17] (03PS3) 10Muehlenhoff: maps/postgresql: Support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1101465 (https://phabricator.wikimedia.org/T381565) [11:05:18] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1068.eqiad.wmnet wikikube-worker1069.eqiad.wmnet wikikube-worker1070.eqiad.wmnet wikikube-worker1071.eqiad.wmnet on all recursors [11:05:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1068.eqiad.wmnet wikikube-worker1069.eqiad.wmnet wikikube-worker1070.eqiad.wmnet wikikube-worker1071.eqiad.wmnet on all recursors [11:05:41] (03PS1) 10Elukey: profile::k8s::deployment_server: add config for Kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/1101483 (https://phabricator.wikimedia.org/T216826) [11:08:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101465 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:08:24] (03PS3) 10FNegri: WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 [11:08:51] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 38, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:09:38] (03CR) 10CI reject: [V:04-1] WMCS: fix 
expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 (owner: 10FNegri) [11:11:19] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1068.eqiad.wmnet with OS bookworm [11:11:44] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1069.eqiad.wmnet with OS bookworm [11:12:11] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1070.eqiad.wmnet with OS bookworm [11:12:34] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1071.eqiad.wmnet with OS bookworm [11:15:46] (03PS1) 10Elukey: admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) [11:15:48] (03PS1) 10Elukey: services: add helmfile config for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) [11:21:09] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2105.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:21:14] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2106.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:21:17] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2103.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:21:20] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2104.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:23:21] 06SRE, 06Infrastructure-Foundations: ganeti105[34] implementation tracking - https://phabricator.wikimedia.org/T381581#10389915 (10MoritzMuehlenhoff) 
p:05Triage→03Medium [11:23:29] (03PS1) 10Btullis: Add hadoop/HTTP keytabs for labs hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/1101490 (https://phabricator.wikimedia.org/T381087) [11:24:15] (03CR) 10Btullis: [V:03+2 C:03+2] Add hadoop/HTTP keytabs for labs hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/1101490 (https://phabricator.wikimedia.org/T381087) (owner: 10Btullis) [11:24:34] (03CR) 10Zoe: [C:03+1] "I don't have +2 permissions here" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099658 (owner: 10PipelineBot) [11:25:34] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2103.codfw.wmnet with OS bookworm [11:25:45] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2103 [11:25:45] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2103 [11:25:45] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2104.codfw.wmnet with OS bookworm [11:25:55] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2104 [11:25:55] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2104 [11:26:41] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2105.codfw.wmnet with OS bookworm [11:26:51] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2105 [11:26:52] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2105 [11:26:57] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2106.codfw.wmnet with OS bookworm [11:27:08] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2106 [11:27:08] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2106 [11:27:28] !log jelto@cumin1002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host wikikube-worker1068.eqiad.wmnet with OS bookworm [11:28:08] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101494 [11:30:09] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1070.eqiad.wmnet with reason: host reimage [11:30:13] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1071.eqiad.wmnet with reason: host reimage [11:32:31] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1068.eqiad.wmnet with OS bookworm [11:33:34] (03CR) 10Slyngshede: [C:03+1] "Much much better than my approach." [alerts] - 10https://gerrit.wikimedia.org/r/1101480 (owner: 10Filippo Giunchedi) [11:33:48] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:prometheus::ops JMX collector for IDP hosts [puppet] - 10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [11:34:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1070.eqiad.wmnet with reason: host reimage [11:36:08] (03CR) 10Filippo Giunchedi: [C:03+2] sre: add multi-team to conntrack alert [alerts] - 10https://gerrit.wikimedia.org/r/1101480 (owner: 10Filippo Giunchedi) [11:37:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1071.eqiad.wmnet with reason: host reimage [11:40:34] (03PS4) 10Muehlenhoff: maps/postgresql: Support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1101465 (https://phabricator.wikimedia.org/T381565) [11:42:10] (03PS2) 10Hnowlan: mediawiki: add debug flag for mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101081 (https://phabricator.wikimedia.org/T371701) [11:42:55] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1210.eqiad.wmnet onto db1159.eqiad.wmnet [11:45:01] !log jayme@cumin2002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2103.codfw.wmnet with reason: host reimage [11:45:23] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2104.codfw.wmnet with reason: host reimage [11:45:48] (03CR) 10Giuseppe Lavagetto: [C:03+1] mediawiki: add debug flag for mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101081 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [11:45:56] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2105.codfw.wmnet with reason: host reimage [11:46:01] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2106.codfw.wmnet with reason: host reimage [11:48:06] (03CR) 10Hnowlan: [C:03+2] mediawiki: add debug flag for mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101081 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [11:48:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101465 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:48:39] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1101483 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [11:48:41] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2103.codfw.wmnet with reason: host reimage [11:48:56] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:49:47] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 
on wikikube-worker1068.eqiad.wmnet with reason: host reimage [11:50:30] (03Merged) 10jenkins-bot: mediawiki: add debug flag for mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101081 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [11:51:52] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2104.codfw.wmnet with reason: host reimage [11:52:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1070.eqiad.wmnet with OS bookworm [11:55:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1068.eqiad.wmnet with reason: host reimage [11:55:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1071.eqiad.wmnet with OS bookworm [12:00:15] (03PS1) 10Muehlenhoff: netbox::db: Use new helper function [puppet] - 10https://gerrit.wikimedia.org/r/1101497 [12:02:10] (03PS1) 10Hnowlan: mediawiki: fix mercurius argument order [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101498 (https://phabricator.wikimedia.org/T371701) [12:02:30] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2105.codfw.wmnet with reason: host reimage [12:04:31] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1069.eqiad.wmnet with OS bookworm [12:05:15] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1069.eqiad.wmnet with OS bookworm [12:05:25] (03PS4) 10FNegri: WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 [12:06:40] (03CR) 10CI reject: [V:04-1] WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 (owner: 10FNegri) [12:06:58] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
wikikube-worker2106.codfw.wmnet with reason: host reimage [12:07:06] !log installing reportbug bugfix updates [12:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:23] (03CR) 10Giuseppe Lavagetto: [C:03+1] mediawiki: fix mercurius argument order [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101498 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:08:48] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2103.codfw.wmnet with OS bookworm [12:08:59] (03CR) 10Hnowlan: [C:03+2] mediawiki: fix mercurius argument order [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101498 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:10:51] (03Merged) 10jenkins-bot: mediawiki: fix mercurius argument order [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101498 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:12:00] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2104.codfw.wmnet with OS bookworm [12:12:04] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10390060 (10MoritzMuehlenhoff) [12:13:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1068.eqiad.wmnet with OS bookworm [12:15:34] (03PS5) 10David Caro: WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 (owner: 10FNegri) [12:16:36] (03PS6) 10David Caro: WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 (owner: 10FNegri) [12:19:04] (03CR) 10FNegri: [C:03+1] WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 (owner: 10FNegri) [12:19:12] (03CR) 10David Caro: [C:03+2] WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 (owner: 
10FNegri) [12:21:12] (03Merged) 10jenkins-bot: WMCS: fix expected number of active nodes [alerts] - 10https://gerrit.wikimedia.org/r/1101477 (owner: 10FNegri) [12:22:48] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2105.codfw.wmnet with OS bookworm [12:26:48] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2106.codfw.wmnet with OS bookworm [12:27:00] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 38, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:30:06] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 0MiB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [12:50:06] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [12:52:06] PROBLEM - MariaDB Replica SQL: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: nlwiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:57:33] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1101497 (owner: 10Muehlenhoff) [12:57:44] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1069.eqiad.wmnet with OS bookworm [12:59:36] PROBLEM - MariaDB Replica Lag: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:07:08] (03PS1) 10Ilias Sarantopoulos: amd-pytorch25: add torch 2.5.1 + ROCm 6.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101524 [13:07:22] (03CR) 10Ilias Sarantopoulos: "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101524 (owner: 10Ilias Sarantopoulos) [13:07:28] (03CR) 10Jforrester: [C:03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100217 (https://phabricator.wikimedia.org/T33951) (owner: 10Tim Starling) [13:09:24] (03CR) 10Ilias Sarantopoulos: "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101524 (owner: 10Ilias Sarantopoulos) [13:12:40] (03PS1) 10Filippo Giunchedi: prometheus: fix jmx_idp config [puppet] - 10https://gerrit.wikimedia.org/r/1101525 [13:13:10]  [13:15:38] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Comm Error: backplane 0 when reimaging wikikube-worker1069 - https://phabricator.wikimedia.org/T381770 (10Jelto) 03NEW [13:16:08] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: fix jmx_idp config [puppet] - 10https://gerrit.wikimedia.org/r/1101525 (owner: 10Filippo Giunchedi) [13:16:19] !log homer 'cr*eqiad*' commit 'T377876' [13:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:23] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [13:24:08] (03PS5) 10Anzx: jawiki: lift 
IP cap on 2024-12-17 and 2025-01-14 for Edit-a-ton [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101231 (https://phabricator.wikimedia.org/T381729) [13:26:11] 06SRE, 06Traffic: Occasional saturation of asw2-b-eqiad / cr port uplink and cache upload usage - https://phabricator.wikimedia.org/T381771 (10fgiunchedi) 03NEW [13:28:07] jouncebot: now [13:28:07] No deployments scheduled for the next 0 hour(s) and 31 minute(s) [13:30:37] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1068,1070-1071].eqiad.wmnet [13:30:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1068,1070-1071].eqiad.wmnet [13:32:39] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10390178 (10Jelto) [13:35:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Comm Error: backplane 0 when reimaging wikikube-worker1069 - https://phabricator.wikimedia.org/T381770#10390181 (10Jelto) The following commands have to be executed when the host is back (just noting it down so I don't forget it): ` cookbook s... 
[13:41:42] (03PS1) 10Jelto: Rename kubernetes[1047-1050] to wikikube-worker[1072-1075] [puppet] - 10https://gerrit.wikimedia.org/r/1101526 (https://phabricator.wikimedia.org/T377876) [13:42:14] I’ll run a maintenance script in a moment if nobody objects [13:42:40] (03PS1) 10Stevemunene: Enable airflow-analytics-test access to mx server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101527 (https://phabricator.wikimedia.org/T377926) [13:45:47] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes[1047-1050] to wikikube-worker[1072-1075] [puppet] - 10https://gerrit.wikimedia.org/r/1101526 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [13:46:10] (03PS1) 10Btullis: Add a truststore password for the hadoopcluster in labs [labs/private] - 10https://gerrit.wikimedia.org/r/1101528 (https://phabricator.wikimedia.org/T381087) [13:46:18] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2103-2106].codfw.wmnet [13:46:21] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2103-2106].codfw.wmnet [13:46:34] (03CR) 10Btullis: [V:03+2 C:03+2] Add a truststore password for the hadoopcluster in labs [labs/private] - 10https://gerrit.wikimedia.org/r/1101528 (https://phabricator.wikimedia.org/T381087) (owner: 10Btullis) [13:46:43] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: add uefi only devices for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1101457 (https://phabricator.wikimedia.org/T378368) (owner: 10Elukey) [13:47:03] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10390202 (10JMeybohm) [13:47:23] about to start PropertySuggester UpdateTable.php for wikidatawiki on deploy2002 [13:47:29] (will log when it’s done) [13:48:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 2 others: Q2:rack/setup/install cloudelastic101[12] - 
https://phabricator.wikimedia.org/T378368#10390203 (10elukey) @Jclark-ctr @bking I updated the provision cookbook to support this case, but the TL;DR is that we may need to use UEFI to avoid weird co... [13:49:00] (03CR) 10Elukey: [C:03+1] maps/postgresql: Support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1101465 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:49:23] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1047-1050].eqiad.wmnet [13:54:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1047-1050].eqiad.wmnet [13:55:52] (03CR) 10Jelto: [C:03+2] Rename kubernetes[1047-1050] to wikikube-worker[1072-1075] [puppet] - 10https://gerrit.wikimedia.org/r/1101526 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [13:57:53] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1047 to wikikube-worker1072 [13:58:13] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T1400). [14:00:05] wangombe_g, joelyrookewmde, abijeet, anzx, and mszabo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:13] * anzx 👋 [14:00:44] !log 'Updated the Wikidata property suggester with data from 20241125’s JSON dump: mwscript-k8s --attach -- extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file php://stdin < wbs_propertypairs.csv # T377986, T376604' [14:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:49] T377986: Q4 2024 update of Property Suggester data - https://phabricator.wikimedia.org/T377986 [14:00:50] T376604: [PS] Update PropertySuggester update process for mwscript-k8s - https://phabricator.wikimedia.org/T376604 [14:01:08] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:07] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1047 to wikikube-worker1072 - jelto@cumin1002" [14:02:12] I can deploy, I think [14:02:52] ✋🏽 [14:03:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1047 to wikikube-worker1072 - jelto@cumin1002" [14:03:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:03:21] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1072 [14:03:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1072 [14:04:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from 
kubernetes1047 to wikikube-worker1072 [14:04:19] let’s start with wangombe_g then :) [14:04:34] (03PS1) 10Btullis: Add hadoop keystore_keypassword [labs/private] - 10https://gerrit.wikimedia.org/r/1101530 (https://phabricator.wikimedia.org/T381087) [14:05:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [14:05:24] (03PS2) 10Btullis: Add hadoop keystore_keypassword [labs/private] - 10https://gerrit.wikimedia.org/r/1101530 (https://phabricator.wikimedia.org/T381087) [14:05:30] testing [14:05:40] (03CR) 10Btullis: [V:03+2 C:03+2] Add hadoop keystore_keypassword [labs/private] - 10https://gerrit.wikimedia.org/r/1101530 (https://phabricator.wikimedia.org/T381087) (owner: 10Btullis) [14:05:46] way too early for testing [14:05:47] (03Merged) 10jenkins-bot: Add Metrics Platform stream configuration for translate_extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [14:05:59] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1048 to wikikube-worker1073 [14:06:10] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1097499|Add Metrics Platform stream configuration for translate_extension (T364460)]] [14:06:14] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460 [14:06:20] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:08:17] changes look good on my end. [14:08:36] on testwiki, that is... 
[14:08:37] that’s strange, because they have not yet been fully deployed to the test hosts [14:08:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes1050:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1050 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:08:53] Oh? [14:09:01] when did you start testing? [14:09:12] they started out rolling to test servers at 14:07:21 UTC according to scap [14:09:27] so if it was after that, then I guess it’s possible that you coincidentally hit a server that already had the change [14:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:30] It's a config change. not feature. So I'm looking for errors, warning... 
[14:09:30] but you’re not supposed to test yet :) [14:09:47] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1048 to wikikube-worker1073 - jelto@cumin1002" [14:10:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1048 to wikikube-worker1073 - jelto@cumin1002" [14:10:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:10:03] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1073 [14:10:10] (that said, I’m not sure why sync-testservers-k8s is taking almost three minutes already o_O currently at 83%) [14:10:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1073 [14:11:08] hello [14:11:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1048 to wikikube-worker1073 [14:11:19] Makes sense why I didn't find any 😄 [14:11:38] * Lucas_WMDE waves at abijeet [14:11:45] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, wangombe: Backport for [[gerrit:1097499|Add Metrics Platform stream configuration for translate_extension (T364460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:11:48] o/ [14:11:49] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460 [14:11:56] wangombe_g: now you can test ^^ [14:12:00] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1049 to wikikube-worker1074 [14:12:00] (with WikimediaDebug) [14:12:06] 👍 [14:12:08] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:14:40] (03PS1) 10Btullis: Remove hadoop_clusters_secrets for labs from common.yaml [labs/private] - 
10https://gerrit.wikimedia.org/r/1101532 (https://phabricator.wikimedia.org/T381087) [14:15:50] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1049 to wikikube-worker1074 - jelto@cumin1002" [14:16:07] wangombe_g: are you still testing? [14:16:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1049 to wikikube-worker1074 - jelto@cumin1002" [14:16:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:16:11] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1074 [14:16:20] (just want to make sure there’s no misunderstanding between us and we’re both waiting for each other ^^) [14:16:21] Done. It's good. [14:16:24] ok, thanks! [14:16:26] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, wangombe: Continuing with sync [14:17:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1074 [14:17:15] I don’t see joelyrookewmde yet so I guess abijeet’s config change will be next once the current change is done [14:17:43] o/ sorry, i'm around just forgot to post a notice [14:17:48] hi! 
[14:17:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1049 to wikikube-worker1074 [14:18:00] I’m not sure we’ll have enough time for your backports though, it’s a full window :/ [14:18:11] Lucas_WMDE, ok [14:18:14] no problem, I can self-service later in that case [14:18:52] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1050 to wikikube-worker1075 [14:19:13] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:20:10] (03CR) 10Btullis: [V:03+2 C:03+2] Remove hadoop_clusters_secrets for labs from common.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1101532 (https://phabricator.wikimedia.org/T381087) (owner: 10Btullis) [14:23:07] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1050 to wikikube-worker1075 - jelto@cumin1002" [14:23:23] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097499|Add Metrics Platform stream configuration for translate_extension (T364460)]] (duration: 17m 12s) [14:23:26] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460 [14:23:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101008 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:24:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1050 to wikikube-worker1075 - jelto@cumin1002" [14:24:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:24:31] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1075 [14:24:35] (03Merged) 10jenkins-bot: Translate: Enable 
message group subscription for 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101008 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:24:51] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1101008|Translate: Enable message group subscription for 6 wikis (T372386)]] [14:24:54] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:24:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1075 [14:25:22] (03CR) 10Muehlenhoff: [C:03+2] maps/postgresql: Support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1101465 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:25:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1050 to wikikube-worker1075 [14:25:51] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1072.eqiad.wmnet wikikube-worker1073.eqiad.wmnet wikikube-worker1074.eqiad.wmnet wikikube-worker1075.eqiad.wmnet on all recursors [14:25:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1072.eqiad.wmnet wikikube-worker1073.eqiad.wmnet wikikube-worker1074.eqiad.wmnet wikikube-worker1075.eqiad.wmnet on all recursors [14:29:39] !log lucaswerkmeister-wmde@deploy2002 abi, lucaswerkmeister-wmde: Backport for [[gerrit:1101008|Translate: Enable message group subscription for 6 wikis (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:29:47] abijeet: please test :) [14:29:51] Lucas_WMDE, on it [14:33:13] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2089-2090].codfw.wmnet [14:33:36] Lucas_WMDE, looks OK [14:33:40] !log lucaswerkmeister-wmde@deploy2002 abi, lucaswerkmeister-wmde: Continuing with sync [14:33:42] ok! 
[14:34:24] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2089-2090].codfw.wmnet [14:34:47] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wikikube-worker[2089-2090].codfw.wmnet with reason: reimage [14:34:53] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wikikube-worker[2089-2090].codfw.wmnet with reason: reimage [14:35:04] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1072.eqiad.wmnet with OS bookworm [14:35:22] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1073.eqiad.wmnet with OS bookworm [14:35:26] (03PS1) 10Btullis: Add HTTP keytabs to hadoop masters in labs [labs/private] - 10https://gerrit.wikimedia.org/r/1101534 (https://phabricator.wikimedia.org/T381087) [14:35:42] (03CR) 10Btullis: [V:03+2 C:03+2] Add HTTP keytabs to hadoop masters in labs [labs/private] - 10https://gerrit.wikimedia.org/r/1101534 (https://phabricator.wikimedia.org/T381087) (owner: 10Btullis) [14:35:42] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1074.eqiad.wmnet with OS bookworm [14:36:00] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1075.eqiad.wmnet with OS bookworm [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:37:28] checking ^ [14:38:20] looks like the spike has already passed [14:38:22] !incidents [14:38:23] 5530 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:38:23] 5526 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [14:38:24] 5525 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [14:38:24] 5524 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet) [14:38:24] 5523 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [14:38:32] !ack 5530 [14:38:32] 5530 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:38:42] indeed [14:38:44] * Lucas_WMDE currently has a scap running ftr [14:39:16] PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:39:25] ack thank you Lucas_WMDE [14:39:25] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101008|Translate: Enable message group subscription for 6 wikis (T372386)]] (duration: 14m 34s) [14:39:29] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:39:33] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10390302 (10JMeybohm) [14:39:45] anzx: still there? (just checking ^^) [14:39:54] Lucas_WMDE: yes around [14:39:55] godog: can you let me know when it’s okay to continue deploying? 
(holding off for now) [14:39:59] godog: on #wikimedia-traffic `FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX` [14:40:04] could be related ? [14:40:17] anzx: okay! currently pausing deployments due to the above incidents [14:40:28] * Lucas_WMDE reviews the gerrit change in the meantime [14:40:31] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10390304 (10JMeybohm) 05In progress→03Resolved a:03JMeybohm Well, that was a pretty painful experience - thanks @Clement_Goubert for working out the p... [14:40:46] Lucas_WMDE: will do [14:40:53] fabfur: not sure yet tbh [14:42:20] (03PS1) 10Btullis: Revert "Remove hadoop_clusters_secrets for labs from common.yaml" [labs/private] - 10https://gerrit.wikimedia.org/r/1101535 [14:42:27] (03CR) 10Btullis: [V:03+2 C:03+2] Revert "Remove hadoop_clusters_secrets for labs from common.yaml" [labs/private] - 10https://gerrit.wikimedia.org/r/1101535 (owner: 10Btullis) [14:42:55] (03CR) 10Brouberol: Enable airflow-analytics-test access to mx server (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101527 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [14:43:16] RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:52] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2090.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:44:59] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2089.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:45:05] 
Lucas_WMDE: still looking if we're ok to proceed with the deployments btw [14:45:26] ack [14:46:08] Lucas_WMDE: I think we're okay, please go ahead [14:46:17] ok, thanks! [14:46:30] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1025.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:46:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101459 (https://phabricator.wikimedia.org/T381080) (owner: 10Anzx) [14:46:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:47:16] PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:47:18] (03Merged) 10jenkins-bot: idwikivoyage: add logo, wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101459 (https://phabricator.wikimedia.org/T381080) (owner: 10Anzx) [14:47:35] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1101459|idwikivoyage: add logo, wordmark (T381080)]] [14:47:38] T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080 [14:51:10] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10390324 (10Jhancock.wm) 05Open→03Resolved good to know. 
closing ticket and sending back the part. [14:51:57] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, anzx: Backport for [[gerrit:1101459|idwikivoyage: add logo, wordmark (T381080)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:52:06] anzx: please test :) [14:52:16] RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:52:17] Lucas_WMDE: checking [14:52:17] ok [14:52:19] (03CR) 10JHathaway: [C:03+2] hadoop: sort local-dirs [puppet] - 10https://gerrit.wikimedia.org/r/1101093 (https://phabricator.wikimedia.org/T381538) (owner: 10JHathaway) [14:53:06] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1072.eqiad.wmnet with reason: host reimage [14:53:22] Lucas_WMDE: looks good, both skin logos [14:53:27] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, anzx: Continuing with sync [14:53:32] thanks! [14:53:41] (03PS1) 10Hnowlan: php8.1: rebuild to pick up new mercurius images. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101536 (https://phabricator.wikimedia.org/T371701) [14:53:54] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1074.eqiad.wmnet with reason: host reimage [14:54:06] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1075.eqiad.wmnet with reason: host reimage [14:55:24] (03CR) 10Lucas Werkmeister (WMDE): jawiki: lift IP cap on 2024-12-17 and 2025-01-14 for Edit-a-ton (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101231 (https://phabricator.wikimedia.org/T381729) (owner: 10Anzx) [14:56:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1072.eqiad.wmnet with reason: host reimage [14:56:58] (03CR) 10JMeybohm: [C:03+1] php8.1: rebuild to pick up new mercurius images. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101536 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:58:04] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2089.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:58:31] (03CR) 10Hnowlan: [V:03+2 C:03+2] php8.1: rebuild to pick up new mercurius images. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101536 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [14:58:51] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2089.codfw.wmnet with OS bookworm [14:59:02] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2089 [14:59:02] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2089 [14:59:19] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101459|idwikivoyage: add logo, wordmark (T381080)]] (duration: 11m 44s) [14:59:23] T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080 [14:59:33] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2090.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:00:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1025.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:00:08] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2090.codfw.wmnet with OS bookworm [15:00:18] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2090 [15:00:19] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2090 [15:00:20] !log jelto@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1074.eqiad.wmnet with reason: host reimage [15:00:59] (03PS6) 10Anzx: jawiki: lift IP cap on 2024-12-17 and 2025-01-14 for Edit-a-ton [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101231 (https://phabricator.wikimedia.org/T381729) [15:01:46] !log UTC afternoon backport+config window done [15:01:46] 06SRE, 06Traffic: Occasional saturation of asw2-b-eqiad / cr port uplink and cache upload usage - https://phabricator.wikimedia.org/T381771#10390368 (10Fabfur) Contacted WME SRE that kindly agreed to lower current requests parallelism and check for results [15:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:53] I don’t have time to continue deploying, sorry [15:02:06] anzx: please reschedule the remaining changes at your convenience [15:02:17] PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:02:25] Lucas_WMDE: ok, thanks [15:02:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101231 (https://phabricator.wikimedia.org/T381729) (owner: 10Anzx) [15:02:44] jouncebot: now [15:02:45] No deployments scheduled for the next 1 hour(s) and 27 minute(s) [15:02:45] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM – previous deployments suggest no special steps are needed for changing these settings – but ran out of time to deploy today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101185 (https://phabricator.wikimedia.org/T381080) (owner: 10Anzx) [15:04:02] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1075.eqiad.wmnet with reason: host reimage [15:04:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101185 (https://phabricator.wikimedia.org/T381080) (owner: 10Anzx) [15:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:05:21] (03CR) 10Anzx: jawiki: lift IP cap on 2024-12-17 and 2025-01-14 for Edit-a-ton (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101231 (https://phabricator.wikimedia.org/T381729) (owner: 10Anzx) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101069 (https://phabricator.wikimedia.org/T381529) (owner: 10Máté Szabó) [15:08:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101070 (https://phabricator.wikimedia.org/T381530) (owner: 10Máté Szabó) [15:08:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100101 (owner: 10Máté Szabó) [15:08:57] (03Merged) 10jenkins-bot: Prep IRS config for testwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1100101 (owner: 10Máté Szabó) [15:15:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1072.eqiad.wmnet with OS bookworm [15:18:45] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2089.codfw.wmnet with reason: host reimage [15:18:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1074.eqiad.wmnet with OS bookworm [15:20:11] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2090.codfw.wmnet with reason: host reimage [15:20:13] (03Merged) 10jenkins-bot: dialog: Fix wrong title on Types of unacceptable behavior step [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101069 (https://phabricator.wikimedia.org/T381529) (owner: 10Máté Szabó) [15:20:14] (03Merged) 10jenkins-bot: dialog: Fix spacing between buttons in the dialog footer [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101070 (https://phabricator.wikimedia.org/T381530) (owner: 10Máté Szabó) [15:20:36] !log mszabo@deploy2002 Started scap sync-world: Backport for [[gerrit:1101069|dialog: Fix wrong title on Types of unacceptable behavior step (T381529)]], [[gerrit:1101070|dialog: Fix spacing between buttons in the dialog footer (T381530)]], [[gerrit:1100101|Prep IRS config for testwiki]] [15:20:41] T381529: Wrong title on Types of unacceptable behavior step - https://phabricator.wikimedia.org/T381529 [15:20:41] T381530: Missing spacing between buttons - https://phabricator.wikimedia.org/T381530 [15:21:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1075.eqiad.wmnet with OS bookworm [15:22:11] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2089.codfw.wmnet with reason: host reimage [15:24:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU 
sensor over limit - https://phabricator.wikimedia.org/T381720#10390469 (10phaultfinder) [15:25:06] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2090.codfw.wmnet with reason: host reimage [15:25:06] !log mszabo@deploy2002 mszabo: Backport for [[gerrit:1101069|dialog: Fix wrong title on Types of unacceptable behavior step (T381529)]], [[gerrit:1101070|dialog: Fix spacing between buttons in the dialog footer (T381530)]], [[gerrit:1100101|Prep IRS config for testwiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:28:49] !log mszabo@deploy2002 mszabo: Continuing with sync [15:28:59] jouncebot: nowandnext [15:28:59] No deployments scheduled for the next 1 hour(s) and 1 minute(s) [15:28:59] In 1 hour(s) and 1 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T1630) [15:31:09] IRS will need a followup config change (these changes are good as they are but do not actually enable the extension on testwiki...) [15:32:22] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2091 - https://phabricator.wikimedia.org/T381747#10390489 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm T 358489 - probably an false error from this. server is fine right now. logged into idrac and all disks are active. 
[15:33:39] !log depool/restart swift/repool ms-fe1010 [15:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:57] (03PS1) 10Máté Szabó: Actually load IRS in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101541 [15:34:15] !log mszabo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101069|dialog: Fix wrong title on Types of unacceptable behavior step (T381529)]], [[gerrit:1101070|dialog: Fix spacing between buttons in the dialog footer (T381530)]], [[gerrit:1100101|Prep IRS config for testwiki]] (duration: 13m 39s) [15:34:20] T381529: Wrong title on Types of unacceptable behavior step - https://phabricator.wikimedia.org/T381529 [15:34:21] T381530: Missing spacing between buttons - https://phabricator.wikimedia.org/T381530 [15:34:37] !log depool/restart swift/repool ms-fe1012 [15:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:07] (03PS2) 10Slyngshede: Updated notification handling [software/bitu] - 10https://gerrit.wikimedia.org/r/1100388 (https://phabricator.wikimedia.org/T381075) [15:38:58] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti test cluster to UEFI - https://phabricator.wikimedia.org/T381780 (10MoritzMuehlenhoff) 03NEW [15:39:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti test cluster to UEFI - https://phabricator.wikimedia.org/T381780#10390513 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:41:47] I'm going to do a scap sync to rebuild images to pick up new php 8.1 base images [15:43:01] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2089.codfw.wmnet with OS bookworm [15:44:25] RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:44:31] !log hnowlan@deploy2002 Started scap sync-world: Rebuild and deploy to pick up new php8.1 base [15:45:08] !log jayme@cumin2002 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2090.codfw.wmnet with OS bookworm [15:55:08] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479#10390547 (10Jhancock.wm) @Andrew we wanna swap the power supplies. It looks like all three happened on PSU2. We need to shut it off to s... [15:55:27] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1073.eqiad.wmnet with OS bookworm [15:56:06] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1073.eqiad.wmnet with OS bookworm [16:00:20] (03PS1) 10Samtar: IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101545 (https://phabricator.wikimedia.org/T377121) [16:04:27] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1101547 [16:05:10] !log hnowlan@deploy2002 Finished scap sync-world: Rebuild and deploy to pick up new php8.1 base (duration: 23m 00s) [16:05:29] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1101547 [16:05:31] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101547 (owner: 10CDanis) [16:06:25] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2089-2090].codfw.wmnet [16:06:27] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2089-2090].codfw.wmnet [16:07:19] (03CR) 10Kosta Harlan: [C:03+1] Actually load IRS in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101541 (owner: 10Máté Szabó) [16:07:55] (03CR) 10CI reject: [V:04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/1101547 (owner: 10CDanis) [16:12:33] !log rebalance Ganeti cluster in codfw/B following server refresh T376594 [16:12:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:37] T376594: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594 [16:14:29] (03PS3) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1101547 [16:15:50] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101547 (owner: 10CDanis) [16:16:53] (03CR) 10CI reject: [V:04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/1101547 (owner: 10CDanis) [16:17:33] (03PS4) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1101547 (https://phabricator.wikimedia.org/T381771) [16:17:39] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10390615 (10Dzahn) We just installed package version 3.3.8-2~deb12u2 and this should be fixed now. Please let us know how it looks. [16:18:00] (03PS5) 10CDanis: Skip cache on WME upload.wm.o HEAD reqs [puppet] - 10https://gerrit.wikimedia.org/r/1101547 (https://phabricator.wikimedia.org/T381771) [16:18:25] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101547 (https://phabricator.wikimedia.org/T381771) (owner: 10CDanis) [16:22:26] jouncebot: nowandnext [16:22:26] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [16:22:26] In 0 hour(s) and 7 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T1630) [16:24:12] I'm going to do another sync-world to rebuild the 8.1 images to pick something up that was missed last time [16:26:37] !log hnowlan@deploy2002 Started scap sync-world: Rebuild and deploy to pick up new php8.1 base [16:28:00] (03CR) 10Fabfur: [C:03+1] "I would say it's ok, I'd prefer someone else have a look anyway" [puppet] - 10https://gerrit.wikimedia.org/r/1101547 (https://phabricator.wikimedia.org/T381771) (owner: 10CDanis) [16:29:29] (03PS6) 10CDanis: Skip cache 
on WME upload.wm.o HEAD reqs [puppet] - 10https://gerrit.wikimedia.org/r/1101547 (https://phabricator.wikimedia.org/T381771) [16:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T1630). [16:34:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381720#10390671 (10phaultfinder) [16:39:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - cxserver_4002: Servers kubernetes2056.codfw.wmnet, wikikube-worker2063.codfw.wmnet, mw2338.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2155.codfw.wmnet, wikikube-worker2076.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube-worker2139.codfw.wmnet, wikikube-worker2058.codfw.wmne [16:39:25] ube-worker2065.codfw.wmnet, wikikube-worker2055.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube-worker2062.codfw.wmnet, wikikube-worker2045.codfw.wmnet, kubernetes2022.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2014.codfw.wmnet, wikikube-worker2156.codfw.wmnet, wikikube-worker2133.codfw.wmnet, wikikube-worker2127.codfw.wmnet, wikikube-worker2087.codfw.wmnet, wikikube-worker2013.codfw.wmnet, wikikube-worker2106.codfw.wmnet, mw2372.c [16:39:25] et, wikikube-worker2104.codfw.wmnet, wikikube-worker2146.codfw.wmnet, wikikube-worker2035.codfw.wmnet, wikikube-worker2024.codfw.wmnet, kubernetes2017.codfw.wmnet, wikikube-worker2112.c https://wikitech.wikimedia.org/wiki/PyBal [16:40:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:40:29] (03CR) 10BBlack: [C:03+1] "SGTM! Nice catch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1101547 (https://phabricator.wikimedia.org/T381771) (owner: 10CDanis) [16:41:01] (03CR) 10CDanis: [C:03+2] Skip cache on WME upload.wm.o HEAD reqs [puppet] - 10https://gerrit.wikimedia.org/r/1101547 (https://phabricator.wikimedia.org/T381771) (owner: 10CDanis) [16:43:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101497 (owner: 10Muehlenhoff) [16:45:04] (03CR) 10Klausman: [C:03+1] amd-pytorch25: add torch 2.5.1 + ROCm 6.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101524 (owner: 10Ilias Sarantopoulos) [16:46:58] 06SRE, 06Traffic, 13Patch-For-Review: Occasional saturation of asw2-b-eqiad / cr port uplink and cache upload usage - https://phabricator.wikimedia.org/T381771#10390692 (10Fabfur) Adding a comment to not forget: - Investigate why (if) Varnish performs GET for each HEAD request, and if this is the rationale... [16:47:08] !log hnowlan@deploy2002 Finished scap sync-world: Rebuild and deploy to pick up new php8.1 base (duration: 21m 09s) [16:48:08] (03PS1) 10Herron: thanos: add bool_gauge recording rules for search/wdqs update lag slos [puppet] - 10https://gerrit.wikimedia.org/r/1101558 (https://phabricator.wikimedia.org/T302995) [16:57:22] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2106 - https://phabricator.wikimedia.org/T381765#10390723 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm T 358489 - probably an false error from this. server is fine right now. logged into idrac and all disks are active. 
[16:57:36] (03PS1) 10Herron: pyrra: onboard wdqs/serach update lag slos [puppet] - 10https://gerrit.wikimedia.org/r/1101560 (https://phabricator.wikimedia.org/T302995) [16:58:16] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [16:58:57] (03PS1) 10CDanis: Skip cache on all WME upload.wm.o reqs [puppet] - 10https://gerrit.wikimedia.org/r/1101561 (https://phabricator.wikimedia.org/T381771) [16:59:10] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:09:03] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:09:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:11:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53071 bytes in 2.672 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:11:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:12:04] (03CR) 10BBlack: [C:03+1] Skip cache on all WME upload.wm.o reqs [puppet] - 10https://gerrit.wikimedia.org/r/1101561 (https://phabricator.wikimedia.org/T381771) (owner: 10CDanis) [17:14:52] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [17:15:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10390785 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye [17:15:15] !log 💙cdanis@cumin1002.eqiad.wmnet ~ 🕛☕ sudo cumin 'A:cp' 'disable-puppet "cdanis testing in production I464702d8fb T381771"' 
[17:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:18] T381771: Occasional saturation of asw2-b-eqiad / cr port uplink and cache upload usage - https://phabricator.wikimedia.org/T381771 [17:15:47] (03CR) 10CDanis: [C:03+2] Skip cache on all WME upload.wm.o reqs [puppet] - 10https://gerrit.wikimedia.org/r/1101561 (https://phabricator.wikimedia.org/T381771) (owner: 10CDanis) [17:16:21] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1073.eqiad.wmnet with OS bookworm [17:18:26] !log T381771 💙cdanis@cp1107.eqiad.wmnet ~ 🕧☕ sudo run-puppet-agent --force [17:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:43] 10ops-codfw, 06DC-Ops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788 (10Jhancock.wm) 03NEW [17:18:59] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Comm Error: backplane 0 when reimaging wikikube-worker1073 - https://phabricator.wikimedia.org/T381789 (10Jelto) 03NEW [17:19:42] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Comm Error: backplane 0 when reimaging wikikube-worker1073 - https://phabricator.wikimedia.org/T381789#10390817 (10Jelto) The following commands have to be executed when the host is back (just noting it down so I don't forget it): ` c... [17:19:44] 10ops-codfw, 06DC-Ops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10390819 (10Jhancock.wm) @bking I believe this is part of your team. But please correct me if I'm wrong. Is it possible to move this server? the down time would only be for a few minutes. If yes, when wo... 
[17:20:11] !log homer 'lsw1-e3-eqiad*' commit 'T377876' [17:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:15] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [17:22:31] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1072,1074-1075].eqiad.wmnet [17:22:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1072,1074-1075].eqiad.wmnet [17:23:14] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10390832 (10Jelto) [17:24:57] (03PS7) 10Hnowlan: mediawiki: add multi-job support to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) [17:25:47] (03CR) 10Fabfur: [C:03+1] Skip cache on all WME upload.wm.o reqs [puppet] - 10https://gerrit.wikimedia.org/r/1101561 (https://phabricator.wikimedia.org/T381771) (owner: 10CDanis) [17:25:56] (03CR) 10CI reject: [V:04-1] mediawiki: add multi-job support to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:26:45] FIRING: KubernetesDeploymentUnavailableReplicas: ... [17:26:46] Deployment cxserver-production in cxserver at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=cxserver&var-deployment=cxserver-production - ... 
[17:26:46] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [17:30:42] (03CR) 10Klausman: [V:03+2 C:03+2] amd-pytorch25: add torch 2.5.1 + ROCm 6.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101524 (owner: 10Ilias Sarantopoulos) [17:35:31] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:36:05] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:38:09] (03CR) 10Dzahn: [C:03+2] phabricator: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:41:45] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [17:41:46] Deployment cxserver-production in cxserver at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=cxserver&var-deployment=cxserver-production - ... 
[17:41:46] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [17:43:57] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1025.eqiad.wmnet with reason: host reimage [17:44:13] !log 💙cdanis@cumin1002.eqiad.wmnet ~ 🕧☕ sudo cumin 'A:cp' 'enable-puppet "cdanis testing in production I464702d8fb T381771"' [17:44:15] (03PS1) 10Hnowlan: jobqueue: disable webvideotranscodeprioritized [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101565 (https://phabricator.wikimedia.org/T371701) [17:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:17] T381771: Occasional saturation of asw2-b-eqiad / cr port uplink and cache upload usage - https://phabricator.wikimedia.org/T381771 [17:44:24] (03CR) 10Dzahn: [C:03+2] "Aware of the need to reboot, planning that for tomorrow during the window where we sometimes do phab deployments." [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:44:57] (03CR) 10Scott French: [C:03+1] jobqueue: disable webvideotranscodeprioritized [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101565 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:47:02] (03CR) 10Hnowlan: [C:03+2] jobqueue: disable webvideotranscodeprioritized [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101565 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:47:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1025.eqiad.wmnet with reason: host reimage [17:48:18] (03Merged) 10jenkins-bot: jobqueue: disable webvideotranscodeprioritized [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101565 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:51:06] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [17:52:13] !log hnowlan@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/changeprop-jobqueue: apply [17:58:28] (03PS1) 10Herron: thanos: add bool_gauge recording rules for search/wdqs update lag slos [puppet] - 10https://gerrit.wikimedia.org/r/1101558 (https://phabricator.wikimedia.org/T302995) [17:58:28] (03CR) 10Herron: [C:03+2] "self merging for slo onboarding" [puppet] - 10https://gerrit.wikimedia.org/r/1101558 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T1800) [18:00:04] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T1800). [18:05:05] (03PS1) 10Herron: pyrra: onboard wdqs/serach update lag slos [puppet] - 10https://gerrit.wikimedia.org/r/1101560 (https://phabricator.wikimedia.org/T302995) [18:05:05] (03CR) 10Herron: [C:03+2] "self merge for onboarding" [puppet] - 10https://gerrit.wikimedia.org/r/1101560 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [18:06:07] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1025.eqiad.wmnet with OS bullseye [18:06:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10391035 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye completed: - wdqs... 
[18:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:12:20] (03PS1) 10Herron: add onboarded notes to wdqs/search update lag slos [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1101567 [18:12:47] (03CR) 10Herron: [V:03+2 C:03+2] add onboarded notes to wdqs/search update lag slos [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1101567 (owner: 10Herron) [18:14:25] 06SRE, 06Traffic: Survey the third-party library market for UA policy compliance - https://phabricator.wikimedia.org/T313634#10391067 (10Scott_French) Tagging this as #Traffic for consideration as it's likely a better fit than #SRE as a whole (though I realize @CDanis may have interest in picking this back up). [18:16:39] PROBLEM - Disk space on build2001 is CRITICAL: DISK CRITICAL - free space: / 10413 MB (4% inode=79%): /tmp 10413 MB (4% inode=79%): /var/tmp 10413 MB (4% inode=79%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [18:17:38] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [18:17:44] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [18:22:27] (03CR) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [18:29:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10391170 
(10bking) It took a few tries, but `wdqs1025` is now running off UEFI. I left some notes [[ https://wikitech.wikimedia.org/wiki/Talk:UEFI_Boot#Results_... [18:45:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100158 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [18:47:39] (03PS1) 10Jdlrobson: Expand support for dark mode for anonymous users (itwiki, enwikivoyage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101573 (https://phabricator.wikimedia.org/T379352) [18:47:39] (03CR) 10Jly: [C:03+1] Fix protocol for .well-known/change-password Apache rule [puppet] - 10https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625) (owner: 10Gergő Tisza) [18:47:41] (03PS1) 10Jdlrobson: Disable QuickSurveys for recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101574 (https://phabricator.wikimedia.org/T379241) [18:47:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101573 (https://phabricator.wikimedia.org/T379352) (owner: 10Jdlrobson) [18:48:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101574 (https://phabricator.wikimedia.org/T379241) (owner: 10Jdlrobson) [18:55:09] (03PS1) 10Eevans: aqs1010: canary Cassandra 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1101576 (https://phabricator.wikimedia.org/T380420) [18:57:25] (03PS1) 10Arlolra: Add Atieno's public key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 [18:57:52] (03CR) 10Eevans: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101576 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [19:01:53] (03PS2) 10Eevans: aqs1010: canary Cassandra 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1101576 (https://phabricator.wikimedia.org/T380420) [19:03:20] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101576 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [19:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:06:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381720#10391297 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Rebalanced Power [19:22:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10391331 (10bking) @elukey I'm fine with focusing our efforts on UEFI, it seems like the best use of our time. Ping me in... [19:35:13] (03PS1) 10FNegri: WMCS: fix expr in TooManyCloud*Down [alerts] - 10https://gerrit.wikimedia.org/r/1101584 [19:36:01] (03PS2) 10FNegri: WMCS: fix expr in TooManyCloud*Down [alerts] - 10https://gerrit.wikimedia.org/r/1101584 [19:37:13] (03CR) 10CI reject: [V:04-1] WMCS: fix expr in TooManyCloud*Down [alerts] - 10https://gerrit.wikimedia.org/r/1101584 (owner: 10FNegri) [19:41:37] (03CR) 10Wangombe: Add Metrics Platform stream configuration for translate_extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [19:45:07] (03CR) 10AOkoth: "- The miscweb chart was chosen primarily just to reduce the "blast radius" of this change. 
We did not want to a change that might affect o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [19:54:13] (03CR) 10Eevans: [C:03+2] aqs1010: canary Cassandra 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1101576 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [19:58:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101541 (owner: 10Máté Szabó) [19:58:49] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs1010.eqiad.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [19:58:53] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [20:00:48] (03PS2) 10CDanis: Actually load IRS in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101541 (https://phabricator.wikimedia.org/T374105) (owner: 10Máté Szabó) [20:07:27] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs1010.eqiad.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [20:07:30] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [20:17:01] (03PS1) 10Bking: wdqs1025: enable as wdqs-internal-main host [puppet] - 10https://gerrit.wikimedia.org/r/1101588 (https://phabricator.wikimedia.org/T376150) [20:18:29] (03PS3) 10FNegri: WMCS: fix expr in TooManyCloud*Down [alerts] - 10https://gerrit.wikimedia.org/r/1101584 (https://phabricator.wikimedia.org/T381807) [20:19:45] (03CR) 10CI reject: [V:04-1] WMCS: fix expr in TooManyCloud*Down [alerts] - 10https://gerrit.wikimedia.org/r/1101584 (https://phabricator.wikimedia.org/T381807) (owner: 10FNegri) [20:21:12] (03PS4) 10FNegri: WMCS: fix expr in TooManyCloud*Down [alerts] - 
10https://gerrit.wikimedia.org/r/1101584 (https://phabricator.wikimedia.org/T381807) [20:22:26] (03CR) 10CI reject: [V:04-1] WMCS: fix expr in TooManyCloud*Down [alerts] - 10https://gerrit.wikimedia.org/r/1101584 (https://phabricator.wikimedia.org/T381807) (owner: 10FNegri) [20:23:36] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@1d9b4b5]: Canary events generation: pooling [20:24:25] (03PS5) 10FNegri: WMCS: fix expr in TooManyCloud*Down [alerts] - 10https://gerrit.wikimedia.org/r/1101584 (https://phabricator.wikimedia.org/T381807) [20:25:23] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@1d9b4b5]: Canary events generation: pooling (duration: 01m 46s) [20:33:03] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10391531 (10KFrancis) Hi all, I am confirming the NDA is complete. Please proceed with next steps. Thanks! [20:33:20] (03PS6) 10FNegri: WMCS: fix expr in TooManyCloud*Down [alerts] - 10https://gerrit.wikimedia.org/r/1101584 (https://phabricator.wikimedia.org/T381807) [20:35:03] (03PS7) 10FNegri: WMCS: fix expr in TooManyCloud*Down [alerts] - 10https://gerrit.wikimedia.org/r/1101584 (https://phabricator.wikimedia.org/T381807) [20:46:09] (03PS8) 10Bking: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) [20:51:57] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10391566 (10Scott_French) 05Stalled→03In progress a:03Scott_French Great, thank you! I'll take this from here. 
[20:58:56] (03PS9) 10Bking: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T2100). Please do the needful. [21:00:05] anzx, ebernhardson, Jdlrobson, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] \o [21:01:35] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10391603 (10Eevans) There is History™ here, see: {T362841}. The original drive that failed then was `/dev/sdg` (disk:2 of the second controller). Another disk was pulled in the process `sdf` (disk:1, second c... [21:02:09] o/ [21:03:54] hi [21:05:40] I'd prefer not to be the deployer, as it's late here [21:06:05] hi - sorry to be late - i'll deploy [21:06:22] thanks cjming! [21:06:35] np! [21:06:51] anzx: are you around? [21:07:08] cjming: any chance we could start with mine, if it's the same to the others? [21:07:15] sure! [21:07:19] thanks cjming for running it today! 
[21:07:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101541 (https://phabricator.wikimedia.org/T374105) (owner: 10Máté Szabó) [21:08:39] (03Merged) 10jenkins-bot: Actually load IRS in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101541 (https://phabricator.wikimedia.org/T374105) (owner: 10Máté Szabó) [21:08:57] thx [21:08:58] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1101541|Actually load IRS in production (T374105)]] [21:09:02] T374105: Incident Reporting System - MVP - https://phabricator.wikimedia.org/T374105 [21:12:51] kostajh: up on test servers if it's testable [21:13:20] cjming: thanks, looking [21:13:31] !log cjming@deploy2002 cjming, mszabo: Backport for [[gerrit:1101541|Actually load IRS in production (T374105)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:14:45] cjming: lgtm [21:14:50] cool - syncing [21:14:52] !log cjming@deploy2002 cjming, mszabo: Continuing with sync [21:17:04] ebernhardson: i'll do yours next - can you rebase? [21:17:25] cjming: sure, sec [21:17:44] (03PS4) 10Ebernhardson: cirrus: Enable mlr-2024 for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100158 (https://phabricator.wikimedia.org/T377128) [21:21:27] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101541|Actually load IRS in production (T374105)]] (duration: 12m 29s) [21:21:31] T374105: Incident Reporting System - MVP - https://phabricator.wikimedia.org/T374105 [21:21:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100158 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [21:22:05] kostajh: should be live :) [21:22:10] cjming: thanks! [21:22:17] yw! 
[21:22:42] (03Merged) 10jenkins-bot: cirrus: Enable mlr-2024 for select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100158 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [21:22:56] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1100158|cirrus: Enable mlr-2024 for select wikis (T377128)]] [21:23:00] T377128: Import recent MLR models built by MjoLniR in production and test them - https://phabricator.wikimedia.org/T377128 [21:25:58] (03PS1) 10Scott French: admin: Add suzannewood to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1101591 (https://phabricator.wikimedia.org/T380487) [21:25:58] (03CR) 10Scott French: "Thanks in advance, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1101591 (https://phabricator.wikimedia.org/T380487) (owner: 10Scott French) [21:26:34] ebernhardson: on mwdebug if verifiable [21:26:51] cjming: sorta, looking [21:27:13] (03PS2) 10Jdlrobson: Expand support for dark mode for anonymous users (itwiki, enwikivoyage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101573 (https://phabricator.wikimedia.org/T379352) [21:27:15] !log cjming@deploy2002 cjming, ebernhardson: Backport for [[gerrit:1100158|cirrus: Enable mlr-2024 for select wikis (T377128)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:27:28] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:27:32] cjming: seems reasonable [21:27:40] great [21:27:44] !log cjming@deploy2002 cjming, ebernhardson: Continuing with sync [21:28:57] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:29:54] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:31:33] (03CR) 10RLazarus: [C:03+1] admin: Add suzannewood to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1101591 
(https://phabricator.wikimedia.org/T380487) (owner: 10Scott French) [21:32:57] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:33:25] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100158|cirrus: Enable mlr-2024 for select wikis (T377128)]] (duration: 10m 28s) [21:33:28] cjming: o/ [21:33:29] T377128: Import recent MLR models built by MjoLniR in production and test them - https://phabricator.wikimedia.org/T377128 [21:33:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101573 (https://phabricator.wikimedia.org/T379352) (owner: 10Jdlrobson) [21:34:04] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:34:09] (03CR) 10Scott French: [C:03+2] admin: Add suzannewood to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1101591 (https://phabricator.wikimedia.org/T380487) (owner: 10Scott French) [21:34:22] hi anzx: just finishing up Jdlrobson's patches and we can do yours - maybe in 10-15 minutes? [21:34:33] sure [21:34:37] (03Merged) 10jenkins-bot: Expand support for dark mode for anonymous users (itwiki, enwikivoyage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101573 (https://phabricator.wikimedia.org/T379352) (owner: 10Jdlrobson) [21:34:45] ebernhardson: your patch should be live :) [21:34:51] cjming: awesome, thanks! 
[21:34:54] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1101573|Expand support for dark mode for anonymous users (itwiki, enwikivoyage) (T379352)]] [21:34:57] T379352: [Spike] Evaluate and provide feedback on itwiki automatic night mode color-darkening - https://phabricator.wikimedia.org/T379352 [21:38:09] Jdlrobson: 1st patch up on test servers if you want to check [21:38:44] !log cjming@deploy2002 jdlrobson, cjming: Backport for [[gerrit:1101573|Expand support for dark mode for anonymous users (itwiki, enwikivoyage) (T379352)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:39:04] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:39:13] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:39:54] cjming: on it [21:40:18] cjming: that one looks good to sync! [21:40:24] !log cjming@deploy2002 jdlrobson, cjming: Continuing with sync [21:40:51] (03PS2) 10Jdlrobson: Disable QuickSurveys for recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101574 (https://phabricator.wikimedia.org/T379241) [21:41:28] Jdlrobson: can you rebase your 2nd patch? [21:41:34] cjming: done [21:41:39] ty [21:41:55] cjming: I also have a beta cluster patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1101094?usp=search that I need to get merged (I don't know if that's a case of you just hitting +2... ? If it's more than that I can get someone to merge that after the deploy window is done. 
[21:42:09] (typo fix) [21:42:29] np - i can do that real quick [21:44:03] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:46:02] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101573|Expand support for dark mode for anonymous users (itwiki, enwikivoyage) (T379352)]] (duration: 11m 08s) [21:46:09] T379352: [Spike] Evaluate and provide feedback on itwiki automatic night mode color-darkening - https://phabricator.wikimedia.org/T379352 [21:46:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101574 (https://phabricator.wikimedia.org/T379241) (owner: 10Jdlrobson) [21:46:29] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:46:53] (03Merged) 10jenkins-bot: Disable QuickSurveys for recommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101574 (https://phabricator.wikimedia.org/T379241) (owner: 10Jdlrobson) [21:46:54] thanks cjming i really appreciate it! [21:47:06] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1101574|Disable QuickSurveys for recommendations (T379241 T380778)]] [21:47:16] T379241: Set up quicksurveys for non-UI experiment pt 2 - https://phabricator.wikimedia.org/T379241 [21:47:17] T380778: Simple summary experiment - Rerun QuickSurvey for browser extension - https://phabricator.wikimedia.org/T380778 [21:47:32] <_Gerges> Hi [21:49:17] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10391918 (10Scott_French) 05In progress→03Resolved This should all be done now. I'll follow up on T380994 for the next part. 
[21:49:42] Jdlrobson: happy to help - it's all because of your encouragement that i'm even part of the regular deployment roster 😀 [21:50:29] 2nd patch on test servers [21:51:08] <_Gerges> If Deployer had time to Deploy my patch https://gerrit.wikimedia.org/r/c/mediawiki/extensions/UniversalLanguageSelector/+/1101592 [21:51:09] (03PS2) 10Jdlrobson: Fixes A/B test for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101094 (https://phabricator.wikimedia.org/T378115) [21:51:11] !log cjming@deploy2002 cjming, jdlrobson: Backport for [[gerrit:1101574|Disable QuickSurveys for recommendations (T379241 T380778)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:51:35] cjming: that one also looks good to sync! thanks! [21:51:40] cool beans [21:51:41] !log cjming@deploy2002 cjming, jdlrobson: Continuing with sync [21:51:48] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [21:52:16] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [21:57:22] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101574|Disable QuickSurveys for recommendations (T379241 T380778)]] (duration: 10m 15s) [21:57:27] T379241: Set up quicksurveys for non-UI experiment pt 2 - https://phabricator.wikimedia.org/T379241 [21:57:28] T380778: Simple summary experiment - Rerun QuickSurvey for browser extension - https://phabricator.wikimedia.org/T380778 [21:57:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101094 (https://phabricator.wikimedia.org/T378115) (owner: 10Jdlrobson) [21:58:01] hi Gerges - if you can get a +2 on your patch, it should ride the train this week -- if you need it on 1.44.0-wmf.6, please create the backport patches and add them to one of the deployment windows -- i still have to do a few more patches for anzx 
[21:58:22] (03Merged) 10jenkins-bot: Fixes A/B test for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101094 (https://phabricator.wikimedia.org/T378115) (owner: 10Jdlrobson) [21:58:25] anzx: still around? [21:58:40] cjming: yes, i am around [21:58:55] (03PS7) 10Anzx: jawiki: lift IP cap on 2024-12-17 and 2025-01-14 for Edit-a-ton [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101231 (https://phabricator.wikimedia.org/T381729) [21:59:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101231 (https://phabricator.wikimedia.org/T381729) (owner: 10Anzx) [22:00:05] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241209T2200). [22:00:07] (03Merged) 10jenkins-bot: jawiki: lift IP cap on 2024-12-17 and 2025-01-14 for Edit-a-ton [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101231 (https://phabricator.wikimedia.org/T381729) (owner: 10Anzx) [22:00:10] cjming: no need for checking on this, you can sync [22:00:16] cool - thx [22:00:29] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1101231|jawiki: lift IP cap on 2024-12-17 and 2025-01-14 for Edit-a-ton (T381729)]] [22:00:33] T381729: Lift IP cap on 2024-12-17 and 2025-01-14 for Editation for jawiki - https://phabricator.wikimedia.org/T381729 [22:01:15] Jdlrobson: all your patches should be live, including the beta cluster one [22:03:28] thanks a bunch cjming really appreciate all your help here! [22:03:32] yw! 
[22:05:10] !log cjming@deploy2002 cjming, anzx: Backport for [[gerrit:1101231|jawiki: lift IP cap on 2024-12-17 and 2025-01-14 for Edit-a-ton (T381729)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:05:11] !log cjming@deploy2002 cjming, anzx: Continuing with sync [22:05:56] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10391965 (10Scott_French) a:03Scott_French [22:07:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10391970 (10Scott_French) [22:07:36] Gerges: i don't think there will be time for your backports -- i have one more config patch and we're already running over -- if you need backports for ULS on 1.44.0-wmf.6, you'll need to create those patches after you get the master patch merged. [22:08:01] (03PS3) 10Anzx: idwikivoyage: add timezone, sitename and project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101185 (https://phabricator.wikimedia.org/T381080) [22:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:10:32] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101231|jawiki: lift IP cap on 2024-12-17 and 2025-01-14 for Edit-a-ton (T381729)]] (duration: 10m 02s) [22:10:37] T381729: Lift IP cap on 2024-12-17 and 2025-01-14 for Editation for jawiki - https://phabricator.wikimedia.org/T381729 [22:11:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101185 
(https://phabricator.wikimedia.org/T381080) (owner: 10Anzx) [22:11:54] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10391996 (10Scott_French) @odimitrijevic @Milimetric @Ahoelzl @Ottomata - Could one of you please approve access to `ana... [22:12:03] (03Merged) 10jenkins-bot: idwikivoyage: add timezone, sitename and project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101185 (https://phabricator.wikimedia.org/T381080) (owner: 10Anzx) [22:12:19] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1101185|idwikivoyage: add timezone, sitename and project namespace (T381080)]] [22:12:24] T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080 [22:12:36] <_Gerges> cjming: Do should to upload woff font files? [22:13:17] Gerges: i'm not sure - i've never dealt with those before [22:13:19] (03PS1) 10Daimona Eaytoy: beta: Enable $wgCampaignEventsEnableEventWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101596 (https://phabricator.wikimedia.org/T380077) [22:14:04] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10392008 (10Ottomata) Approved! 
[22:14:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101596 (https://phabricator.wikimedia.org/T380077) (owner: 10Daimona Eaytoy) [22:15:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10392011 (10Ottomata) > Also, if WMDE staff are similarly covered by the recent streamlining in T370424, it would be gre... [22:16:15] anzx: 2nd patch up on test servers [22:16:21] cjming: checking [22:16:24] !log cjming@deploy2002 cjming, anzx: Backport for [[gerrit:1101185|idwikivoyage: add timezone, sitename and project namespace (T381080)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:17:21] cjming: looks good [22:17:25] !log cjming@deploy2002 cjming, anzx: Continuing with sync [22:17:39] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10392013 (10Scott_French) [22:21:08] cjming: need to run namespacedupes for idwikivoyage after deploy [22:21:28] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10392028 (10Scott_French) Great, thank you very much @Ottomata. 
[22:22:50] ah - thanks for the reminder - will do [22:23:06] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101185|idwikivoyage: add timezone, sitename and project namespace (T381080)]] (duration: 10m 46s) [22:23:10] T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080 [22:25:28] (03PS1) 10Scott French: admin: add suzannewood to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1101595 (https://phabricator.wikimedia.org/T380994) [22:25:28] (03CR) 10Scott French: "Thanks in advance for the review, Reuven." [puppet] - 10https://gerrit.wikimedia.org/r/1101595 (https://phabricator.wikimedia.org/T380994) (owner: 10Scott French) [22:26:39] anzx: your patches should be live - i ran namespacedupes for idwikivoyage [22:27:37] cjming: thanks for deployment [22:27:45] yw :) [22:28:23] (03CR) 10RLazarus: [C:03+1] admin: add suzannewood to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1101595 (https://phabricator.wikimedia.org/T380994) (owner: 10Scott French) [22:28:50] !log end of UTC late backport window [22:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:35] !log [wdqs-internal graph split] Cleared away old categories units on 5 hosts (`wdqs20[18-20],wdqs202[6-7]`) [22:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:29] (03CR) 10Scott French: [C:03+2] admin: add suzannewood to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1101595 (https://phabricator.wikimedia.org/T380994) (owner: 10Scott French) [22:34:28] RESOLVED: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:40:52] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting 
access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994#10392073 (10Scott_French) 05Stalled→03Resolved Alright, this should now be complete, though the underlying chang... [22:46:14] <_Gerges> cjming: my patch 1101592, Should I change to wmf/* branch? [22:53:58] 06SRE, 06Data-Engineering, 06Data-Platform-SRE: Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824 (10Scott_French) 03NEW [22:54:48] 06SRE, 06Data-Platform-SRE, 10Data-Engineering (Q2 2024 October 1st - December 31th): Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10392108 (10Scott_French) See T381824 for potentially extending the same streamlining to WMDE staff. [23:15:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:15:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:50:33] (03PS1) 10Bvibber: LanguageConverter: Ignore content inside and elements [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101600 (https://phabricator.wikimedia.org/T381617) [23:52:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101600 (https://phabricator.wikimedia.org/T381617) (owner: 10Bvibber)