[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T0000) [00:01:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 957.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:03:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:06:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 947.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:10:12] (03PS1) 10Zabe: Prepare sylwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122267 (https://phabricator.wikimedia.org/T386441) [00:10:14] (03PS1) 10Zabe: Activate sylwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122268 (https://phabricator.wikimedia.org/T386441) [00:12:05] (03CR) 10Zabe: [C:03+2] Prepare sylwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122267 (https://phabricator.wikimedia.org/T386441) (owner: 10Zabe) [00:13:29] (03Merged) 10jenkins-bot: Prepare sylwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122267 (https://phabricator.wikimedia.org/T386441) (owner: 10Zabe) [00:13:51] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1122267|Prepare sylwiki (T386441)]] [00:13:54] T386441: Create Wikipedia Sylheti - 
https://phabricator.wikimedia.org/T386441 [00:16:28] !log zabe@deploy2002 zabe: Backport for [[gerrit:1122267|Prepare sylwiki (T386441)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:16:39] !log zabe@deploy2002 zabe: Continuing with sync [00:23:17] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122267|Prepare sylwiki (T386441)]] (duration: 09m 25s) [00:23:21] T386441: Create Wikipedia Sylheti - https://phabricator.wikimedia.org/T386441 [00:25:06] (03CR) 10Zabe: [C:03+2] Activate sylwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122268 (https://phabricator.wikimedia.org/T386441) (owner: 10Zabe) [00:25:52] (03Merged) 10jenkins-bot: Activate sylwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122268 (https://phabricator.wikimedia.org/T386441) (owner: 10Zabe) [00:26:45] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1122268|Activate sylwiki (T386441)]] [00:29:26] !log zabe@deploy2002 zabe: Backport for [[gerrit:1122268|Activate sylwiki (T386441)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:29:30] T386441: Create Wikipedia Sylheti - https://phabricator.wikimedia.org/T386441 [00:29:51] !log zabe@deploy2002 zabe: Continuing with sync [00:33:14] (03PS5) 10Scott French: php8.1: use pcre2 backport [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120588 (https://phabricator.wikimedia.org/T386006) [00:33:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:36:34] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122268|Activate sylwiki (T386441)]] (duration: 09m 48s) [00:36:38] T386441: Create 
Wikipedia Sylheti - https://phabricator.wikimedia.org/T386441 [00:38:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1122270 [00:38:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1122270 (owner: 10TrainBranchBot) [00:39:19] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122271 (https://phabricator.wikimedia.org/T386441) [00:39:21] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122271 (https://phabricator.wikimedia.org/T386441) (owner: 10Zabe) [00:40:43] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122271 (https://phabricator.wikimedia.org/T386441) (owner: 10Zabe) [00:41:10] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1122271|Update interwiki cache (T386441)]] [00:42:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.079s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:43:51] !log zabe@deploy2002 zabe: Backport for [[gerrit:1122271|Update interwiki cache (T386441)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:43:55] T386441: Create Wikipedia Sylheti - https://phabricator.wikimedia.org/T386441 [00:44:26] !log zabe@deploy2002 zabe: Continuing with sync [00:45:25] (03CR) 10Scott French: [C:03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1122259 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [00:47:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.091s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:48:53] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1122270 (owner: 10TrainBranchBot) [00:51:02] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122271|Update interwiki cache (T386441)]] (duration: 09m 51s) [00:51:05] T386441: Create Wikipedia Sylheti - https://phabricator.wikimedia.org/T386441 [00:54:08] (03CR) 10RLazarus: [C:03+2] deployment_server: Add mw-script-restricted config to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1122259 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [01:06:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.005s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:08:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1122273 [01:08:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1122273 (owner: 10TrainBranchBot) [01:16:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.046s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded 
[01:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10577714 (10phaultfinder) [01:29:37] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1122273 (owner: 10TrainBranchBot) [01:51:13] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:53:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:54:46] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding backup2013-4 to codfw - jhancock@cumin2002" [01:54:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding backup2013-4 to codfw - jhancock@cumin2002" [01:54:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:55:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host backup2013 [01:55:34] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host backup2014 [01:55:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup2013 [01:55:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup2014 [01:56:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:56:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:07:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:08:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.18 [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1122276 (https://phabricator.wikimedia.org/T382369) [02:08:39] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.18 [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1122276 (https://phabricator.wikimedia.org/T382369) (owner: 10TrainBranchBot) [02:13:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:14:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:15:11] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10577756 (10Jhancock.wm) [02:15:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:17:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:19:20] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.18 [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1122276 (https://phabricator.wikimedia.org/T382369) (owner: 10TrainBranchBot) [02:23:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:25:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:28:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.181s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:33:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.128s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:34:16] (03PS2) 10Huji: New alias for Project namespace on Persian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122278 (https://phabricator.wikimedia.org/T387185) [02:35:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:35:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.085s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:42:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:42:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:43:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.005s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:43:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:44:10] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2014'] [02:44:12] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2013'] [02:44:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['backup2013'] [02:44:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['backup2014'] [02:45:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2013.codfw.wmnet with OS bookworm [02:45:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2014.codfw.wmnet with OS bookworm [02:45:25] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10577765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm [02:45:25] 
10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10577766 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm [02:56:56] (03PS1) 10Pppery: Add various settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122279 (https://phabricator.wikimedia.org/T386464) [02:57:22] (03PS2) 10Pppery: Add various settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122279 (https://phabricator.wikimedia.org/T386464) [02:58:33] (03PS3) 10Pppery: Add various settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122279 (https://phabricator.wikimedia.org/T386464) [02:58:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate grafana-labs.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T0300) [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:13:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:28:50] (03PS4) 10Pppery: Add various settings for new 
wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122279 (https://phabricator.wikimedia.org/T386464) [03:52:22] FIRING: [5x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T0400) [04:02:47] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122280 (https://phabricator.wikimedia.org/T382369) [04:02:48] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122280 (https://phabricator.wikimedia.org/T382369) (owner: 10TrainBranchBot) [04:03:37] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122280 (https://phabricator.wikimedia.org/T382369) (owner: 10TrainBranchBot) [04:04:05] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.18 refs T382369 [04:04:09] T382369: 1.44.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T382369 [04:05:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2013.codfw.wmnet with OS bookworm [04:05:35] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2014.codfw.wmnet with OS bookworm [04:05:36] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10577831 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm executed with errors: - backu... 
[04:05:41] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10577832 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm executed with errors: - backu... [04:18:19] PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/dabe007b656e0142cc64c917886173235ebfd151464d06b58cb88e4e1ea40743/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [04:38:19] RECOVERY - Disk space on deploy2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [04:51:29] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.18 refs T382369 (duration: 47m 24s) [04:51:33] T382369: 1.44.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T382369 [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T0500) [05:03:02] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.15 (duration: 02m 59s) [05:19:16] (03PS5) 10Pppery: Add various settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122279 (https://phabricator.wikimedia.org/T386464) [05:53:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:08:37] !log Sanitize sylwiki T386463 [06:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:41] T386463: Prepare and check storage 
layer for sylwiki - https://phabricator.wikimedia.org/T386463 [06:11:53] (03PS1) 10Marostegui: s2-pager: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1122288 [06:13:26] (03CR) 10Marostegui: [C:03+2] s2-pager: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1122288 (owner: 10Marostegui) [06:13:52] (03Merged) 10jenkins-bot: s2-pager: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1122288 (owner: 10Marostegui) [06:22:51] (03CR) 10Marostegui: clone.py: Add helper functions for later use (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [06:23:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:32:28] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10577942 (10Marostegui) Thank you! 
[06:36:37] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [06:51:29] (03PS2) 10Anzx: sylwiki: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122473 (https://phabricator.wikimedia.org/T386464) [06:52:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122473 (https://phabricator.wikimedia.org/T386464) (owner: 10Anzx) [06:58:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate grafana-labs.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T0700) [07:00:05] marostegui, Amir1, and federico3: May I have your attention please! Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T0700) [07:16:59] (03PS2) 10Anzx: lift of IP cap for UCLA Library event - 3/5/2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122478 (https://phabricator.wikimedia.org/T387181) [07:17:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122478 (https://phabricator.wikimedia.org/T387181) (owner: 10Anzx) [07:28:11] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [07:34:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3500 MB (3% inode=98%): /tmp 3500 MB (3% inode=98%): /var/tmp 3500 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [07:52:22] FIRING: [5x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:57:43] (03PS15) 10Vgutierrez: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) [07:57:55] (03CR) 10Vgutierrez: sre.loadbalancer: Add migrate-service-ipip cookbook (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [08:00:05] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T0800). 
[08:00:05] LD and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:11] o/ [08:00:41] I can deploy today [08:02:01] (03CR) 10Vgutierrez: [C:04-1] hiera: send haproxy silent-drop logs to benthos (cp-upload_ulsfo) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:02:36] LD: patch looks good but are you around? Or does someone else involved in the frwiki campaigns enablement want to stand in? [08:03:17] anzx: I can start with your logo patch [08:03:22] ok [08:04:30] (03PS4) 10Fabfur: hiera: send haproxy silent-drop logs to benthos (cp-upload_ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) [08:04:31] the text seems to disappear on a dark background, jfyi. maybe test in dark mode once it's on the test server [08:04:39] (03CR) 10Fabfur: hiera: send haproxy silent-drop logs to benthos (cp-upload_ulsfo) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:05:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122473 (https://phabricator.wikimedia.org/T386464) (owner: 10Anzx) [08:05:52] (03Merged) 10jenkins-bot: sylwiki: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122473 (https://phabricator.wikimedia.org/T386464) (owner: 10Anzx) [08:06:50] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1122473|sylwiki: add logo (T386464)]] [08:06:54] T386464: Post-creation work for sylwiki - https://phabricator.wikimedia.org/T386464 [08:08:09] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) (owner: 
10Fabfur) [08:11:12] (I don't know if scap is updating logos.php automatically?) [08:13:12] Ah this was already included in the patch, sorry for the noise... [08:13:14] (03PS5) 10Fabfur: hiera: send haproxy silent-drop logs to benthos (cp-upload_ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) [08:13:27] !log awight@deploy2002 anzx, awight: Backport for [[gerrit:1122473|sylwiki: add logo (T386464)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:13:30] awight: checking [08:13:31] T386464: Post-creation work for sylwiki - https://phabricator.wikimedia.org/T386464 [08:13:53] anzx: Looks good to me, also works in dark mode [08:14:08] awight: looks good [08:14:12] ack [08:14:12] (03CR) 10Jelto: [C:03+2] aptrepo: update gitlab-runner Suite to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1119718 (https://phabricator.wikimedia.org/T386297) (owner: 10Jelto) [08:14:16] !log awight@deploy2002 anzx, awight: Continuing with sync [08:14:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3232 MB (3% inode=98%): /tmp 3232 MB (3% inode=98%): /var/tmp 3232 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [08:20:01] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 36547328 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:21:01] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2320728 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:22:46] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122473|sylwiki: add logo (T386464)]] (duration: 15m 56s) [08:22:50] T386464: Post-creation work for sylwiki - 
https://phabricator.wikimedia.org/T386464 [08:25:47] anzx: Deploying the throttle exception now [08:27:17] awight: no need for testing on throttle change [08:27:24] ack, thanks [08:27:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122478 (https://phabricator.wikimedia.org/T387181) (owner: 10Anzx) [08:28:25] (03Merged) 10jenkins-bot: lift of IP cap for UCLA Library event - 3/5/2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122478 (https://phabricator.wikimedia.org/T387181) (owner: 10Anzx) [08:28:52] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1122478|lift of IP cap for UCLA Library event - 3/5/2025 (T387181)]] [08:28:56] T387181: Requesting temporary lift of IP cap for UCLA Library event - 3/5/2025 - https://phabricator.wikimedia.org/T387181 [08:30:50] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:33:19] !log awight@deploy2002 awight, anzx: Backport for [[gerrit:1122478|lift of IP cap for UCLA Library event - 3/5/2025 (T387181)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:33:33] (03PS1) 10Brouberol: airflow-test-k8s: temporarily mimic airflow-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122514 (https://phabricator.wikimedia.org/T386282) [08:33:38] !log awight@deploy2002 awight, anzx: Continuing with sync [08:34:35] (03PS2) 10Brouberol: airflow-test-k8s: temporarily mimic airflow-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122514 (https://phabricator.wikimedia.org/T386282) [08:40:36] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122478|lift of IP cap for UCLA Library event - 3/5/2025 (T387181)]] (duration: 11m 43s) [08:40:40] T387181: Requesting temporary lift of IP cap for UCLA Library event - 3/5/2025 - 
https://phabricator.wikimedia.org/T387181 [08:40:44] awight: thanks for deploying [08:40:55] gladly! [08:41:21] !log UTC morning backport finished [08:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:58] (03PS1) 10Volans: service_catalog: allow to refresh from disk [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122516 [08:42:24] (03CR) 10Vgutierrez: [C:03+1] hiera: send haproxy silent-drop logs to benthos (cp-upload_ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:42:49] (03CR) 10Fabfur: [C:03+2] hiera: send haproxy silent-drop logs to benthos (cp-upload_ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:42:55] (03CR) 10Fabfur: [C:03+2] "tnx!" [puppet] - 10https://gerrit.wikimedia.org/r/1122157 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:51:09] (03CR) 10Volans: "Nice addition! 
Some comments/replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [08:54:17] (03PS1) 10Jelto: sre.gitlab.upgrade: add a prompt before backups on replica [cookbooks] - 10https://gerrit.wikimedia.org/r/1122520 [09:03:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:06:34] (03CR) 10Jelto: [V:03+1] "`" [cookbooks] - 10https://gerrit.wikimedia.org/r/1122520 (owner: 10Jelto) [09:09:51] !log elukey@puppetserver1001 conftool action : set/weight=5; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [09:10:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db[1155,1158].eqiad.wmnet with reason: maintenance [09:10:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1158', diff saved to https://phabricator.wikimedia.org/P73521 and previous config saved to /var/cache/conftool/dbconfig/20250225-091025-marostegui.json [09:11:02] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1158.eqiad.wmnet [09:14:55] Hi! We want to run a maintenance script to add wikidata support for a new language wikipedia. 
Let us know if this is a bad time, otherwise we will proceed (#wikidata-for-wikimedia-projects at WMDE) https://phabricator.wikimedia.org/T386468 [09:15:57] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:16:01] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:17:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1158.eqiad.wmnet [09:19:01] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Index rebuild [09:24:41] (03CR) 10Elukey: "Checked hostnames and IP ranges, LGTM. I left a couple of comments related to the service.yaml changes, lemme know :)" [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [09:24:48] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006#10578164 (10ayounsi) 05Open→03Resolved a:03ayounsi No more errors. 
[09:26:00] (03CR) 10Elukey: [C:03+1] service_catalog: allow to refresh from disk [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122516 (owner: 10Volans) [09:26:14] (03CR) 10Elukey: [C:03+2] knative-serving: fix drop capabilities [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122129 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:27:24] (03CR) 10Volans: [C:03+2] service_catalog: allow to refresh from disk [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122516 (owner: 10Volans) [09:27:56] !log suzannewood@mwmaint2002:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https [09:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:13] !log elukey@puppetserver1001 conftool action : set/weight=5; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [09:30:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:33:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:36:20] (03PS1) 10Cathal Mooney: WMF-Plugin: Potential clean-up of b-end circuit finding logic [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) [09:36:27] (03CR) 10CI reject: [V:04-1] WMF-Plugin: Potential clean-up of b-end circuit finding logic [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 
(https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [09:37:33] (03Merged) 10jenkins-bot: service_catalog: allow to refresh from disk [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122516 (owner: 10Volans) [09:44:44] (03PS1) 10Volans: CHANGELOG: add changelogs for release v9.1.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122525 [09:45:22] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v9.1.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122525 (owner: 10Volans) [09:46:55] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:47:55] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:49:15] (03CR) 10Effie Mouzeli: [C:03+1] "nice catch!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120588 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [09:50:09] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10578273 (10JMeybohm) >>! In T384731#10566953, @fgiunchedi wrote: >>>! In T384731#10563685, @ayounsi wrote: >> >>... [09:50:18] !log Finished populateSitesTable for [sylwiki] https://phabricator.wikimedia.org/T386468 [09:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:59] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/994164 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:51:48] (03CR) 10Filippo Giunchedi: [C:03+1] P:firewall absent check_conntrack script. 
[puppet] - 10https://gerrit.wikimedia.org/r/1087379 (https://phabricator.wikimedia.org/T374827) (owner: 10Slyngshede) [09:53:23] (03CR) 10Filippo Giunchedi: [C:03+1] P:systemd::timesyncd absent monitoring, handled by AlertManager [puppet] - 10https://gerrit.wikimedia.org/r/994172 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:55:48] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v9.1.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122525 (owner: 10Volans) [09:57:28] (03PS1) 10Volans: Upstream release v9.1.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1122529 [09:57:40] (03CR) 10Volans: [C:03+2] Upstream release v9.1.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1122529 (owner: 10Volans) [10:00:08] We are finished with the maintenance scripts [10:01:07] (03PS1) 10Marostegui: Revert "x1: Change format to STATEMENT" [puppet] - 10https://gerrit.wikimedia.org/r/1122530 [10:02:23] (03CR) 10Marostegui: [C:03+2] Revert "x1: Change format to STATEMENT" [puppet] - 10https://gerrit.wikimedia.org/r/1122530 (owner: 10Marostegui) [10:03:34] !log Move x1 back to RBR T385645 [10:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:37] T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645 [10:05:41] (03CR) 10Stevemunene: [C:03+1] airflow-test-k8s: temporarily mimic airflow-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122514 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [10:08:15] (03Merged) 10jenkins-bot: Upstream release v9.1.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1122529 (owner: 10Volans) [10:10:05] (03PS1) 10Marostegui: db-production.php: Disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122533 (https://phabricator.wikimedia.org/T376905) [10:19:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:00:00 on db1169.eqiad.wmnet with reason: maintenance [10:19:28] (03CR) 10Cathal Mooney: "Overall looks good to me. As we talked about on irc I think there are further improvements we can make with this as the starting point. " [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [10:19:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P73525 and previous config saved to /var/cache/conftool/dbconfig/20250225-101956-marostegui.json [10:20:20] !log Upgrade db1169 to 10.6.21 T385678 [10:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:25] T385678: Compile and package MariaDB 10.11.11 and MariaDB 10.6.21 - https://phabricator.wikimedia.org/T385678 [10:22:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1177', diff saved to https://phabricator.wikimedia.org/P73526 and previous config saved to /var/cache/conftool/dbconfig/20250225-102159-marostegui.json [10:22:20] !log Upgrade db1177 to 10.6.21 T385678 [10:22:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1177.eqiad.wmnet with reason: maintenance [10:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73527 and previous config saved to /var/cache/conftool/dbconfig/20250225-102422-root.json [10:25:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73528 and previous config saved to /var/cache/conftool/dbconfig/20250225-102522-root.json [10:28:38] (03CR) 10Effie Mouzeli: [C:03+2] php8.1: use pcre2 backport [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120588 
(https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [10:35:24] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: temporarily mimic airflow-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122514 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [10:37:07] (03CR) 10Ayounsi: [C:03+1] "That's great !!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [10:38:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:38:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1031 to es3 master', diff saved to https://phabricator.wikimedia.org/P73529 and previous config saved to /var/cache/conftool/dbconfig/20250225-103849-root.json [10:39:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:39:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on es1034.eqiad.wmnet with reason: maintenance [10:39:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73530 and previous config saved to /var/cache/conftool/dbconfig/20250225-103928-root.json [10:39:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1034', diff saved to https://phabricator.wikimedia.org/P73531 and previous config saved to /var/cache/conftool/dbconfig/20250225-103945-marostegui.json [10:39:51] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1034.eqiad.wmnet [10:40:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73532 and previous config saved to /var/cache/conftool/dbconfig/20250225-104028-root.json [10:42:06] (03CR) 10Stevemunene: [C:03+1] analytics/html: update readme for MW history dump [puppet] - 
10https://gerrit.wikimedia.org/r/1102848 (https://phabricator.wikimedia.org/T381390) (owner: 10Milimetric) [10:47:54] jouncebot: nowandnext [10:47:54] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [10:47:54] In 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1100) [10:48:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool es1034', diff saved to https://phabricator.wikimedia.org/P73533 and previous config saved to /var/cache/conftool/dbconfig/20250225-104840-marostegui.json [10:48:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1034.eqiad.wmnet [10:49:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1034 to es3 master', diff saved to https://phabricator.wikimedia.org/P73534 and previous config saved to /var/cache/conftool/dbconfig/20250225-104908-root.json [10:54:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73535 and previous config saved to /var/cache/conftool/dbconfig/20250225-105433-root.json [10:54:58] (03CR) 10Hnowlan: [C:03+2] trafficserver: use mobileapps directly for hewiki APIs [puppet] - 10https://gerrit.wikimedia.org/r/1117508 (https://phabricator.wikimedia.org/T372746) (owner: 10Hnowlan) [10:55:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73536 and previous config saved to /var/cache/conftool/dbconfig/20250225-105534-root.json [10:57:22] RESOLVED: [5x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1100) [11:02:37] 06SRE, 06Infrastructure-Foundations, 10netops: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10578486 (10cmooney) 05Open→03Resolved Gonna close this one at this point. All has been ok in eqiad and codfw since the increase in thread count last week - gaps are no... [11:05:42] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database satwiktionary (T386634) [11:05:45] T386634: [wikireplicas] Create views for new wiki satwiktionary - https://phabricator.wikimedia.org/T386634 [11:08:04] (03CR) 10Effie Mouzeli: [V:03+2 C:03+2] php8.1: use pcre2 backport [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120588 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [11:08:20] (03PS16) 10Vgutierrez: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) [11:08:32] !log Switched hewiki mobileapps APIs to rest-gateway, removing restbase from path [11:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:09] (03PS17) 10Vgutierrez: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) [11:09:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73537 and previous config saved to /var/cache/conftool/dbconfig/20250225-110938-root.json [11:09:51] (03CR) 10Vgutierrez: sre.loadbalancer: Add migrate-service-ipip cookbook (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [11:10:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P73538 and previous config saved to /var/cache/conftool/dbconfig/20250225-111039-root.json [11:13:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:14:30] (03CR) 10Effie Mouzeli: [C:03+2] shellbox-media: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116838 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:14:46] (03PS1) 10Hnowlan: trafficserver: roll restbaseless citoid out to group0 wikis [puppet] - 10https://gerrit.wikimedia.org/r/1122542 (https://phabricator.wikimedia.org/T361576) [11:15:38] (03Merged) 10jenkins-bot: shellbox-media: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116838 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:15:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:16:16] (03CR) 10Volans: [C:03+1] "Great, LGTM." [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [11:16:41] (03PS1) 10Cathal Mooney: Rename text interface state values returned by GNMI to ints [puppet] - 10https://gerrit.wikimedia.org/r/1122543 (https://phabricator.wikimedia.org/T372457) [11:17:39] (03CR) 10Elukey: "I may miss something related to the containerd migration, but in theory this recipe is not needed. 
We have currently this layout:" [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene) [11:18:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1035', diff saved to https://phabricator.wikimedia.org/P73539 and previous config saved to /var/cache/conftool/dbconfig/20250225-111805-marostegui.json [11:18:24] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1035.eqiad.wmnet [11:18:28] (03CR) 10Vgutierrez: [C:03+2] sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [11:20:13] (03PS1) 10Marostegui: Revert^2 "x1: Change format to STATEMENT" [puppet] - 10https://gerrit.wikimedia.org/r/1122544 [11:20:42] RESOLVED: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:21:42] (03PS1) 10Elukey: WIP geo-maps: deprioritize eqiad to depool traffic from it [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) [11:22:02] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [11:22:16] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [11:22:21] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [11:22:54] (03PS1) 10Vgutierrez: prometheus: Collect MSS metrics every minute [puppet] - 10https://gerrit.wikimedia.org/r/1122546 [11:22:57] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [11:23:14] (03PS2) 10Vgutierrez: prometheus: Collect MSS metrics every minute [puppet] - 10https://gerrit.wikimedia.org/r/1122546 [11:23:46] (03PS2) 10Elukey: WIP geo-maps: deprioritize eqiad to 
depool traffic from it [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) [11:24:25] (03CR) 10Vgutierrez: alerts: add alert for ferm_mss_cfg Prometheus metric (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [11:24:43] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122546 (owner: 10Vgutierrez) [11:24:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73540 and previous config saved to /var/cache/conftool/dbconfig/20250225-112447-root.json [11:25:34] (03Merged) 10jenkins-bot: sre.loadbalancer: Add migrate-service-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1122152 (https://phabricator.wikimedia.org/T373020) (owner: 10Vgutierrez) [11:25:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73541 and previous config saved to /var/cache/conftool/dbconfig/20250225-112545-root.json [11:25:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1035.eqiad.wmnet [11:26:37] (03PS1) 10Effie Mouzeli: shellbox-timeline: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122547 (https://phabricator.wikimedia.org/T377038) [11:29:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on es1035.eqiad.wmnet with reason: maintenance [11:29:58] (03CR) 10Fabfur: [C:03+1] "Ok to me, using "minutely" is much more readable than old cron syntax btw" [puppet] - 10https://gerrit.wikimedia.org/r/1122546 (owner: 10Vgutierrez) [11:30:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73542 and previous config saved to 
/var/cache/conftool/dbconfig/20250225-113029-root.json [11:30:56] !log jiji@deploy2002 Started scap sync-world: T386006 - use pcre2 backport in php8.1 images [11:31:00] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [11:31:34] (03CR) 10Vgutierrez: [C:03+2] prometheus: Collect MSS metrics every minute [puppet] - 10https://gerrit.wikimedia.org/r/1122546 (owner: 10Vgutierrez) [11:31:35] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database satwiktionary (T386634) [11:31:39] T386634: [wikireplicas] Create views for new wiki satwiktionary - https://phabricator.wikimedia.org/T386634 [11:32:11] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database sylwiki (T386467) [11:32:15] T386467: [wikireplicas] Create views for new wiki sylwiki - https://phabricator.wikimedia.org/T386467 [11:32:18] (03CR) 10Marostegui: [C:03+2] Revert^2 "x1: Change format to STATEMENT" [puppet] - 10https://gerrit.wikimedia.org/r/1122544 (owner: 10Marostegui) [11:32:23] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database sylwiki (T386467) [11:35:29] (03CR) 10Effie Mouzeli: "self +2ing this, as it is similar to If2296418565caa0ad58f4dd612d009c44ad4dd07" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122547 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:35:40] (03CR) 10Effie Mouzeli: [C:03+2] shellbox-timeline: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122547 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:36:08] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [11:36:11] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [11:37:20] (03Merged) 10jenkins-bot: shellbox-timeline: 1 replica on 8.1 for each DC [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1122547 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:37:34] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [11:37:38] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [11:39:04] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [11:39:28] !log Deploy schema change on x1 db1179 eqiad dbmaint T385645 [11:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:31] T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645 [11:39:41] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [11:41:35] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [11:41:52] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [11:43:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:43:54] !jouncebot next [11:43:54] a Python reminder bot for deployments. 
see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [11:44:02] jouncebot: next [11:44:02] In 1 hour(s) and 15 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1300) [11:45:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73543 and previous config saved to /var/cache/conftool/dbconfig/20250225-114534-root.json [11:45:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1179.eqiad.wmnet with reason: maintenance [11:51:52] (03PS2) 10Hnowlan: trafficserver: roll restbaseless citoid out to group0 wikis [puppet] - 10https://gerrit.wikimedia.org/r/1122542 (https://phabricator.wikimedia.org/T361576) [11:55:27] !log jiji@deploy2002 Finished scap sync-world: T386006 - use pcre2 backport in php8.1 images (duration: 25m 34s) [11:55:31] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [12:00:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73544 and previous config saved to /var/cache/conftool/dbconfig/20250225-120040-root.json [12:03:45] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 07Unstewarded-production-error, 07Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007#10578627 (10Yann) I got again this error while trying to upload a big PNG file.... [12:04:31] (03PS1) 10Fabfur: workaround for T256098 [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) [12:06:28] Lucas_WMDE do you think we can deploy 1120152 on the next window (14utc?) 
[12:09:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1179', diff saved to https://phabricator.wikimedia.org/P73545 and previous config saved to /var/cache/conftool/dbconfig/20250225-120953-marostegui.json [12:14:00] awight_ sorry about this morning, I had a call conf [12:15:12] (03CR) 10Arnaudb: [C:03+1] sre.gitlab.upgrade: add a prompt before backups on replica [cookbooks] - 10https://gerrit.wikimedia.org/r/1122520 (owner: 10Jelto) [12:15:31] (03PS1) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) [12:15:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73546 and previous config saved to /var/cache/conftool/dbconfig/20250225-121545-root.json [12:16:27] (03PS1) 10Slyngshede: Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - 10https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) [12:17:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [12:17:30] (03PS2) 10Slyngshede: Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - 10https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) [12:18:32] (03PS1) 10Clément Goubert: mediawiki: CronJob name as Job label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122563 (https://phabricator.wikimedia.org/T385709) [12:20:12] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10578641 (10cmooney) >>! 
In T385217#10572967, @cmooney wrote: > DC-Ops folks Nokia reccomend trying to interrupt the grub bootlo... [12:20:18] (03PS1) 10Hnowlan: mw-jobrunner: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122565 [12:20:43] (03PS2) 10Clément Goubert: mediawiki: CronJob name as Job label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122563 (https://phabricator.wikimedia.org/T385709) [12:30:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73547 and previous config saved to /var/cache/conftool/dbconfig/20250225-123050-root.json [12:37:09] (03Abandoned) 10Andrew Bogott: cloud-vps: increase # of attempts with dns resolving [puppet] - 10https://gerrit.wikimedia.org/r/1105945 (https://phabricator.wikimedia.org/T374830) (owner: 10Andrew Bogott) [12:37:54] (03CR) 10Ladsgroup: [C:03+1] db-production.php: Disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122533 (https://phabricator.wikimedia.org/T376905) (owner: 10Marostegui) [12:41:30] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 07Unstewarded-production-error, 07Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007#10578688 (10MatthewVernon) I'm not seeing any elevated errors from swift today. 
[12:48:20] jouncebot: next [12:48:21] In 0 hour(s) and 11 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1300) [12:48:38] (03CR) 10Marostegui: [C:03+2] db-production.php: Disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122533 (https://phabricator.wikimedia.org/T376905) (owner: 10Marostegui) [12:49:26] (03Merged) 10jenkins-bot: db-production.php: Disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122533 (https://phabricator.wikimedia.org/T376905) (owner: 10Marostegui) [12:50:13] !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1122533|db-production.php: Disable writes on es6 (T376905)]] [12:50:58] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1122568 (https://phabricator.wikimedia.org/T387211) [12:51:02] (03PS1) 10Gerrit maintenance bot: wmnet: Update es6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1122569 (https://phabricator.wikimedia.org/T387211) [12:52:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es6 T387211 [12:52:41] T387211: Switchover es6 master (es2035 -> es2037) - https://phabricator.wikimedia.org/T387211 [12:53:06] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1122533|db-production.php: Disable writes on es6 (T376905)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:53:18] !log marostegui@deploy2002 marostegui: Continuing with sync [12:56:39] (03PS3) 10Elukey: WIP geo-maps: deprioritize eqiad to depool traffic from it [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) [12:58:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73548 and previous config saved to 
/var/cache/conftool/dbconfig/20250225-125813-root.json [12:58:33] !log elukey@puppetserver1001 conftool action : set/weight=5; selector: name=maps1006.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [12:58:46] !log elukey@puppetserver1001 conftool action : set/weight=5; selector: name=maps2006.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [12:59:19] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=maps1006.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [12:59:20] (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122570 [12:59:26] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=maps2006.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [12:59:49] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1006.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [13:00:04] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps1006.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1300) [13:00:11] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [13:00:14] !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122533|db-production.php: Disable writes on es6 (T376905)]] (duration: 10m 01s) [13:00:49] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [13:01:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2037 with weight 0 T387211', diff saved to https://phabricator.wikimedia.org/P73549 and previous config saved to 
/var/cache/conftool/dbconfig/20250225-130138-root.json [13:01:42] T387211: Switchover es6 master (es2035 -> es2037) - https://phabricator.wikimedia.org/T387211 [13:02:22] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es2037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1122568 (https://phabricator.wikimedia.org/T387211) (owner: 10Gerrit maintenance bot) [13:03:21] !log Starting es6 codfw failover from es2035 to es2037 - T387211 [13:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:47] (03PS2) 10Jelto: Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) [13:03:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2037 to es6 primary T387211', diff saved to https://phabricator.wikimedia.org/P73550 and previous config saved to /var/cache/conftool/dbconfig/20250225-130348-root.json [13:04:25] (03CR) 10Marostegui: [C:03+2] wmnet: Update es6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1122569 (https://phabricator.wikimedia.org/T387211) (owner: 10Gerrit maintenance bot) [13:04:38] !log marostegui@dns1006 START - running authdns-update [13:06:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2035 T387211', diff saved to https://phabricator.wikimedia.org/P73551 and previous config saved to /var/cache/conftool/dbconfig/20250225-130619-root.json [13:06:35] !log marostegui@dns1006 END - running authdns-update [13:06:58] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10578780 (10Gnoeee) >>! In T383210#10575269, @ssingh wrote: > @Gnoeee: This has been rolled out and should now be live. Please feel free to re-open this task if there are any issues. Thank yo... 
[13:07:00] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es2035.codfw.wmnet [13:07:52] (03CR) 10Marostegui: [C:03+2] Revert "db-production.php: Disable writes on es6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122570 (owner: 10Marostegui) [13:08:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73552 and previous config saved to /var/cache/conftool/dbconfig/20250225-130821-root.json [13:08:25] (03PS3) 10Slyngshede: Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - 10https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) [13:08:46] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122570 (owner: 10Marostegui) [13:09:22] !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1122570|Revert "db-production.php: Disable writes on es6"]] [13:09:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10578798 (10Andrew) >>! In T386810#10577268, @VRiley-WMF wrote: > Unracked and removed the following servers. However, the script has failed and returning... 
[13:11:04] (03CR) 10Mvolz: [C:03+1] trafficserver: roll restbaseless citoid out to group0 wikis [puppet] - 10https://gerrit.wikimedia.org/r/1122542 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [13:13:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73553 and previous config saved to /var/cache/conftool/dbconfig/20250225-131318-root.json [13:13:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2035.codfw.wmnet [13:15:44] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1122570|Revert "db-production.php: Disable writes on es6"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:15:54] !log marostegui@deploy2002 marostegui: Continuing with sync [13:17:21] (03CR) 10David Caro: "LGTM, though I'm quite unfamiliar with gnmic so might want input from someone else too. The syntax looks ok (matching the other `event-str" [puppet] - 10https://gerrit.wikimedia.org/r/1122543 (https://phabricator.wikimedia.org/T372457) (owner: 10Cathal Mooney) [13:22:45] !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122570|Revert "db-production.php: Disable writes on es6"]] (duration: 13m 23s) [13:23:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73554 and previous config saved to /var/cache/conftool/dbconfig/20250225-132326-root.json [13:28:01] (03PS1) 10Vgutierrez: lvs_realserver: Wait at least for two consecutive MSS errors [alerts] - 10https://gerrit.wikimedia.org/r/1122575 [13:28:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73555 and previous config saved to /var/cache/conftool/dbconfig/20250225-132823-root.json [13:30:04] (03CR) 10Vgutierrez: "I'd rather be consistent 
and include FB ranges and friends" [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [13:32:51] (03PS3) 10Jelto: Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) [13:33:44] (03PS1) 10Giuseppe Lavagetto: When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) [13:33:54] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1236 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1122579 (https://phabricator.wikimedia.org/T387216) [13:35:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1236 with weight 0 T387216', diff saved to https://phabricator.wikimedia.org/P73556 and previous config saved to /var/cache/conftool/dbconfig/20250225-133500-marostegui.json [13:35:04] T387216: Switchover s7 master (db1181 -> db1236) - https://phabricator.wikimedia.org/T387216 [13:35:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T387216 [13:35:20] (03CR) 10JMeybohm: mediawiki: introduce feature flags (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [13:36:40] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1236 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1122579 (https://phabricator.wikimedia.org/T387216) (owner: 10Gerrit maintenance bot) [13:37:22] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:32] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db1158 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73557 and previous config saved to /var/cache/conftool/dbconfig/20250225-133831-root.json [13:39:31] (03PS1) 10Effie Mouzeli: trafficserver: re-enable cookie-enrolled traffic to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122584 (https://phabricator.wikimedia.org/T383845) [13:39:34] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:59] (03PS1) 10Effie Mouzeli: Re-enable cookie-based enrollment in 8.1 at 50% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) [13:41:24] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122242 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [13:42:32] !log Starting s7 eqiad failover from db1181 to db1236 - T387216 [13:42:32] (03PS2) 10Effie Mouzeli: trafficserver: re-enable cookie-enrolled traffic to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122584 (https://phabricator.wikimedia.org/T383845) [13:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:35] T387216: Switchover s7 master (db1181 -> db1236) - https://phabricator.wikimedia.org/T387216 [13:42:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1236 to s7 primary T387216', diff saved to https://phabricator.wikimedia.org/P73558 and previous config saved to /var/cache/conftool/dbconfig/20250225-134256-marostegui.json [13:43:21] (03CR) 10MVernon: [C:03+1] restbase: upgrade cluster to 'dev' (Cassandra 4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122242 (https://phabricator.wikimedia.org/T386969) 
(owner: 10Eevans) [13:43:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73559 and previous config saved to /var/cache/conftool/dbconfig/20250225-134328-root.json [13:43:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1181 T387216', diff saved to https://phabricator.wikimedia.org/P73560 and previous config saved to /var/cache/conftool/dbconfig/20250225-134349-marostegui.json [13:44:20] (03PS4) 10Jelto: Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) [13:45:32] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1181.eqiad.wmnet [13:45:58] 06SRE, 06Infrastructure-Foundations, 10netops: gNMIc connection not working for cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T387018#10578920 (10cmooney) >>! In T387018#10574426, @ayounsi wrote: > Enabling traceoptions shows a `no shared cipher` error on the switch : > ` > Feb 24 09:33:58 ssl_transp... 
[13:46:30] (03CR) 10Eevans: [C:03+2] restbase: upgrade cluster to 'dev' (Cassandra 4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122242 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [13:48:43] (03CR) 10JMeybohm: [C:03+1] mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto) [13:48:52] (03CR) 10JMeybohm: Add a mediawiki-common release to mw-script (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [13:51:15] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [13:52:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1181.eqiad.wmnet [13:52:55] (03PS1) 10Effie Mouzeli: mw-(api-int|jobrunner|parsoid): resume php8.1 rollout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) [13:53:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73561 and previous config saved to /var/cache/conftool/dbconfig/20250225-135336-root.json [13:57:53] (03CR) 10Ayounsi: [C:03+1] Rename text interface state values returned by GNMI to ints [puppet] - 10https://gerrit.wikimedia.org/r/1122543 (https://phabricator.wikimedia.org/T372457) (owner: 10Cathal Mooney) [13:58:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73562 and previous config saved to /var/cache/conftool/dbconfig/20250225-135834-root.json [13:59:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Rebuild index [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. 
It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1400). [14:00:05] LD: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:37] (y) [14:01:55] (03CR) 10Kamila Součková: [C:03+1] mediawiki: CronJob name as Job label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122563 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [14:03:50] (03CR) 10Cathal Mooney: [C:03+2] Rename text interface state values returned by GNMI to ints [puppet] - 10https://gerrit.wikimedia.org/r/1122543 (https://phabricator.wikimedia.org/T372457) (owner: 10Cathal Mooney) [14:04:19] (03CR) 10Ayounsi: [C:03+1] Rename YAML var "evpn_bgp" to "switch_ibgp" [homer/public] - 10https://gerrit.wikimedia.org/r/1122208 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:06:45] (03CR) 10Ayounsi: [C:04-1] Rename YAML var "evpn_bgp" to "switch_ibgp" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1122208 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:07:45] !log drop module_deps table in all of s5 (T385997) [14:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:50] T385997: Drop module_deps table in WMF prod - https://phabricator.wikimedia.org/T385997 [14:08:04] (03PS1) 10Filippo Giunchedi: prometheus: add 90pct to envoy recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1122589 (https://phabricator.wikimedia.org/T385693) [14:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73563 and previous config saved to /var/cache/conftool/dbconfig/20250225-140841-root.json [14:08:42] Hi LD! 
I'm not a deployer and therefore can't deploy your patch, but I'm here if you need any help. [14:08:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73564 and previous config saved to /var/cache/conftool/dbconfig/20250225-140856-root.json [14:09:25] (03PS1) 10Brouberol: airflow-analytics-product: migrate the scheduler and the DB to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122591 (https://phabricator.wikimedia.org/T380623) [14:10:30] Ayo Daimona, you prolly might want to review the commit once more, the last patchset undid the review, I just resolved the merge conflict lmao [14:10:36] (03PS1) 10Brouberol: airflow-analytics-product: disable and remove the airflow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1122592 (https://phabricator.wikimedia.org/T380623) [14:11:12] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: add 90pct to envoy recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1122589 (https://phabricator.wikimedia.org/T385693) (owner: 10Filippo Giunchedi) [14:11:25] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4978/co" [puppet] - 10https://gerrit.wikimedia.org/r/1122592 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [14:11:57] (03CR) 10Daimona Eaytoy: [C:03+1] frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [14:12:09] Of course, +1ed [14:12:17] (03Abandoned) 10Ssingh: geo-maps: put eqiad at lowest priority for T380858 [dns] - 10https://gerrit.wikimedia.org/r/1113205 (https://phabricator.wikimedia.org/T380858) (owner: 10Ssingh) [14:12:35] thanks [14:12:53] (03PS2) 10Brouberol: airflow-analytics-product: migrate the scheduler and the DB to Kubernetes [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1122591 (https://phabricator.wikimedia.org/T380623) [14:14:39] (03CR) 10Filippo Giunchedi: "LGTM, see inline though" [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [14:14:49] (03CR) 10Ssingh: [C:03+1] lvs_realserver: Wait at least for two consecutive MSS errors [alerts] - 10https://gerrit.wikimedia.org/r/1122575 (owner: 10Vgutierrez) [14:14:57] (03CR) 10Kamila Součková: [C:03+1] mw-jobrunner: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122565 (owner: 10Hnowlan) [14:14:59] The question is though: are there any deployers around? [14:15:24] I don't think so, unfortunately [14:15:27] (03CR) 10Ssingh: "+1 ^" [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [14:15:30] (03CR) 10Fabfur: [C:03+1] lvs_realserver: Wait at least for two consecutive MSS errors [alerts] - 10https://gerrit.wikimedia.org/r/1122575 (owner: 10Vgutierrez) [14:15:54] (03CR) 10Vgutierrez: [C:03+2] lvs_realserver: Wait at least for two consecutive MSS errors [alerts] - 10https://gerrit.wikimedia.org/r/1122575 (owner: 10Vgutierrez) [14:16:03] (03CR) 10Lucas Werkmeister (WMDE): "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [14:16:50] LD, Daimona: I can deploy if nobody else pops up [14:17:47] Thank you Kamila, that would be very much appreciated! 
[14:21:19] ok, on it :-) [14:21:29] thanks :) [14:23:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kamila@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [14:24:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73565 and previous config saved to /var/cache/conftool/dbconfig/20250225-142402-root.json [14:24:25] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10579051 (10fgiunchedi) >>! In T384731#10578273, @JMeybohm wrote: >>>! In T384731#10566953, @fgiunchedi wrote: >>>>... [14:24:31] FIRING: [2x] Emergency syslog message: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [14:24:37] (03Merged) 10jenkins-bot: frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [14:25:04] !log kamila@deploy2002 Started scap sync-world: Backport for [[gerrit:1120152|frwiki: Enable the CampaignEvents extension (T386622)]] [14:25:36] (03PS4) 10Elukey: WIP geo-maps: deprioritize eqiad to depool traffic from it [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) [14:26:45] (03CR) 10Elukey: "should be fixed :)" [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [14:29:31] RESOLVED: [2x] Emergency syslog message: Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [14:31:23] kamila_ thanks, looks like it's merged, I'm still a bit confused tho, when 
does it go live? [14:31:25] !log kamila@deploy2002 kamila, wpld: Backport for [[gerrit:1120152|frwiki: Enable the CampaignEvents extension (T386622)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:31:33] (03PS1) 10Vgutierrez: site,hiera: Reimage lvs7007 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1122597 (https://phabricator.wikimedia.org/T384477) [14:31:37] nvm it's still pending [14:31:39] (03PS3) 10Cathal Mooney: Rename YAML var "evpn_bgp" to "switch_ibgp" [homer/public] - 10https://gerrit.wikimedia.org/r/1122208 (https://phabricator.wikimedia.org/T371088) [14:31:53] (03PS2) 10Vgutierrez: site,hiera: Reimage lvs7003 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1122597 (https://phabricator.wikimedia.org/T384477) [14:31:57] LD: just got deployed to mwdebug, but I think you can't test this there, is that correct? [14:31:58] LD: it's been synced to the test servers so you can test with XWD [14:32:03] (if you can) [14:32:36] (03CR) 10Cathal Mooney: "Thanks, I set it back the way it was in latest patchset. 
That template is still evpn-specific so it should keep that name for now, may refac" [homer/public] - 10https://gerrit.wikimedia.org/r/1122208 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:32:40] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122597 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:32:59] (03CR) 10Cathal Mooney: Rename YAML var "evpn_bgp" to "switch_ibgp" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1122208 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:33:22] well, as it's frwiki config I thought no test was needed, cannot test with XWD anyway [14:33:29] !log kamila@deploy2002 kamila, wpld: Continuing with sync [14:33:44] LD: ok, continuing with full deployment then [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:42] (03CR) 10Ssingh: WIP geo-maps: deprioritize eqiad to depool traffic from it (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [14:37:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1036', diff saved to https://phabricator.wikimedia.org/P73566 and previous config saved to /var/cache/conftool/dbconfig/20250225-143749-root.json [14:37:59] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1036.eqiad.wmnet [14:39:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73567 and previous config saved to /var/cache/conftool/dbconfig/20250225-143908-root.json [14:40:42] !log kamila@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120152|frwiki: Enable the CampaignEvents 
extension (T386622)]] (duration: 15m 38s) [14:40:46] T386622: Release CampaignEvents extension to French Wikipedia - https://phabricator.wikimedia.org/T386622 [14:40:58] (03PS7) 10Herron: aux-k8s-ctrl codfw: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) [14:40:59] LD: and done [14:41:14] thanks, LGTM [14:41:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2219', diff saved to https://phabricator.wikimedia.org/P73568 and previous config saved to /var/cache/conftool/dbconfig/20250225-144137-root.json [14:41:46] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2219.codfw.wmnet [14:42:19] \o/ [14:42:20] (03PS5) 10Elukey: WIP geo-maps: deprioritize eqiad to depool traffic from it [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) [14:42:39] (03CR) 10Elukey: WIP geo-maps: deprioritize eqiad to depool traffic from it (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [14:42:46] (03PS6) 10Elukey: WIP geo-maps: deprioritize eqiad to depool traffic from it [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) [14:43:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1186', diff saved to https://phabricator.wikimedia.org/P73569 and previous config saved to /var/cache/conftool/dbconfig/20250225-144341-root.json [14:43:49] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1186.eqiad.wmnet [14:44:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1036.eqiad.wmnet [14:45:59] (03PS8) 10Herron: aux-k8s-ctrl codfw: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) [14:46:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2219.codfw.wmnet [14:47:02] (03CR) 10Ssingh: [C:03+1] 
"Looks good! Verified based on magru and existing ulsfo config." [puppet] - 10https://gerrit.wikimedia.org/r/1122597 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:47:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73570 and previous config saved to /var/cache/conftool/dbconfig/20250225-144722-root.json [14:47:53] (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs7003 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1122597 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:48:16] (03CR) 10Ssingh: WIP geo-maps: deprioritize eqiad to depool traffic from it (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [14:48:43] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10579181 (10ayounsi) >> And what happens if peer_descr is missing or empty ? > good question, in that case the inst... 
[14:49:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73571 and previous config saved to /var/cache/conftool/dbconfig/20250225-144945-root.json [14:49:54] (03CR) 10Federico Ceratto: [C:03+2] pool.py: Add basic typing to allow mypy checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1122099 (https://phabricator.wikimedia.org/T383760) (owner: 10Federico Ceratto) [14:50:28] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] pool.py: Add basic typing to allow mypy checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1122099 (https://phabricator.wikimedia.org/T383760) (owner: 10Federico Ceratto) [14:50:35] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs7003.magru.wmnet with OS bookworm [14:51:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1186.eqiad.wmnet [14:51:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1186.eqiad.wmnet with reason: Index rebuild [14:53:21] jouncebot: nowandnext [14:53:21] For the next 0 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1400) [14:53:21] In 1 hour(s) and 6 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1600) [14:54:02] (03CR) 10Hnowlan: [C:03+2] mw-jobrunner: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122565 (owner: 10Hnowlan) [14:54:07] (03PS1) 10Fabfur: benthos: fix schema name [puppet] - 10https://gerrit.wikimedia.org/r/1122603 (https://phabricator.wikimedia.org/T329332) [14:54:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73572 and previous config saved to /var/cache/conftool/dbconfig/20250225-145413-root.json [14:54:33] PROBLEM - BGP status on 
asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:07] (03Merged) 10jenkins-bot: mw-jobrunner: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122565 (owner: 10Hnowlan) [14:57:20] (03CR) 10Giuseppe Lavagetto: [C:03+1] benthos: fix schema name [puppet] - 10https://gerrit.wikimedia.org/r/1122603 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:58:19] (03CR) 10Fabfur: [C:03+2] benthos: fix schema name [puppet] - 10https://gerrit.wikimedia.org/r/1122603 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:58:48] (03CR) 10Federico Ceratto: clone.py: Add helper functions for later use (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [14:58:54] (03PS1) 10Scott French: php8.1: rebuild to pick up newer php8.1 packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122604 (https://phabricator.wikimedia.org/T386006) [15:02:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73573 and previous config saved to /var/cache/conftool/dbconfig/20250225-150228-root.json [15:03:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:03:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:04:10] (03CR) 10Ayounsi: [C:03+1] "lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1122208 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [15:04:27] (03PS1) 10Elukey: services: Increase capacity and specs for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122605 (https://phabricator.wikimedia.org/T386926) [15:04:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73574 and previous config saved to /var/cache/conftool/dbconfig/20250225-150450-root.json [15:06:11] (03CR) 10Clément Goubert: "F" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122605 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:45] (03CR) 10CI reject: [V:04-1] services: Increase capacity and specs for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122605 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [15:08:18] (03CR) 10Marostegui: [C:03+1] "I think this looks good, it is hard to test this without a real wiki creation. 
But for the next one (we had one last night, what a pity) w" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [15:08:25] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [15:08:31] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [15:08:48] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [15:09:01] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [15:09:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73575 and previous config saved to /var/cache/conftool/dbconfig/20250225-150919-root.json [15:11:48] (03PS1) 10Giuseppe Lavagetto: mwscript: do not run mesh checks when running in a loop [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) [15:12:48] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7003.magru.wmnet with reason: host reimage [15:14:35] (03PS2) 10Elukey: services: Increase capacity and specs for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122605 (https://phabricator.wikimedia.org/T386926) [15:15:08] (03CR) 10Elukey: services: Increase capacity and specs for Kartotherian (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122605 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [15:15:22] (03PS62) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) [15:15:42] (03CR) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [15:16:37] !log 
vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7003.magru.wmnet with reason: host reimage [15:16:56] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 07Unstewarded-production-error, 07Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007#10579276 (10MatthewVernon) I looked for the first of those two files (`sudo cum... [15:16:57] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [15:17:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73576 and previous config saved to /var/cache/conftool/dbconfig/20250225-151733-root.json [15:18:02] (03PS9) 10Herron: aux-k8s-ctrl codfw: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) [15:18:02] (03CR) 10Herron: [V:03+1] "Thx for the review! 
Please see a few replies inline" [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [15:18:49] (03PS63) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) [15:19:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73577 and previous config saved to /var/cache/conftool/dbconfig/20250225-151956-root.json [15:21:25] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1122607 (https://phabricator.wikimedia.org/T387224) [15:21:29] (03PS1) 10Gerrit maintenance bot: wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1122608 (https://phabricator.wikimedia.org/T387224) [15:21:41] (03PS1) 10Marostegui: db-production.php: Disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122609 (https://phabricator.wikimedia.org/T387224) [15:21:42] FIRING: JobUnavailable: Reduced availability for job liberica in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:23:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:25:32] brouberol: ^^ is this something DPE - SRE could help with, or who usually responds to cert expirations? 
[15:26:36] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:15] (03CR) 10Giuseppe Lavagetto: When executing cli scripts, wait for the service mesh (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [15:27:28] (03PS2) 10Giuseppe Lavagetto: When executing cli scripts, wait for the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122578 (https://phabricator.wikimedia.org/T387208) [15:27:38] (03PS1) 10Volans: setup.py: revert conftool dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122610 [15:28:51] (03CR) 10Scott French: [C:03+1] "Thanks for catching this!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122610 (owner: 10Volans) [15:29:38] (03CR) 10Volans: [C:03+2] setup.py: revert conftool dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122610 (owner: 10Volans) [15:31:04] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2013'] [15:31:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup2013'] [15:31:42] RESOLVED: JobUnavailable: Reduced availability for job liberica in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:32:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73578 and previous config saved to /var/cache/conftool/dbconfig/20250225-153239-root.json [15:33:36] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:34:06] that's expected (lvs7003 being reimaged) [15:34:10] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 07Unstewarded-production-error, 07Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007#10579425 (10Yann) It worked after I disabled https://commons.wikimedia.org/w/in... [15:34:36] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:35:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73579 and previous config saved to /var/cache/conftool/dbconfig/20250225-153501-root.json [15:35:06] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 07Unstewarded-production-error, 07Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007#10579426 (10Yann) >>! In T341007#10579276, @MatthewVernon wrote: > I looked for... 
[15:37:12] (03CR) 10Federico Ceratto: [V:03+2] sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [15:37:35] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [15:40:12] (03Merged) 10jenkins-bot: setup.py: revert conftool dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122610 (owner: 10Volans) [15:47:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73580 and previous config saved to /var/cache/conftool/dbconfig/20250225-154744-root.json [15:47:56] !log reprepro include php8.1_8.1.31-1+wmf11u4 into component/php81 - T386006 [15:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:59] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [15:48:40] (03PS1) 10Volans: CHANGELOG: add changelogs for release v9.1.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122614 [15:48:56] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v9.1.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122614 (owner: 10Volans) [15:49:51] (03PS1) 10Vgutierrez: hiera: Fix NIC name for liberica@magru [puppet] - 10https://gerrit.wikimedia.org/r/1122615 (https://phabricator.wikimedia.org/T384477) [15:50:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2219 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73581 and previous config saved to /var/cache/conftool/dbconfig/20250225-155006-root.json [15:50:08] (03CR) 10Ssingh: [C:03+1] hiera: Fix NIC name for liberica@magru [puppet] - 10https://gerrit.wikimedia.org/r/1122615 (https://phabricator.wikimedia.org/T384477) 
(owner: 10Vgutierrez) [15:50:24] (03CR) 10Vgutierrez: [C:03+2] hiera: Fix NIC name for liberica@magru [puppet] - 10https://gerrit.wikimedia.org/r/1122615 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:50:33] (03CR) 10Ladsgroup: [C:03+1] db-production.php: Disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122609 (https://phabricator.wikimedia.org/T387224) (owner: 10Marostegui) [15:50:50] jouncebot: next [15:50:50] In 0 hour(s) and 9 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1600) [15:51:14] (03CR) 10Marostegui: [C:03+2] db-production.php: Disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122609 (https://phabricator.wikimedia.org/T387224) (owner: 10Marostegui) [15:51:31] !log reprepro include php-apcu_5.1.23-1+wmf11u4 into component/php81 - T386006 [15:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T387224 [15:51:42] T387224: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T387224 [15:51:55] (03Merged) 10jenkins-bot: db-production.php: Disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122609 (https://phabricator.wikimedia.org/T387224) (owner: 10Marostegui) [15:52:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2039 with weight 0 T387224', diff saved to https://phabricator.wikimedia.org/P73582 and previous config saved to /var/cache/conftool/dbconfig/20250225-155229-marostegui.json [15:53:04] !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1122609|db-production.php: Disable writes on es7 (T387224)]] [15:56:56] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs7003.magru.wmnet with OS bookworm 
[15:57:21] (03CR) 10Vgutierrez: [C:03+1] "let's get this one merged cause it's becoming more important now that we are migrating low traffic services to IPIP encapsulation. Nice jo" [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [15:57:43] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1122609|db-production.php: Disable writes on es7 (T387224)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:57:47] T387224: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T387224 [15:57:47] !log marostegui@deploy2002 marostegui: Continuing with sync [15:58:03] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10579615 (10Jhancock.wm) neither server will pxe. pxe is set, config on switches is correct. neither nic will come up. could be firmware issue again? roped papaul in this via irc [15:58:30] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [15:58:39] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v9.1.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1122614 (owner: 10Volans) [16:00:05] jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1600). 
[16:00:13] (03PS1) 10Volans: Upstream release v9.1.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1122616 [16:00:27] (03CR) 10Volans: [C:03+2] Upstream release v9.1.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1122616 (owner: 10Volans) [16:00:34] LD: sorry, I was busy and didn’t look at IRC at all today [16:00:38] thanks kamila_ for deploying \o/ [16:00:54] sure :-) [16:02:01] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10579629 (10Andrew) @MoritzMuehlenhoff ping, is ganeti1044 ready to be moved? [16:04:13] !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122609|db-production.php: Disable writes on es7 (T387224)]] (duration: 11m 09s) [16:04:17] T387224: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T387224 [16:05:14] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es2039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1122607 (https://phabricator.wikimedia.org/T387224) (owner: 10Gerrit maintenance bot) [16:06:30] !log Starting es7 codfw failover from es2038 to es2039 - T387224 [16:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2039 to es7 primary T387224', diff saved to https://phabricator.wikimedia.org/P73583 and previous config saved to /var/cache/conftool/dbconfig/20250225-160659-marostegui.json [16:09:01] (03PS1) 10ZhaoFJx: cowikimedia: Change the logo v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122622 (https://phabricator.wikimedia.org/T386872) [16:09:40] !log set bgp to true on lvs6002 - T380469 [16:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:44] T380469: eqiad/esams/drmrs LVS: use Netbox BGP flag - https://phabricator.wikimedia.org/T380469 [16:10:23] (03Merged) 
10jenkins-bot: Upstream release v9.1.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1122616 (owner: 10Volans) [16:10:52] (03PS1) 10Vgutierrez: hiera: Reimage lvs7002 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1122623 (https://phabricator.wikimedia.org/T384477) [16:10:56] (03PS1) 10Vgutierrez: hiera: Reimage lvs7001 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) [16:11:23] (03CR) 10Hnowlan: [C:03+2] trafficserver: roll restbaseless citoid out to group0 wikis [puppet] - 10https://gerrit.wikimedia.org/r/1122542 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [16:11:39] XioNoX: hmmm that's interesting.. we don't have that feature on liberica (turning off BGP entirely) [16:11:52] XioNoX: is that something that we could need? [16:12:24] vgutierrez: you tell me :) [16:12:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T387224 [16:12:33] T387224: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T387224 [16:13:09] vgutierrez: I don't think it's needed, or we've needed that on pybal [16:13:16] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122623 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:13:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:13:34] (03PS1) 10Marostegui: Revert "mariadb: Promote es2039 to es7 master" [puppet] - 10https://gerrit.wikimedia.org/r/1122625 [16:14:13] XioNoX: ohhh I misread your last !log entry... 
I was thinking of bgp: true|false pybal config setting [16:14:35] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 07Unstewarded-production-error, 07Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007#10579684 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon >>! In... [16:14:44] !log hashar@deploy2002 Started deploy [integration/docroot@50f623d]: build: make Phan stricter [16:14:50] (03CR) 10Marostegui: [C:03+2] Revert "mariadb: Promote es2039 to es7 master" [puppet] - 10https://gerrit.wikimedia.org/r/1122625 (owner: 10Marostegui) [16:14:54] !log hashar@deploy2002 Finished deploy [integration/docroot@50f623d]: build: make Phan stricter (duration: 00m 10s) [16:16:41] !log set bgp to true on lvs6001 - T380469 [16:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:45] T380469: eqiad/esams/drmrs LVS: use Netbox BGP flag - https://phabricator.wikimedia.org/T380469 [16:16:53] !log uploaded spicerack_9.1.3 to apt.wikimedia.org bullseye-wikimedia [16:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:02] !log route citoid via rest-gateway (and not restbase) for most group0 wikis [16:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:08] !log set bgp to true on lvs6003 - T380469 [16:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2038 with weight 0 T387224', diff saved to https://phabricator.wikimedia.org/P73584 and previous config saved to /var/cache/conftool/dbconfig/20250225-161823-marostegui.json [16:18:27] T387224: Switchover es7 master (es2038 -> es2039) - https://phabricator.wikimedia.org/T387224 [16:20:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2038 to es7 primary T387224', diff saved to https://phabricator.wikimedia.org/P73586 and previous 
config saved to /var/cache/conftool/dbconfig/20250225-162001-marostegui.json [16:21:12] (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122626 [16:21:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122279 (https://phabricator.wikimedia.org/T386464) (owner: 10Pppery) [16:21:29] !log set bgp to true on esams LVS - T380469 [16:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:24] (03CR) 10Marostegui: [C:03+2] Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122626 (owner: 10Marostegui) [16:23:18] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122626 (owner: 10Marostegui) [16:23:52] !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1122626|Revert "db-production.php: Disable writes on es7"]] [16:26:39] (03CR) 10Scott French: [C:03+1] "Thank you, effie! I can merge and deploy this during my day today." 
[puppet] - 10https://gerrit.wikimedia.org/r/1122584 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [16:29:00] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1122626|Revert "db-production.php: Disable writes on es7"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:29:39] !log marostegui@deploy2002 marostegui: Continuing with sync [16:30:30] !log set bgp to true on eqiad LVS - T380469 [16:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:36] T380469: eqiad/esams/drmrs LVS: use Netbox BGP flag - https://phabricator.wikimedia.org/T380469 [16:30:43] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:31:05] (03CR) 10Ssingh: [C:03+1] "Looks good, checked asw1-b4-magru gateway." [puppet] - 10https://gerrit.wikimedia.org/r/1122623 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:31:24] (03CR) 10Scott French: "Thanks, effie! 
Agreed that we should be able to get back to where we were quite quickly, since the risk of surprises from the PCRE2 upgrad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: 10Effie Mouzeli) [16:33:46] (03PS1) 10Elukey: profile::dns::auth::discovery-map: prefer codfw over eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122627 (https://phabricator.wikimedia.org/T380858) [16:33:55] (03CR) 10Ssingh: [C:03+1] hiera: Reimage lvs7001 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:34:32] (03CR) 10Vgutierrez: [C:04-2] "do not merge till 2025-02-26" [puppet] - 10https://gerrit.wikimedia.org/r/1122623 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:34:37] (03CR) 10Vgutierrez: [C:04-2] "do not merge till 2025-02-26" [puppet] - 10https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:34:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10579751 (10phaultfinder) [16:34:57] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:35:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10579756 (10cmooney) Myself and Jenn went on a call with Brooke, Saju and some of the other Nokia technical folks. They couldn'... 
[16:35:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2039', diff saved to https://phabricator.wikimedia.org/P73587 and previous config saved to /var/cache/conftool/dbconfig/20250225-163543-marostegui.json [16:35:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73588 and previous config saved to /var/cache/conftool/dbconfig/20250225-163556-root.json [16:36:13] !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122626|Revert "db-production.php: Disable writes on es7"]] (duration: 12m 20s) [16:36:20] (03CR) 10Scott French: [C:03+1] "Thank you, effie!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [16:39:57] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:40:20] (03CR) 10Clément Goubert: [C:03+1] services: Increase capacity and specs for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122605 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [16:40:43] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:43:02] (03PS2) 10Fabfur: workaround for T256098 [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) [16:43:05] (03CR) 10Fabfur: workaround for T256098 (031 comment) [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [16:45:26] (03CR) 10Ssingh: [C:03+1] "Looks good. To be extra sure, we can of course quickly verify the intended output once eqiad is depooled and this patch is merged in." 
[puppet] - 10https://gerrit.wikimedia.org/r/1122627 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [16:45:59] !log volans@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on sretest1001.eqiad.wmnet with reason: test [16:47:15] (03CR) 10Effie Mouzeli: [C:03+1] "nits: I would suggest that this deserves a version of 8.1.34-1-s2 and mention the pcre2 version we have built against, given we are using " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122604 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:48:41] (03PS6) 10Jelto: Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) [16:48:47] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.099e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [16:53:38] (03CR) 10Vgutierrez: workaround for T256098 (031 comment) [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [16:54:07] ^^ is that expected? [16:54:45] (03CR) 10Jelto: "one question in-line to @mmuhlenhoff@wikimedia.org regarding packages from backports." [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [16:56:11] (03PS3) 10Fabfur: workaround for T256098 [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) [16:58:00] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [17:00:04] jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1700). 
[17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:54] (03CR) 10Vgutierrez: [C:03+1] workaround for T256098 [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [17:01:48] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [17:02:31] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for backup1013 - jclark@cumin1002" [17:02:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for backup1013 - jclark@cumin1002" [17:02:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:03:10] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [17:03:36] (03Abandoned) 10Elukey: WIP geo-maps: deprioritize eqiad to depool traffic from it [dns] - 10https://gerrit.wikimedia.org/r/1122545 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [17:03:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host backup1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:04:03] !log dzahn@cumin1002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on vrts2002.codfw.wmnet with reason: znuny upgrade [17:04:09] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on vrts2002.codfw.wmnet with reason: znuny upgrade [17:05:00] (03CR) 10Scott French: [C:03+1] "Thanks, Hugh! These numbers look good, projecting from the last ~ week of usage." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [17:09:21] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for backup1014 - jclark@cumin1002" [17:09:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for backup1014 - jclark@cumin1002" [17:09:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:10:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host backup1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:11:35] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on vrts1003.eqiad.wmnet with reason: znuny upgrade [17:14:19] (03PS1) 10Ssingh: wikimedia-dns.org: add test TYPE65 record [dns] - 10https://gerrit.wikimedia.org/r/1122630 [17:14:21] !log temp disabling puppet on cp4050 to test benthos configuration (T329332) [17:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:15:58] (03PS2) 10Scott French: php8.1: rebuild to pick up newer php8.1 packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122604 (https://phabricator.wikimedia.org/T386006) [17:16:13] (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: add test TYPE65 record [dns] - 10https://gerrit.wikimedia.org/r/1122630 (owner: 
10Ssingh) [17:16:22] !log sukhe@dns1004 START - running authdns-update [17:16:59] (03PS3) 10Scott French: php8.1: rebuild to pick up newer php8.1 packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122604 (https://phabricator.wikimedia.org/T386006) [17:17:19] (03CR) 10Scott French: "Thanks for the review!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122604 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [17:18:15] (03CR) 10Effie Mouzeli: [C:03+1] mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [17:18:17] !log sukhe@dns1004 END - running authdns-update [17:18:42] (03CR) 10Scott French: [V:03+2] "Verified the expected packages are installed when building locally." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122604 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [17:18:57] jouncebot: nowandnext [17:18:57] For the next 0 hour(s) and 41 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1700) [17:18:58] In 0 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1800) [17:19:04] (03PS1) 10Ssingh: Revert "wikimedia-dns.org: add test TYPE65 record" [dns] - 10https://gerrit.wikimedia.org/r/1122631 [17:19:59] since it does not appear that there are any puppet patches for today, I'd like to deploy in order to pick up a newer php 8.1 base image [17:20:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:20:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:22:01] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: rebuild to pick up newer php8.1 packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122604 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [17:22:39] (03CR) 10Ssingh: [C:03+2] Revert "wikimedia-dns.org: add test TYPE65 record" [dns] - 10https://gerrit.wikimedia.org/r/1122631 (owner: 10Ssingh) [17:22:46] !log sukhe@dns1004 START - running authdns-update [17:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:24:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:24:30] ottomata: TBH I'm not 100% sure. 
I'm out until tomorrow atm [17:25:11] !log sukhe@dns1004 END - running authdns-update [17:26:38] (03PS1) 10Bernard Wang: Deploy Search AB test to french wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122633 [17:27:51] !log built 8.1.34-1-s2 php8.1 production images - T386006 [17:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:55] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [17:29:23] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1013.eqiad.wmnet with OS bookworm [17:29:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm [17:29:36] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10579902 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1013.eqiad.wmnet with OS bookworm [17:29:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10579904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm [17:30:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10579912 (10phaultfinder) [17:32:44] !log swfrench@deploy2002 Started scap sync-world: Use php packages built against pcre2 backport - T386006 [17:37:29] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [17:37:34] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [17:40:23] (03PS1) 10Ssingh: wikimedia-dns.org: add test TYPE65 record (take two) [dns] - 10https://gerrit.wikimedia.org/r/1122635 [17:41:50] (03PS1) 
10Elukey: kserve-inference: remove the need for the kserve container's securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) [17:42:38] (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: add test TYPE65 record (take two) [dns] - 10https://gerrit.wikimedia.org/r/1122635 (owner: 10Ssingh) [17:42:42] !log sukhe@dns1004 START - running authdns-update [17:43:17] (03CR) 10CI reject: [V:04-1] kserve-inference: remove the need for the kserve container's securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [17:44:39] !log sukhe@dns1004 END - running authdns-update [17:45:27] (03CR) 10JMeybohm: Build helm3.17 with new upstream version (031 comment) [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [17:47:03] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1013.eqiad.wmnet with reason: host reimage [17:47:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73590 and previous config saved to /var/cache/conftool/dbconfig/20250225-174737-root.json [17:48:03] jouncebot: nowandnex [17:48:04] jouncebot: nowandnext [17:48:05] For the next 0 hour(s) and 11 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1700) [17:48:05] In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1800) [17:48:48] (03PS1) 10DLynch: DiscussionTools: enable thanking comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122638 (https://phabricator.wikimedia.org/T366095) [17:49:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:49:22] (03PS3) 10Ladsgroup: Remove special-casing of CentralAuth for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118561 (https://phabricator.wikimedia.org/T161859) [17:49:32] (03PS4) 10Ladsgroup: Remove special-casing of CentralAuth for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118561 (https://phabricator.wikimedia.org/T161859) [17:50:04] (03CR) 10Ladsgroup: [C:03+2] Remove special-casing of CentralAuth for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118561 (https://phabricator.wikimedia.org/T161859) (owner: 10Ladsgroup) [17:50:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1013.eqiad.wmnet with reason: host reimage [17:50:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118561 (https://phabricator.wikimedia.org/T161859) (owner: 10Ladsgroup) [17:50:42] Amir1: I have a deployment in progress, but you should be good to go once it completes (prod update in flight) [17:50:45] (03Merged) 10jenkins-bot: Remove special-casing of CentralAuth for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118561 (https://phabricator.wikimedia.org/T161859) (owner: 10Ladsgroup) [17:50:57] no worries [17:51:08] i.e., once you hit the locked part, your backport will stop :) [17:51:19] (until mine completes, which should be soon) [17:51:26] no worries [17:53:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:54:18] (03PS1) 10Ssingh: Revert "wikimedia-dns.org: add test 
TYPE65 record (take two)" [dns] - 10https://gerrit.wikimedia.org/r/1122639 [17:56:16] (03CR) 10Ssingh: [C:03+2] Revert "wikimedia-dns.org: add test TYPE65 record (take two)" [dns] - 10https://gerrit.wikimedia.org/r/1122639 (owner: 10Ssingh) [17:56:27] !log sukhe@dns1004 START - running authdns-update [17:56:46] !log swfrench@deploy2002 Finished scap sync-world: Use php packages built against pcre2 backport - T386006 (duration: 26m 35s) [17:56:51] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [17:57:10] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1118561|Remove special-casing of CentralAuth for labswiki (T161859)]] [17:57:14] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 [17:58:17] (03CR) 10Dzahn: [C:03+1] "lgtm - per https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html" [puppet] - 10https://gerrit.wikimedia.org/r/1112011 (https://phabricator.wikimedia.org/T323754) (owner: 10Hashar) [17:58:24] !log sukhe@dns1004 END - running authdns-update [17:59:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 9.375% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:00:05] swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1800). 
[18:00:06] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1118561|Remove special-casing of CentralAuth for labswiki (T161859)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:00:20] o/ [18:00:27] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [18:00:50] I'm done with my deployment, and thus will not need to use the infra window [18:02:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73591 and previous config saved to /var/cache/conftool/dbconfig/20250225-180242-root.json [18:05:41] (03CR) 10Ladsgroup: [C:03+2] Allow users to sign up on Wikitech (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T377074) (owner: 10Majavah) [18:06:26] (03Merged) 10jenkins-bot: Allow users to sign up on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077048 (https://phabricator.wikimedia.org/T377074) (owner: 10Majavah) [18:07:10] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118561|Remove special-casing of CentralAuth for labswiki (T161859)]] (duration: 09m 59s) [18:07:14] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 [18:08:05] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1077048|Allow users to sign up on Wikitech (T377074)]] [18:08:08] T377074: Re-enable account creation on Wikitech - https://phabricator.wikimedia.org/T377074 [18:11:49] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:12:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:12:08] !log jclark@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host backup1013.eqiad.wmnet with OS bookworm [18:12:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580058 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1013.eqiad.wmnet with OS bookworm completed: - backup1013 (**PASS... [18:13:48] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580063 (10Jclark-ctr) [18:14:32] !log ladsgroup@deploy2002 ladsgroup, taavi: Backport for [[gerrit:1077048|Allow users to sign up on Wikitech (T377074)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:14:36] T377074: Re-enable account creation on Wikitech - https://phabricator.wikimedia.org/T377074 [18:15:33] !log ladsgroup@deploy2002 ladsgroup, taavi: Continuing with sync [18:17:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73592 and previous config saved to /var/cache/conftool/dbconfig/20250225-181747-root.json [18:19:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10580080 (10phaultfinder) [18:19:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10580081 (10cmooney) Ok all devices are back online and reachable via SSH, all running SR Linux v24.7.2. Tomorrow I'll try to f... 
[18:21:24] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm [18:21:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1... [18:22:10] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077048|Allow users to sign up on Wikitech (T377074)]] (duration: 14m 05s) [18:22:14] T377074: Re-enable account creation on Wikitech - https://phabricator.wikimedia.org/T377074 [18:22:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580102 (10Jclark-ctr) @papaul i am having issues with backup1014 failing grub install on sdb do you have any recommendations [18:22:41] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm [18:22:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm [18:23:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:29:30] (03CR) 10CDobbins: [C:03+2] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [18:30:43] 
(03Merged) 10jenkins-bot: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [18:31:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73593 and previous config saved to /var/cache/conftool/dbconfig/20250225-183134-root.json [18:32:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73594 and previous config saved to /var/cache/conftool/dbconfig/20250225-183252-root.json [18:36:50] !log re-enabled puppet on cp4050 (T329332) [18:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:31] (03PS1) 10Fabfur: benthos: fix header capitalization and stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1122644 (https://phabricator.wikimedia.org/T329332) [18:46:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73595 and previous config saved to /var/cache/conftool/dbconfig/20250225-184640-root.json [18:47:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73596 and previous config saved to /var/cache/conftool/dbconfig/20250225-184758-root.json [18:48:45] (03PS2) 10Fabfur: benthos: fix header capitalization and stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1122644 (https://phabricator.wikimedia.org/T329332) [18:52:02] (03CR) 10Ssingh: [C:03+1] benthos: fix header capitalization and stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1122644 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [18:55:16] (03CR) 10Fabfur: [C:03+2] benthos: fix header 
capitalization and stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1122644 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [19:00:05] dduvall and andre: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T1900). [19:01:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73597 and previous config saved to /var/cache/conftool/dbconfig/20250225-190145-root.json [19:06:41] (03PS1) 10Ssingh: Revert^2 "wikimedia-dns.org: add test TYPE65 record (take two)" [dns] - 10https://gerrit.wikimedia.org/r/1122645 [19:07:20] PROBLEM - Host cp4047 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:33] huh [19:08:05] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=--reason,service=(cdn|ats-be) [19:08:06] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=host down,service=(cdn|ats-be) [19:08:09] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=--reason,service=(cdn|ats-be) [19:08:10] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=host down,service=(cdn|ats-be) [19:08:11] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4047.ulsfo.wmnet,service=(cdn|ats-be) [19:08:55] (03CR) 10Ssingh: [C:03+2] Revert^2 "wikimedia-dns.org: add test TYPE65 record (take two)" [dns] - 10https://gerrit.wikimedia.org/r/1122645 (owner: 10Ssingh) [19:09:02] !log sukhe@dns1004 START - running authdns-update [19:09:20] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [19:10:16] RECOVERY - Host cp4047 is UP: PING OK - Packet loss = 0%, RTA = 71.25 ms [19:11:00] !log sukhe@dns1004 END - 
running authdns-update [19:16:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73598 and previous config saved to /var/cache/conftool/dbconfig/20250225-191650-root.json [19:18:15] 10ops-ulsfo, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238 (10ssingh) 03NEW [19:20:17] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm [19:20:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580336 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1... [19:20:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm [19:20:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580337 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm [19:31:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73599 and previous config saved to /var/cache/conftool/dbconfig/20250225-193155-root.json [19:35:27] (03CR) 10BCornwall: [C:03+1] wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1122608 (https://phabricator.wikimedia.org/T387224) (owner: 10Gerrit maintenance bot) [19:36:51] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122647 (https://phabricator.wikimedia.org/T382369) [19:36:53] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.18 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1122647 (https://phabricator.wikimedia.org/T382369) (owner: 10TrainBranchBot) [19:37:35] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122647 (https://phabricator.wikimedia.org/T382369) (owner: 10TrainBranchBot) [19:38:10] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm [19:38:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1... [19:38:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm [19:38:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580369 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm [19:41:03] (03PS1) 10Ssingh: wikimedia-dns.org: add test TYPE65 record (take three, in proper format) [dns] - 10https://gerrit.wikimedia.org/r/1122649 [19:42:45] (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: add test TYPE65 record (take three, in proper format) [dns] - 10https://gerrit.wikimedia.org/r/1122649 (owner: 10Ssingh) [19:42:53] !log sukhe@dns1004 START - running authdns-update [19:44:53] !log sukhe@dns1004 END - running authdns-update [19:50:32] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.18 refs T382369 [19:50:36] T382369: 1.44.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T382369 [19:52:25] (03PS1) 10Ssingh: Revert "wikimedia-dns.org: add test TYPE65 record (take three, in proper format)" [dns] - 10https://gerrit.wikimedia.org/r/1122651 
[19:54:29] (03CR) 10Ssingh: [C:03+2] Revert "wikimedia-dns.org: add test TYPE65 record (take three, in proper format)" [dns] - 10https://gerrit.wikimedia.org/r/1122651 (owner: 10Ssingh) [19:54:44] !log sukhe@dns1004 START - running authdns-update [19:55:00] !log sukhe@dns1004 START - running authdns-update [19:55:40] !log sukhe@dns1004 START - running authdns-update [19:55:49] (03CR) 10Eevans: [C:03+2] ml-cache: upgrade cluster to 'dev' (Cassandra 4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122243 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [19:56:32] (03PS1) 10Ssingh: wikimedia-dns.org: remove TYPE65 record [dns] - 10https://gerrit.wikimedia.org/r/1122653 [19:57:37] !log sukhe@dns1004 END - running authdns-update [19:58:27] (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: remove TYPE65 record [dns] - 10https://gerrit.wikimedia.org/r/1122653 (owner: 10Ssingh) [19:59:02] !log sukhe@dns1004 START - running authdns-update [19:59:33] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [20:01:01] !log sukhe@dns1004 END - running authdns-update [20:04:50] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [20:15:41] (03PS1) 10Scott French: Re-enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) [20:15:41] (03CR) 10Scott French: "Thanks for prepping the other patches!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [20:17:18] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [20:19:16] (03CR) 10Scott French: "Idb67a57b5541af9c4584d5ea6e1b9fec661ac432 proposes to start with 5%, which would then be followed soon after by this one. As noted there, " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: 10Effie Mouzeli) [20:23:04] (03CR) 10Scott French: [C:03+1] "Actually, I'm going to hold until shortly before I move the enrollment fraction forward again, given the likely presence of broken clients" [puppet] - 10https://gerrit.wikimedia.org/r/1122584 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [20:23:07] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [20:25:20] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [20:25:26] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [20:30:44] (03PS1) 10Bking: elastic: enable perf governor, remove unused host hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1122660 (https://phabricator.wikimedia.org/T386860) [20:31:02] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122660 (https://phabricator.wikimedia.org/T386860) (owner: 10Bking) [20:31:10] (03CR) 10CI reject: [V:04-1] elastic: enable perf governor, remove unused host hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1122660 (https://phabricator.wikimedia.org/T386860) (owner: 10Bking) [20:34:44] (03PS1) 10Ladsgroup: Remove more wikitech 
specific stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122662 [20:41:20] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [20:42:27] (03PS2) 10Bking: elastic: enable perf governor, remove unused host hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1122660 (https://phabricator.wikimedia.org/T386860) [20:42:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122660 (https://phabricator.wikimedia.org/T386860) (owner: 10Bking) [20:44:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10580556 (10phaultfinder) [20:48:03] (03PS3) 10Bking: elastic: enable perf governor, remove unused host hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1122660 (https://phabricator.wikimedia.org/T386860) [20:48:37] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122660 (https://phabricator.wikimedia.org/T386860) (owner: 10Bking) [20:50:57] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm [20:51:08] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1... [20:52:09] (03CR) 10Ryan Kemper: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1122660 (https://phabricator.wikimedia.org/T386860) (owner: 10Bking) [20:53:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:53:28] (03PS4) 10Bking: elastic: enable perf governor, remove unused host hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1122660 (https://phabricator.wikimedia.org/T386860) [20:57:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122622 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [20:59:45] (03PS1) 10RLazarus: deployment_server: Pass kubeConfig in helmfile state values [puppet] - 10https://gerrit.wikimedia.org/r/1122666 (https://phabricator.wikimedia.org/T378429) [20:59:48] (03CR) 10Bking: [C:03+2] elastic: enable perf governor, remove unused host hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1122660 (https://phabricator.wikimedia.org/T386860) (owner: 10Bking) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T2100). [21:00:05] Pppery and ZhaoFJx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:09] here [21:00:19] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm [21:00:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm [21:01:11] here [21:02:46] (03PS2) 10Ladsgroup: Remove more wikitech specific stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122662 [21:04:51] (03CR) 10Ladsgroup: [C:03+2] Add various settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122279 (https://phabricator.wikimedia.org/T386464) (owner: 10Pppery) [21:05:40] (03Merged) 10jenkins-bot: Add various settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122279 (https://phabricator.wikimedia.org/T386464) (owner: 10Pppery) [21:06:23] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1122279|Add various settings for new wikis (T386464 T386631)]] [21:06:28] T386464: Post-creation work for sylwiki - https://phabricator.wikimedia.org/T386464 [21:06:29] T386631: Post-creation work for satwiktionary - https://phabricator.wikimedia.org/T386631 [21:08:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.274s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:10:30] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122668 [21:11:13] !log ladsgroup@deploy2002 pppery, ladsgroup: Backport for [[gerrit:1122279|Add various settings for new wikis (T386464 
T386631)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:11:19] looking [21:11:27] Thanks [21:13:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.085s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:14:41] Checked a few things, seems to work [21:14:43] so proceed [21:14:46] thanks [21:14:48] !log ladsgroup@deploy2002 pppery, ladsgroup: Continuing with sync [21:16:16] (03CR) 10Simon04: "I'd like to learn more about this secret. 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [21:16:38] (03CR) 10Fabfur: workaround for T256098 (031 comment) [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [21:19:59] (03CR) 10Scott French: [C:03+1] "Good find and thanks for the cleanup! I totally overlooked the `helmfile` invocation when reviewing Icd8437d6a68d928c04abe1b8ed23bbc95a59d" [puppet] - 10https://gerrit.wikimedia.org/r/1122666 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [21:21:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.225s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:21:21] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122279|Add various settings for new wikis (T386464 T386631)]] (duration: 14m 58s) [21:21:26] T386464: Post-creation work for sylwiki - https://phabricator.wikimedia.org/T386464 [21:21:27] T386631: Post-creation work for satwiktionary - https://phabricator.wikimedia.org/T386631 [21:22:31] Could a deployer take a look on patch 1122622? 
Thanks in advance :) [21:23:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:23:48] ZhaoFJx: about to do that [21:24:04] thanks a lot [21:24:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122622 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [21:25:07] (03Merged) 10jenkins-bot: cowikimedia: Change the logo v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122622 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [21:25:38] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1122622|cowikimedia: Change the logo v2 (T386872)]] [21:25:42] T386872: Requesting logo change for co.wikimedia.org - https://phabricator.wikimedia.org/T386872 [21:25:56] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10580766 (10VRiley-WMF) After trying to rerun it again, I keep getti{F58493864}ng this error (screenshot attached) @cmooney would you have an idea what m... 
[21:26:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.092s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:27:43] (03CR) 10RLazarus: [C:03+2] deployment_server: Pass kubeConfig in helmfile state values [puppet] - 10https://gerrit.wikimedia.org/r/1122666 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [21:28:33] !log ladsgroup@deploy2002 ladsgroup, zhaofjx: Backport for [[gerrit:1122622|cowikimedia: Change the logo v2 (T386872)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:28:47] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm [21:28:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580781 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1... 
[21:28:56] checking [21:29:15] Amir1all good [21:29:20] Amir1 all good [21:29:20] !log upgraded spicerack on the cumin hosts to v9.1.3 [21:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:21] thanks [21:30:23] !log ladsgroup@deploy2002 ladsgroup, zhaofjx: Continuing with sync [21:31:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.133s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:36:50] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122622|cowikimedia: Change the logo v2 (T386872)]] (duration: 11m 12s) [21:36:54] T386872: Requesting logo change for co.wikimedia.org - https://phabricator.wikimedia.org/T386872 [21:40:01] (03PS1) 10Kimberly Sarabia: Add config for donate banner to be enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) [21:41:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.125s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:41:31] Amir1 thank you for deployment [21:42:21] (03PS2) 10Kimberly Sarabia: Add config for donate banner to be enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) [21:43:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:46:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.093s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:46:40] FIRING: 
KubernetesRsyslogDown: rsyslog on wikikube-worker1115:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1115 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:49:10] :) [21:50:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122662 (owner: 10Ladsgroup) [21:51:21] (03Merged) 10jenkins-bot: Remove more wikitech specific stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122662 (owner: 10Ladsgroup) [21:51:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1115:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1115 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:51:51] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1122662|Remove more wikitech specific stuff]] [21:54:05] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10580861 (10bking) Unfortunately, I just now remembered that the Performance governor on... [21:56:20] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1122662|Remove more wikitech specific stuff]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:56:22] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [21:59:57] Amir1 sorry to bother you again, but could you help purge the caches for the logo on cowikimedia?
[21:59:58] Since it looks like https://co.wikimedia.org/static/images/project-logos/cowikimedia.png is still displaying the old version, even though it looks fine on the testserver or when I add ?purge after the URL [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250225T2200) [22:00:22] We will be using the deploy window today! [22:00:25] ZhaoFJx: is that the only URL? [22:00:36] toyofuku: we are almost done, give us a sec [22:01:07] Sounds good [22:01:11] Amir1 not sure, I only know it's a file under /static [22:01:26] there is a guide on https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging [22:02:53] ZhaoFJx: done, I also did it with the mobile domain, just in case [22:02:57] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122662|Remove more wikitech specific stuff]] (duration: 11m 06s) [22:03:09] also my wikitech stuff is now deployed [22:04:23] Amir1 still the old image on my side somehow [22:04:27] ZhaoFJx: i think your local machine might be caching the old version - i had to force-refresh the logo file you linked before it updated to the new version for me [22:04:44] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [22:04:47] Yeah ^ [22:04:48] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [22:04:54] ctrl + shift + r [22:05:58] (03CR) 10Jdlrobson: [C:04-1] Add config for donate banner to be enabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) (owner: 10Kimberly Sarabia) [22:07:09] It doesn't work sadly...
But if the image works fine on you two's end, then that's a problem with my PC I guess [22:07:38] (03CR) 10Jdrewniak: [C:03+1] Deploy Search AB test to french wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122633 (owner: 10Bernard Wang) [22:09:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.159s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:09:34] Amir1: am I still waiting? [22:12:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm [22:12:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580899 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm [22:13:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:14:07] Ordinarily I would wait for an explicit handoff, but since we're a bit pressed for time and I don't see an in-progress deploy, we're gonna get started [22:14:10] yolo [22:14:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.167s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:15:23] toyofuku: oh I'm so sorry, after 12 hours of work, I forgot to hand over [22:15:33] No worries at all!!! [22:15:40] and went for dinner [22:15:45] Please go rest if you can 12 hours of work sounds like approx 4 too many [22:16:01] 12 too many if it were up to me 😪 [22:16:41] yeah, ttyl! [22:16:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122254 (https://phabricator.wikimedia.org/T386735) (owner: 10Jdlrobson) [22:16:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122633 (owner: 10Bernard Wang) [22:17:40] (03Merged) 10jenkins-bot: Deploy Search AB test to french wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122633 (owner: 10Bernard Wang) [22:19:18] While we're waiting for `gate-and-submit-wmf` - I'm listening to EoO by Bad Bunny off his latest album [22:23:24] Now Chimbita by Feid off Inter Shibuya [22:25:45] Crush ft Jorja Smith by AJ Tracey [22:26:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 992.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:27:08] (03PS1) 10Ryan Kemper: wdqs: Create DNS entry for one full graph host [dns] - 10https://gerrit.wikimedia.org/r/1122676 (https://phabricator.wikimedia.org/T384422) [22:28:22] (03Merged) 10jenkins-bot: Update 
ext.MobileFrontend.searchOverlay.empty hook to fire after ext.MobileFrontend.searchOverlay.open [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1122254 (https://phabricator.wikimedia.org/T386735) (owner: 10Jdlrobson) [22:28:52] !log toyofuku@deploy2002 Started scap sync-world: Backport for [[gerrit:1122254|Update ext.MobileFrontend.searchOverlay.empty hook to fire after ext.MobileFrontend.searchOverlay.open (T386735)]], [[gerrit:1122633|Deploy Search AB test to french wiki]] [22:28:56] T386735: Show empty search recommendation event is missing funnel data - https://phabricator.wikimedia.org/T386735 [22:28:58] Perfect timing [22:29:56] (03PS2) 10Ryan Kemper: wdqs: add routing for legacy full graph host [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) [22:31:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.213s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:31:50] !log toyofuku@deploy2002 bwang, toyofuku, jdlrobson: Backport for [[gerrit:1122254|Update ext.MobileFrontend.searchOverlay.empty hook to fire after ext.MobileFrontend.searchOverlay.open (T386735)]], [[gerrit:1122633|Deploy Search AB test to french wiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:32:00] coordinating testing via slack - brb [22:35:43] continuing to hold [22:36:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.101s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:38:36] (03PS1) 10Bernard Wang: Deploy Search AB test to french wiki including eventstreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122677 [22:41:09] (03PS1) 10Ryan Kemper: wdqs: create 
new ui for wdqs legacy full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) [22:41:20] We caught something on test servers (shoutout test servers!!) [22:41:35] Will likely be proceeding with the deploy, followed by another deploy to fix what we caught [22:41:39] {◕ ◡ ◕} [22:41:49] Will keep the void updated as possible [22:42:33] While we wait, I'm listening to this really weird song: https://open.spotify.com/track/0MxPT9xJ89g4j0IleXXWwY [22:42:40] Wouldn't say I necessarily recommend it but it's cute [22:42:57] (03PS2) 10Ryan Kemper: wdqs: create new ui for wdqs legacy full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) [22:43:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.491s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:43:31] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1013 - https://phabricator.wikimedia.org/T387252 (10ops-monitoring-bot) 03NEW [22:43:32] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1013.eqiad.wmnet with OS bookworm [22:43:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10580965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1013.eqiad.wmnet with OS bookworm [22:44:06] (03CR) 10Ryan Kemper: "I *think* this is all that's required to set up a new UI, although this change feels a little too easy so there very well could be somethi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [22:44:16] (03CR) 10Bking: [C:03+1] "LGTM, probably want someone from serviceops-collab to confirm though." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [22:45:35] We're proceeding [22:45:38] !log toyofuku@deploy2002 bwang, toyofuku, jdlrobson: Continuing with sync [22:46:14] This song also by ay3demi is kind of a bop: https://open.spotify.com/track/515UNMgW9krZGvvVnQ8XuD [22:47:39] (03CR) 10Jdrewniak: [C:03+1] Deploy Search AB test to french wiki including eventstreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122677 (owner: 10Bernard Wang) [22:50:53] (03PS1) 10Bernard Wang: Deploy Search AB test to everywhere but English wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122680 [22:51:33] As I mentioned, we'll likely be doing another deploy after this one finishes [22:52:02] Hopefully that's okay since nothing appears to be scheduled after this, but yell at me if it's not pls [22:52:19] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122254|Update ext.MobileFrontend.searchOverlay.empty hook to fire after ext.MobileFrontend.searchOverlay.open (T386735)]], [[gerrit:1122633|Deploy Search AB test to french wiki]] (duration: 23m 26s) [22:52:23] T386735: Show empty search recommendation event is missing funnel data - https://phabricator.wikimedia.org/T386735 [22:52:52] (03PS2) 10Bernard Wang: Deploy Search AB test to everywhere but English wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122680 (https://phabricator.wikimedia.org/T386849) [22:54:00] First deploy done, second deploy starting soon [22:54:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122677 (owner: 10Bernard Wang) [22:55:05] Second deploy starting NOW [22:55:38] (03Merged) 10jenkins-bot: Deploy Search AB test to french wiki including eventstreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122677 (owner: 10Bernard Wang) [22:56:05] !log toyofuku@deploy2002 
Started scap sync-world: Backport for [[gerrit:1122677|Deploy Search AB test to french wiki including eventstreams]] [22:59:00] !log toyofuku@deploy2002 toyofuku, bwang: Backport for [[gerrit:1122677|Deploy Search AB test to french wiki including eventstreams]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:59:39] (03CR) 10Bking: [C:03+1] wdqs: Create DNS entry for one full graph host [dns] - 10https://gerrit.wikimedia.org/r/1122676 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [22:59:47] Once again coordinating testing via slack [22:59:58] (03CR) 10Bking: [C:03+1] wdqs: add routing for legacy full graph host [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [23:00:02] will keep all zero of you updated [23:00:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1013.eqiad.wmnet with reason: host reimage [23:01:17] (03CR) 10Bking: wdqs: add routing for legacy full graph host [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [23:03:21] (03CR) 10Bking: wdqs: add routing for legacy full graph host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [23:03:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1013.eqiad.wmnet with reason: host reimage [23:04:42] We're still in the middle of a deploy and still testing on test servers, coordinated via slack [23:06:00] (03CR) 10Ryan Kemper: [C:04-1] "Putting a -1 until I/we figure out the cert provisioning" [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [23:08:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.256s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:08:33] toyofuku: thanks for keeping channel up-to-date [23:08:56] 🫡🫡 [23:10:35] We're proceeding! [23:10:38] !log toyofuku@deploy2002 toyofuku, bwang: Continuing with sync [23:13:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.364s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:14:25] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10581045 (10Jclark-ctr) those are racked in d4 ,f5 should not have any problems with pow... [23:14:44] (03PS1) 10Bartosz Dziewoński: Remove unused config variable $wgJsonConfigInterwikiPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122683 [23:15:28] (03PS1) 10Bernard Wang: Deploy Search AB test to everywhere but English wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122684 [23:16:23] (03Abandoned) 10Bernard Wang: Deploy Search AB test to everywhere but English wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122680 (https://phabricator.wikimedia.org/T386849) (owner: 10Bernard Wang) [23:16:34] (03PS2) 10Bernard Wang: Deploy Search AB test to everywhere but English wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122684 (https://phabricator.wikimedia.org/T386849) [23:16:50] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122677|Deploy Search AB test to french wiki including eventstreams]] (duration: 20m 44s) [23:17:17] Apologies for running a bit over, but we should be done now! 
[23:17:19] Thanks all [23:18:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 933.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:21:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1013.eqiad.wmnet with OS bookworm [23:21:16] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1013.eqiad.wmnet with OS bookworm completed: - backup1013 (**WARN... [23:22:10] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581071 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1... 
[23:22:27] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm [23:22:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581072 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm [23:26:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.11s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:27:18] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [23:27:22] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [23:31:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.11s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:32:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.087s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:37:41] 
ACKNOWLEDGEMENT - MD RAID on ms-be2088 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387257 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:37:45] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-be2088 - https://phabricator.wikimedia.org/T387257 (10ops-monitoring-bot) 03NEW [23:41:31] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.311s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:46:31] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.063s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:47:00] (03PS1) 10Cwhite: site: clean up logstash102[6789] configs [puppet] - 10https://gerrit.wikimedia.org/r/1122691 (https://phabricator.wikimedia.org/T383287) [23:47:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.063s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:49:33] !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash1029.eqiad.wmnet [23:52:12] !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash1028.eqiad.wmnet [23:53:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1151:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1151 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:53:54] !log cwhite@cumin2002 
START - Cookbook sre.hosts.decommission for hosts logstash1027.eqiad.wmnet [23:54:31] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.183s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:56:17] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [23:58:45] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.229s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:59:46] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.171s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded