[00:08:26] (03PS1) 10Krinkle: deployment-prep: update shadowed "default_php_version" overrides [puppet] - 10https://gerrit.wikimedia.org/r/1155793 [00:08:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1155794 [00:08:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1155794 (owner: 10TrainBranchBot) [00:10:17] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:27:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10907163 (10BTullis) [00:27:45] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1155794 (owner: 10TrainBranchBot) [00:48:45] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/5f22ca2b2c5d17d7796a35254336bfd8da75d0a4fc5d3ecefdedd64adc58f7ce/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:08:45] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:07:17] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:18:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10907285 (10Andrew) a:05cmooney→03Jclark-ctr 
@jclark-ctr, we would like to wait until the 25G dacs come in, and then have each of these hosts reconnect... [02:18:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10907288 (10Andrew) [02:28:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10907303 (10Andrew) [02:30:08] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g - https://phabricator.wikimedia.org/T378828#10907304 (10Andrew) [02:30:38] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g - https://phabricator.wikimedia.org/T378828#10907305 (10Andrew) Note that I just updated the network plan. [02:32:16] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10907306 (10Andrew) [02:32:32] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10907307 (10Andrew) Note that we need two ports for each of these, I've just updated the task description. Does that make fitting them even harder? 
[02:57:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-magru:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:16:15] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:17:05] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:22:25] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:42] (03PS1) 10Bartosz Dziewoński: tables-catalog: Make test_validation.py actually validate when executed [puppet] - 10https://gerrit.wikimedia.org/r/1155885 [03:46:42] (03PS1) 10Bartosz Dziewoński: tables-catalog: Alphabetize, enforce this in test_validation.py [puppet] - 10https://gerrit.wikimedia.org/r/1155886 [03:46:42] (03PS1) 10Bartosz Dziewoński: tables-catalog: Add Flow tables [puppet] - 10https://gerrit.wikimedia.org/r/1155887 [03:46:42] (03PS1) 10Bartosz Dziewoński: tables-catalog: Add LiquidThreads tables [puppet] - 10https://gerrit.wikimedia.org/r/1155888 [03:46:43] (03PS1) 10Bartosz Dziewoński: tables-catalog: Add OATHAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155889 (https://phabricator.wikimedia.org/T391490) [03:46:44] (03PS1) 10Bartosz Dziewoński: tables-catalog: Add OAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155890 (https://phabricator.wikimedia.org/T391490) [03:46:48] (03PS1) 10Bartosz Dziewoński: tables-catalog: Add CentralAuth 
tables [puppet] - 10https://gerrit.wikimedia.org/r/1155891 (https://phabricator.wikimedia.org/T391490) [03:46:52] (03PS1) 10Bartosz Dziewoński: tables-catalog: Add a script to visualize it as a table [puppet] - 10https://gerrit.wikimedia.org/r/1155892 [03:49:32] (03CR) 10CI reject: [V:04-1] tables-catalog: Make test_validation.py actually validate when executed [puppet] - 10https://gerrit.wikimedia.org/r/1155885 (owner: 10Bartosz Dziewoński) [03:50:02] (03CR) 10CI reject: [V:04-1] tables-catalog: Alphabetize, enforce this in test_validation.py [puppet] - 10https://gerrit.wikimedia.org/r/1155886 (owner: 10Bartosz Dziewoński) [03:53:16] (03CR) 10CI reject: [V:04-1] tables-catalog: Add a script to visualize it as a table [puppet] - 10https://gerrit.wikimedia.org/r/1155892 (owner: 10Bartosz Dziewoński) [03:57:14] (03PS2) 10Bartosz Dziewoński: tables-catalog: Make test_validation.py actually validate when executed [puppet] - 10https://gerrit.wikimedia.org/r/1155885 [03:57:27] (03PS2) 10Bartosz Dziewoński: tables-catalog: Alphabetize, enforce this in test_validation.py [puppet] - 10https://gerrit.wikimedia.org/r/1155886 [03:57:27] (03PS2) 10Bartosz Dziewoński: tables-catalog: Add Flow tables [puppet] - 10https://gerrit.wikimedia.org/r/1155887 [03:57:27] (03PS2) 10Bartosz Dziewoński: tables-catalog: Add LiquidThreads tables [puppet] - 10https://gerrit.wikimedia.org/r/1155888 [03:57:28] (03PS2) 10Bartosz Dziewoński: tables-catalog: Add OATHAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155889 (https://phabricator.wikimedia.org/T391490) [03:57:29] (03PS2) 10Bartosz Dziewoński: tables-catalog: Add OAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155890 (https://phabricator.wikimedia.org/T391490) [03:57:30] (03PS2) 10Bartosz Dziewoński: tables-catalog: Add CentralAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155891 (https://phabricator.wikimedia.org/T391490) [03:57:33] (03PS2) 10Bartosz Dziewoński: tables-catalog: Add a script to visualize it as 
a table [puppet] - 10https://gerrit.wikimedia.org/r/1155892 [03:57:51] (03PS3) 10Bartosz Dziewoński: tables-catalog: Add Flow tables [puppet] - 10https://gerrit.wikimedia.org/r/1155887 (https://phabricator.wikimedia.org/T363581) [03:58:00] (03PS3) 10Bartosz Dziewoński: tables-catalog: Add LiquidThreads tables [puppet] - 10https://gerrit.wikimedia.org/r/1155888 [03:58:06] (03PS4) 10Bartosz Dziewoński: tables-catalog: Add LiquidThreads tables [puppet] - 10https://gerrit.wikimedia.org/r/1155888 (https://phabricator.wikimedia.org/T363581) [03:58:17] (03PS3) 10Bartosz Dziewoński: tables-catalog: Add OATHAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155889 (https://phabricator.wikimedia.org/T391490) [03:58:18] (03PS3) 10Bartosz Dziewoński: tables-catalog: Add OAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155890 (https://phabricator.wikimedia.org/T391490) [03:58:18] (03PS3) 10Bartosz Dziewoński: tables-catalog: Add CentralAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155891 (https://phabricator.wikimedia.org/T391490) [03:58:18] (03PS3) 10Bartosz Dziewoński: tables-catalog: Add a script to visualize it as a table [puppet] - 10https://gerrit.wikimedia.org/r/1155892 [04:00:38] (03CR) 10CI reject: [V:04-1] tables-catalog: Alphabetize, enforce this in test_validation.py [puppet] - 10https://gerrit.wikimedia.org/r/1155886 (owner: 10Bartosz Dziewoński) [04:02:04] (03CR) 10CI reject: [V:04-1] tables-catalog: Add OATHAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155889 (https://phabricator.wikimedia.org/T391490) (owner: 10Bartosz Dziewoński) [04:05:27] (03CR) 10CI reject: [V:04-1] tables-catalog: Add a script to visualize it as a table [puppet] - 10https://gerrit.wikimedia.org/r/1155892 (owner: 10Bartosz Dziewoński) [04:14:59] (03CR) 10Bartosz Dziewoński: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1155886 (owner: 10Bartosz Dziewoński) [04:27:00] !log ran cleanupBlocks.php on all wikis for T373847 and T389301 [04:27:04] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:05] T373847: IP addresses at dewiki that cannot be unblocked - https://phabricator.wikimedia.org/T373847 [04:27:05] T389301: Clean up duplicate block_target rows in production - https://phabricator.wikimedia.org/T389301 [04:27:08] (03PS3) 10Bartosz Dziewoński: tables-catalog: Alphabetize, enforce this in test_validation.py [puppet] - 10https://gerrit.wikimedia.org/r/1155886 [04:33:48] (03PS4) 10Bartosz Dziewoński: tables-catalog: Add Flow tables [puppet] - 10https://gerrit.wikimedia.org/r/1155887 (https://phabricator.wikimedia.org/T363581) [04:33:48] (03PS5) 10Bartosz Dziewoński: tables-catalog: Add LiquidThreads tables [puppet] - 10https://gerrit.wikimedia.org/r/1155888 (https://phabricator.wikimedia.org/T363581) [04:33:48] (03PS4) 10Bartosz Dziewoński: tables-catalog: Add OATHAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155889 (https://phabricator.wikimedia.org/T391490) [04:33:49] (03PS4) 10Bartosz Dziewoński: tables-catalog: Add OAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155890 (https://phabricator.wikimedia.org/T391490) [04:33:50] (03PS4) 10Bartosz Dziewoński: tables-catalog: Add CentralAuth tables [puppet] - 10https://gerrit.wikimedia.org/r/1155891 (https://phabricator.wikimedia.org/T391490) [04:33:51] (03PS4) 10Bartosz Dziewoński: tables-catalog: Add a script to visualize it as a table [puppet] - 10https://gerrit.wikimedia.org/r/1155892 [04:38:20] (03CR) 10CI reject: [V:04-1] tables-catalog: Add a script to visualize it as a table [puppet] - 10https://gerrit.wikimedia.org/r/1155892 (owner: 10Bartosz Dziewoński) [05:06:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:07:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:31:02] (03CR) 10Marostegui: [C:03+1] Remove obsolete Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1155629 (owner: 10Muehlenhoff) [05:33:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2239.codfw.wmnet with reason: Maintenance [05:34:34] (03PS1) 10Marostegui: db2226: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155942 (https://phabricator.wikimedia.org/T396549) [05:34:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2226', diff saved to https://phabricator.wikimedia.org/P77755 and previous config saved to /var/cache/conftool/dbconfig/20250612-053450-marostegui.json [05:35:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2226.codfw.wmnet with reason: Maintenance [05:35:30] (03CR) 10Marostegui: [C:03+2] db2226: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1155942 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui) [05:35:56] (03PS1) 10EggRoll97: Add arbcom group to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) [05:40:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77756 and previous config saved to /var/cache/conftool/dbconfig/20250612-054030-root.json [05:43:14] (03PS3) 10Anzx: enwiki: temporary lift of IP 
cap for event on 16 June 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155930 (https://phabricator.wikimedia.org/T396128) [05:43:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2026', diff saved to https://phabricator.wikimedia.org/P77757 and previous config saved to /var/cache/conftool/dbconfig/20250612-054315-marostegui.json [05:43:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155930 (https://phabricator.wikimedia.org/T396128) (owner: 10Anzx) [05:43:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2026.codfw.wmnet with reason: Maintenance [05:44:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [05:45:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10907555 (10ayounsi) p:05Medium→03High @VRiley-WMF @Jclark-ctr about https://netbox.wikimedia.org/dcim/devices/5725/ and https://netbox.wi... 
[05:50:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77758 and previous config saved to /var/cache/conftool/dbconfig/20250612-055005-root.json [05:50:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2209.codfw.wmnet with reason: Maintenance [05:51:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1184 T396697', diff saved to https://phabricator.wikimedia.org/P77759 and previous config saved to /var/cache/conftool/dbconfig/20250612-055136-marostegui.json [05:51:40] T396697: Temporarily move db1184 to m1 - https://phabricator.wikimedia.org/T396697 [05:52:20] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [05:52:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77760 and previous config saved to /var/cache/conftool/dbconfig/20250612-055237-root.json [05:53:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1206 T396697', diff saved to https://phabricator.wikimedia.org/P77761 and previous config saved to /var/cache/conftool/dbconfig/20250612-055318-marostegui.json [05:53:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool db1206', diff saved to https://phabricator.wikimedia.org/P77762 and previous config saved to /var/cache/conftool/dbconfig/20250612-055339-marostegui.json [05:54:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1207 T396697', diff saved to https://phabricator.wikimedia.org/P77763 and previous config saved to /var/cache/conftool/dbconfig/20250612-055439-marostegui.json [05:55:30] (03PS1) 10Marostegui: db1207: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1155965 (https://phabricator.wikimedia.org/T396697) [05:55:36] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'db2226 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77764 and previous config saved to /var/cache/conftool/dbconfig/20250612-055535-root.json [05:55:45] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [05:56:05] (03CR) 10Marostegui: [C:03+2] db1207: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1155965 (https://phabricator.wikimedia.org/T396697) (owner: 10Marostegui) [05:58:51] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T0600) [06:00:06] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T0600). 
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:01:49] (03CR) 10Anzx: [C:03+1] Add arbcom group to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [06:02:24] (03CR) 10Anzx: [C:03+1] Add arbcom group to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [06:05:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77765 and previous config saved to /var/cache/conftool/dbconfig/20250612-060510-root.json [06:07:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77766 and previous config saved to /var/cache/conftool/dbconfig/20250612-060743-root.json [06:07:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [06:10:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Maintenance [06:10:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77767 and previous config saved to /var/cache/conftool/dbconfig/20250612-061041-root.json [06:13:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:13:59] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy 
[06:13:59] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:14:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T396130)', diff saved to https://phabricator.wikimedia.org/P77769 and previous config saved to /var/cache/conftool/dbconfig/20250612-061405-marostegui.json [06:14:09] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:14:14] haproxy alerts are expected [06:14:58] (03PS1) 10Marostegui: instances.yaml: Remove db1207 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1155979 (https://phabricator.wikimedia.org/T396697) [06:15:29] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [06:15:50] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db1207 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1155979 (https://phabricator.wikimedia.org/T396697) (owner: 10Marostegui) [06:17:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1207 from dbctl T396697', diff saved to https://phabricator.wikimedia.org/P77770 and previous config saved to /var/cache/conftool/dbconfig/20250612-061700-marostegui.json [06:17:05] T396697: Temporarily move db1207 to m1 - https://phabricator.wikimedia.org/T396697 [06:18:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T396130)', diff saved to https://phabricator.wikimedia.org/P77771 and previous config saved to /var/cache/conftool/dbconfig/20250612-061843-marostegui.json [06:19:01] (03PS1) 10Marostegui: mariadb: Move db1207 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/1155984 (https://phabricator.wikimedia.org/T396697) [06:19:26] (03CR) 10CI reject: [V:04-1] mariadb: Move db1207 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/1155984 (https://phabricator.wikimedia.org/T396697) (owner: 10Marostegui) [06:20:17] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'es2026 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77772 and previous config saved to /var/cache/conftool/dbconfig/20250612-062016-root.json [06:20:18] (03PS2) 10Marostegui: mariadb: Move db1207 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/1155984 (https://phabricator.wikimedia.org/T396697) [06:20:57] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1207 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/1155984 (https://phabricator.wikimedia.org/T396697) (owner: 10Marostegui) [06:22:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77773 and previous config saved to /var/cache/conftool/dbconfig/20250612-062248-root.json [06:25:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77774 and previous config saved to /var/cache/conftool/dbconfig/20250612-062546-root.json [06:26:32] (03CR) 10Andriy.v: [C:03+1] Add arbcom group to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [06:33:22] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1175 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [06:33:23] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1175 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T396703 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [06:33:33] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T396703 (10ops-monitoring-bot) 03NEW [06:33:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P77775 and previous config saved to 
/var/cache/conftool/dbconfig/20250612-063350-marostegui.json [06:35:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77776 and previous config saved to /var/cache/conftool/dbconfig/20250612-063522-root.json [06:35:25] (03PS1) 10Jcrespo: dbbackups: Upgrade db1225, dbprov1004 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1156079 (https://phabricator.wikimedia.org/T395989) [06:37:13] (03CR) 10Jcrespo: [C:03+1] dbbackups: Upgrade db1225, dbprov1004 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1156079 (https://phabricator.wikimedia.org/T395989) (owner: 10Jcrespo) [06:37:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77777 and previous config saved to /var/cache/conftool/dbconfig/20250612-063755-root.json [06:41:00] (03CR) 10Ayounsi: Promote the TransitPeeringIn/OutSaturation alerts to p.aging (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1155620 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [06:42:54] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet,dbprov1004.eqiad.wmnet with reason: Downtime hosts for MariaDB 10.11 upgrade [06:43:56] (03Abandoned) 10Ayounsi: Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) (owner: 10Ayounsi) [06:46:30] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 68.00 ms [06:48:00] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:48:00] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:48:16] (03Abandoned) 10Ayounsi: Example cookbook using gNMI module 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1015335 (https://phabricator.wikimedia.org/T344325) (owner: 10Ayounsi) [06:48:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P77778 and previous config saved to /var/cache/conftool/dbconfig/20250612-064858-marostegui.json [06:50:07] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1155629 (owner: 10Muehlenhoff) [06:50:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77779 and previous config saved to /var/cache/conftool/dbconfig/20250612-065028-root.json [06:52:54] (03PS3) 10Anzx: mrwiki: add मसूदा (draft) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) [06:53:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) (owner: 10Anzx) [06:57:31] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [06:58:24] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [07:00:07] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T0700) [07:00:07] anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:27] o/ [07:01:13] jmm@cumin1003 reimage (PID 1243989) is awaiting input [07:04:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T396130)', diff saved to https://phabricator.wikimedia.org/P77780 and previous config saved to /var/cache/conftool/dbconfig/20250612-070405-marostegui.json [07:04:09] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:04:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [07:04:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T396130)', diff saved to https://phabricator.wikimedia.org/P77781 and previous config saved to /var/cache/conftool/dbconfig/20250612-070427-marostegui.json [07:04:54] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm [07:07:09] (03CR) 10Elukey: [C:03+2] "Had a chat with Marielle from Editing, we are good to go!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1155316 (https://phabricator.wikimedia.org/T395916) (owner: 10Herron) [07:07:22] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [07:08:11] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [07:09:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T396130)', diff saved to https://phabricator.wikimedia.org/P77782 and previous config saved to /var/cache/conftool/dbconfig/20250612-070914-marostegui.json [07:09:18] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:10:25] 10SRE-tools, 06Infrastructure-Foundations, 10observability: Cookbook sre.hosts.remove_downtime does not remove silences - https://phabricator.wikimedia.org/T395032#10907774 (10Volans) Just to clarify expectations here, while #sre-tools is happy to be included in the discussion/design, we think that this requ... [07:15:30] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1008.eqiad.wmnet [07:15:39] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2008.codfw.wmnet [07:15:49] (03PS1) 10Slyngshede: Docker: add python3-bs4 to Blubber build [software/bitu] - 10https://gerrit.wikimedia.org/r/1156120 (https://phabricator.wikimedia.org/T396103) [07:16:07] (03PS1) 10Jelto: apt: remove 718C1F180B5A84A3 ceph-octopus package [puppet] - 10https://gerrit.wikimedia.org/r/1156121 (https://phabricator.wikimedia.org/T396701) [07:16:18] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Improve database master switchover script - https://phabricator.wikimedia.org/T200306#10907781 (10Volans) Is this still relevant or superseded by more recent development/plans in this area? 
[07:18:10] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Decommission script race condition - https://phabricator.wikimedia.org/T206448#10907782 (10Volans) 05Open→03Declined The script doesn't exists since long time, replaced by the related cookbook. [07:18:26] (03CR) 10Slyngshede: [C:03+2] Docker: add python3-bs4 to Blubber build [software/bitu] - 10https://gerrit.wikimedia.org/r/1156120 (https://phabricator.wikimedia.org/T396103) (owner: 10Slyngshede) [07:19:31] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Improve database master switchover script - https://phabricator.wikimedia.org/T200306#10907787 (10jcrespo) In terms of features requests, I think it is still relevant. If they want to merge its requirements into another ticket, that's ok (I don't think all of... [07:20:52] (03Merged) 10jenkins-bot: Docker: add python3-bs4 to Blubber build [software/bitu] - 10https://gerrit.wikimedia.org/r/1156120 (https://phabricator.wikimedia.org/T396103) (owner: 10Slyngshede) [07:21:30] 10SRE-tools, 06Infrastructure-Foundations: decommission cookbook: add support for decom spreadsheet - https://phabricator.wikimedia.org/T244315#10907807 (10Volans) @wiki_willy is this something still needed or the current workflow doesn't need it anymore? 
[07:21:35] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Improve database master switchover script - https://phabricator.wikimedia.org/T200306#10907808 (10jcrespo) [07:22:25] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:23:05] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1008.eqiad.wmnet [07:23:27] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2008.codfw.wmnet [07:24:13] (03PS4) 10Anzx: mrwiki: add मसूदा (draft) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) [07:24:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P77783 and previous config saved to /var/cache/conftool/dbconfig/20250612-072422-marostegui.json [07:25:19] (03CR) 10Jcrespo: [C:03+2] dbbackups: Upgrade db1225, dbprov1004 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1156079 (https://phabricator.wikimedia.org/T395989) (owner: 10Jcrespo) [07:26:34] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1156121 (https://phabricator.wikimedia.org/T396701) (owner: 10Jelto) [07:27:50] (03PS1) 10Marostegui: db2225: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1156178 (https://phabricator.wikimedia.org/T396549) [07:28:21] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet [07:28:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2225 T396549', diff saved to https://phabricator.wikimedia.org/P77784 and previous config saved to /var/cache/conftool/dbconfig/20250612-072827-marostegui.json 
[07:28:31] T396549: Migrate s2 to MariaDB 10.11 - https://phabricator.wikimedia.org/T396549 [07:28:38] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus4002.ulsfo.wmnet [07:28:44] (03PS1) 10Muehlenhoff: Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/1156191 [07:28:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2225.codfw.wmnet with reason: Maintenance [07:28:57] (03CR) 10Marostegui: [C:03+2] db2225: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1156178 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui) [07:29:04] (03CR) 10Jelto: [C:03+2] apt: remove 718C1F180B5A84A3 ceph-octopus package [puppet] - 10https://gerrit.wikimedia.org/r/1156121 (https://phabricator.wikimedia.org/T396701) (owner: 10Jelto) [07:29:18] jelto: ok to merge? [07:29:36] yes please merge the "apt: remove 718C1F180B5A84A3 ceph-octopus package" change :) [07:29:45] doing it [07:30:13] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3003.esams.wmnet [07:30:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.48% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:30:33] 06SRE, 06Data-Engineering: WE 5.4 FY 25/26: Improve automata detection at the edge and pass it to the refinery pipeline - https://phabricator.wikimedia.org/T396562#10907844 (10JAllemandou) This idea is great, thank you @Joe for filling this. I concur with the idea that when the frontend computes some sort of... [07:31:22] ^ jelto that looks worrying [07:31:40] the PHPFPMTooBusy ? [07:31:49] yeah, doesn't it? 
[07:32:37] yes they are saturating since around 7:20 UTC [07:34:39] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4002.ulsfo.wmnet [07:35:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:35:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:37:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77785 and previous config saved to /var/cache/conftool/dbconfig/20250612-073705-root.json [07:38:20] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus5002.eqsin.wmnet [07:38:35] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet [07:39:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P77786 and previous config saved to /var/cache/conftool/dbconfig/20250612-073930-marostegui.json [07:40:36] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2047.codfw.wmnet with reason: host reimage [07:40:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.46% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:43:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2047.codfw.wmnet with reason: host reimage [07:44:33] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5002.eqsin.wmnet [07:44:38] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6002.drmrs.wmnet [07:45:01] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ncredir7001.magru.wmnet [07:46:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1046', diff saved to https://phabricator.wikimedia.org/P77787 and previous config saved to /var/cache/conftool/dbconfig/20250612-074624-marostegui.json [07:46:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1046.eqiad.wmnet with reason: Maintenance [07:49:40] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [07:50:24] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[07:50:25] (03PS1) 10Brouberol: WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156242 [07:52:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77788 and previous config saved to /var/cache/conftool/dbconfig/20250612-075211-root.json [07:52:18] (03PS4) 10Brouberol: Removing WM Enterprise downloader Puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [07:52:22] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [07:52:53] (03PS1) 10Muehlenhoff: Apply the ganeti role to ganeti2045/ganeti2046 [puppet] - 10https://gerrit.wikimedia.org/r/1156243 [07:53:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77789 and previous config saved to /var/cache/conftool/dbconfig/20250612-075338-root.json [07:54:14] (03PS5) 10Brouberol: Removing WM Enterprise downloader Puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [07:54:17] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [07:54:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T396130)', diff saved to https://phabricator.wikimedia.org/P77790 and previous config saved to /var/cache/conftool/dbconfig/20250612-075437-marostegui.json [07:54:42] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:54:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 
on db1175.eqiad.wmnet with reason: Maintenance [07:55:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T396130)', diff saved to https://phabricator.wikimedia.org/P77791 and previous config saved to /var/cache/conftool/dbconfig/20250612-075501-marostegui.json [07:55:23] jmm@cumin1003 decommission (PID 1247618) is awaiting input [07:57:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.88% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:57:48] 10SRE-tools, 06Infrastructure-Foundations, 10netops: Evaluate automatic MAC-based DHCP for production servers - https://phabricator.wikimedia.org/T396712 (10Volans) 03NEW p:05Triage→03Medium [07:58:18] (03PS5) 10Brouberol: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [07:58:29] (03Abandoned) 10Brouberol: WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156242 (owner: 10Brouberol) [07:58:39] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2047.codfw.wmnet with OS bookworm [07:58:52] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [07:59:09] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [07:59:09] !log jmm@cumin1003 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [07:59:10] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7001.magru.wmnet [07:59:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10907925 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `ncredir7001.magru.wmnet` - ncredir7001.magru.wmnet (**PASS**) - Downtimed host o... [07:59:20] (03PS6) 10Brouberol: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:00:05] PROBLEM - Hadoop NodeManager on an-worker1196 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:00:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T396130)', diff saved to https://phabricator.wikimedia.org/P77793 and previous config saved to /var/cache/conftool/dbconfig/20250612-080039-marostegui.json [08:00:43] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:01:06] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7004.magru.wmnet [08:01:08] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [08:01:10] (03CR) 10Marostegui: "Do not remove the word objectstash from parsercache.pp and the .my.cnf.pp as we've merged both roles into the parsercache one." 
[puppet] - 10https://gerrit.wikimedia.org/r/1156191 (owner: 10Muehlenhoff) [08:02:13] 10SRE-tools, 06Infrastructure-Foundations, 10netops: Evaluate automatic MAC-based DHCP for production servers - https://phabricator.wikimedia.org/T396712#10907935 (10Volans) I've run some custom code with spicerack-shell and get the audit data for the whole fleet and comparing the MAC address retrieved from... [08:02:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.88% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:05:43] (03PS1) 10Muehlenhoff: Remove ceph-octopus-bullseye update definition in one more place [puppet] - 10https://gerrit.wikimedia.org/r/1156246 [08:06:43] (03PS7) 10Brouberol: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:06:47] jmm@cumin1003 makevm (PID 1252039) is awaiting input [08:06:54] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1156246 (owner: 10Muehlenhoff) [08:07:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77794 and previous config saved to /var/cache/conftool/dbconfig/20250612-080717-root.json [08:08:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77795 and previous config saved to /var/cache/conftool/dbconfig/20250612-080843-root.json [08:10:07] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:10:55] (03PS2) 10Muehlenhoff: Also remove ceph-octopus from two update definitions [puppet] - 10https://gerrit.wikimedia.org/r/1156246 [08:11:00] (03PS1) 10Esanders: Support placeholders mangled by MF's HtmlFormatter [extensions/DiscussionTools] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156247 (https://phabricator.wikimedia.org/T396695) [08:11:24] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:11:48] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:12:04] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:12:34] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:12:44] (03CR) 10Jelto: Also remove ceph-octopus from two update definitions (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1156246 (owner: 10Muehlenhoff) [08:12:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:12:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/DiscussionTools] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156247 (https://phabricator.wikimedia.org/T396695) (owner: 10Esanders) [08:15:01] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:15:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P77796 and previous config saved to /var/cache/conftool/dbconfig/20250612-081546-marostegui.json [08:16:06] (03CR) 10Muehlenhoff: Also remove ceph-octopus from two update definitions (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1156246 (owner: 10Muehlenhoff) [08:17:10] (03CR) 10Jelto: [C:03+2] Also remove ceph-octopus from two update definitions (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1156246 (owner: 10Muehlenhoff) [08:17:48] jmm@cumin1003 reimage (PID 1252481) is awaiting input [08:19:15] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7004.magru.wmnet - jmm@cumin1003" [08:19:19] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7004.magru.wmnet - jmm@cumin1003" [08:19:19] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:19:19] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7004.magru.wmnet on all recursors [08:19:23] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7004.magru.wmnet on all recursors [08:19:52] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7004.magru.wmnet - jmm@cumin1003" [08:19:56] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7004.magru.wmnet - jmm@cumin1003" [08:21:51] jmm@cumin1003 reimage (PID 1252481) is awaiting input [08:22:16] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti2048.codfw.wmnet with OS 
bookworm [08:22:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2225 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77797 and previous config saved to /var/cache/conftool/dbconfig/20250612-082223-root.json [08:22:58] jmm@cumin1003 makevm (PID 1252039) is awaiting input [08:23:12] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ncredir7004.magru.wmnet with OS bookworm [08:23:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77798 and previous config saved to /var/cache/conftool/dbconfig/20250612-082348-root.json [08:26:05] RECOVERY - Hadoop NodeManager on an-worker1196 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:26:22] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:29:44] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:30:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P77799 and previous config saved to /var/cache/conftool/dbconfig/20250612-083053-marostegui.json [08:32:23] (03PS1) 10Slyngshede: P:idp update regex for zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/1156260 [08:35:17] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2048.codfw.wmnet with reason: host reimage [08:35:49] 10ops-codfw, 10ops-eqiad, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717 (10ayounsi) 03NEW p:05Triage→03High [08:38:15] (03PS1) 10Filippo Giunchedi: reimage: check for Monitoring::Host in puppetdb [cookbooks] - 10https://gerrit.wikimedia.org/r/1156264 (https://phabricator.wikimedia.org/T395449) [08:38:16] (03CR) 10Muehlenhoff: [C:03+2] profile::memcached::instance: Add support for passing firewall as an srange [puppet] - 10https://gerrit.wikimedia.org/r/1155609 (owner: 10Muehlenhoff) [08:38:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77800 and previous config saved to /var/cache/conftool/dbconfig/20250612-083854-root.json [08:38:57] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2048.codfw.wmnet with reason: host reimage [08:39:40] (03PS1) 10Filippo Giunchedi: monitoring: add note about reimage cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1156265 (https://phabricator.wikimedia.org/T395449) [08:41:13] 10ops-codfw, 06DC-Ops: cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718 (10ayounsi) 03NEW [08:42:04] jelto@cumin1002 upgrade (PID 1417105) is awaiting input [08:43:22] (03CR) 10Volans: [C:03+1] "LGTM thanks for the update. The new query matches the same number of hosts from a quick cumin query." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1156264 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [08:44:17] (03CR) 10Volans: "Actually, wait a second" [cookbooks] - 10https://gerrit.wikimedia.org/r/1156264 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [08:45:35] (03PS1) 10Muehlenhoff: memcached: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156269 [08:46:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T396130)', diff saved to https://phabricator.wikimedia.org/P77801 and previous config saved to /var/cache/conftool/dbconfig/20250612-084600-marostegui.json [08:46:04] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:46:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [08:46:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T396130)', diff saved to https://phabricator.wikimedia.org/P77802 and previous config saved to /var/cache/conftool/dbconfig/20250612-084611-marostegui.json [08:46:14] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.10 [08:49:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156269 (owner: 10Muehlenhoff) [08:50:14] (03Abandoned) 10Muehlenhoff: memcached: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1153981 (https://phabricator.wikimedia.org/T371881) (owner: 10Muehlenhoff) [08:50:47] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:50:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T396130)', diff saved to https://phabricator.wikimedia.org/P77803 and previous config saved to /var/cache/conftool/dbconfig/20250612-085048-marostegui.json [08:54:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77804 and previous config saved to /var/cache/conftool/dbconfig/20250612-085359-root.json [08:54:12] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2048.codfw.wmnet with OS bookworm [08:56:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.10 [08:56:39] !log jmm@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7004.magru.wmnet with OS bookworm [08:56:39] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host ncredir7004.magru.wmnet [08:57:15] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.10 [08:57:42] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet [09:01:10] FIRING: BFDdown: BFD session down between cr2-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:02:22] PROBLEM - Squid on install7001 is CRITICAL: connect to address 195.200.68.7 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy [09:02:41] FIRING: [2x] ProbeDown: Service install7001:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:03:22] ^ install7001 is expected, I'm silencing it [09:04:05] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on install7001.wikimedia.org with reason: migration to install7002 [09:04:15] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7004.magru.wmnet [09:04:16] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [09:04:20] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet [09:05:51] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti2049.codfw.wmnet with OS bookworm [09:05:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P77805 and previous config saved to /var/cache/conftool/dbconfig/20250612-090555-marostegui.json [09:06:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:07:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.10 [09:07:41] RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:07:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-magru:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:07:56] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet [09:08:13] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to 17.10 [09:09:59] jmm@cumin1003 makevm (PID 1259982) is awaiting input [09:11:52] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet [09:13:42] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10908264 (10Volans) 05Resolved→03Open FYI db2241 and db2242 have their MGMT DNS inverted, so `db2241.mgmt.codfw.wmnet` points to `db2242` iDRAC and viceversa. This is very dange... [09:14:15] (03CR) 10Clément Goubert: [C:03+1] httpd: introduce -bookworm track and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [09:15:09] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan1002.eqiad.wmnet [09:15:13] (03PS2) 10Muehlenhoff: cloudcontrol/eqiad1: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1154037 [09:16:53] jmm@cumin1003 makevm (PID 1259982) is awaiting input [09:19:32] !log jmm@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:19:44] (03CR) 10Alexandros Kosiaris: [C:03+1] httpd: introduce -bookworm track and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [09:19:46] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir7004.magru.wmnet [09:20:01] (03CR) 10Muehlenhoff: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154037 (owner: 10Muehlenhoff) [09:20:40] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:21:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P77806 and previous config saved to /var/cache/conftool/dbconfig/20250612-092103-marostegui.json [09:23:00] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:24:23] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1002.eqiad.wmnet [09:26:11] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7004.magru.wmnet [09:26:12] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [09:28:54] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:28:54] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7004.magru.wmnet on all recursors [09:28:57] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7004.magru.wmnet on all recursors [09:29:02] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [09:32:01] !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan2002.codfw.wmnet [09:34:27] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2049.codfw.wmnet with reason: host reimage [09:34:38] jmm@cumin1003 makevm (PID 1262903) is awaiting input [09:34:39] (03PS3) 10Hnowlan: services_proxy: change mobileapps port [puppet] - 10https://gerrit.wikimedia.org/r/1155719 (https://phabricator.wikimedia.org/T367418) [09:34:57] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155719 (https://phabricator.wikimedia.org/T367418) (owner: 10Hnowlan) [09:34:58] Am I ok to self-deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/1156247, a UBN train 
blocker? [09:36:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T396130)', diff saved to https://phabricator.wikimedia.org/P77808 and previous config saved to /var/cache/conftool/dbconfig/20250612-093609-marostegui.json [09:36:14] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:36:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [09:36:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T396130)', diff saved to https://phabricator.wikimedia.org/P77809 and previous config saved to /var/cache/conftool/dbconfig/20250612-093631-marostegui.json [09:36:32] jouncebot: nowandnext [09:36:32] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [09:36:32] In 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1000) [09:36:45] edsanders: yeah I think you can go ahead [09:37:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy1003 using scap backport" [extensions/DiscussionTools] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156247 (https://phabricator.wikimedia.org/T396695) (owner: 10Esanders) [09:37:46] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2049.codfw.wmnet with reason: host reimage [09:38:57] (03Merged) 10jenkins-bot: Support placeholders mangled by MF's HtmlFormatter [extensions/DiscussionTools] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156247 (https://phabricator.wikimedia.org/T396695) (owner: 10Esanders) [09:39:30] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir7004.magru.wmnet - jmm@cumin1003" [09:39:34] !log esanders@deploy1003 
Started scap sync-world: Backport for [[gerrit:1156247|Support placeholders mangled by MF's HtmlFormatter (T396695)]] [09:39:37] T396695: DiscussionTools features aren't working on mobile web in June 9, 2025 MediaWiki deployment - https://phabricator.wikimedia.org/T396695 [09:39:39] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2002.codfw.wmnet [09:41:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T396130)', diff saved to https://phabricator.wikimedia.org/P77811 and previous config saved to /var/cache/conftool/dbconfig/20250612-094109-marostegui.json [09:41:14] !log cmooney@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1047.eqiad.wmnet [09:41:33] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti1047.eqiad.wmnet with reason: hw check [09:41:50] !log esanders@deploy1003 esanders: Backport for [[gerrit:1156247|Support placeholders mangled by MF's HtmlFormatter (T396695)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:42:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir7004.magru.wmnet - jmm@cumin1003"
[09:42:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:42:05] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7004.magru.wmnet on all recursors
[09:42:09] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7004.magru.wmnet on all recursors
[09:42:13] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir7004.magru.wmnet
[09:42:38] (PS1) Clément Goubert: mediawiki: Add job history limit control [deployment-charts] - https://gerrit.wikimedia.org/r/1156288 (https://phabricator.wikimedia.org/T395885)
[09:43:12] !log esanders@deploy1003 esanders: Continuing with sync
[09:43:25] (CR) Muehlenhoff: [C:+2] cloudcontrol/eqiad1: Switch to profile::memcached::firewall_src_sets [puppet] - https://gerrit.wikimedia.org/r/1154037 (owner: Muehlenhoff)
[09:43:55] (PS2) Muehlenhoff: memcached: Switch to profile::memcached::firewall_src_sets [puppet] - https://gerrit.wikimedia.org/r/1156269
[09:44:02] (Abandoned) Hnowlan: services_proxy: change mobileapps port [puppet] - https://gerrit.wikimedia.org/r/1155719 (https://phabricator.wikimedia.org/T367418) (owner: Hnowlan)
[09:45:20] ops-eqiad, SRE, DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#10908435 (cmooney) For the record I ran some tests on this host today to see if there were signs of an issue with the NIC in general or possibly the DAC cable connecting it to asw2-c7-e...
[09:46:29] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1047.eqiad.wmnet
[09:47:23] (CR) Muehlenhoff: [C:+2] "Confirmed to be a NOP using "iptables -L" pre and post merge" [puppet] - https://gerrit.wikimedia.org/r/1154037 (owner: Muehlenhoff)
[09:47:29] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1156269 (owner: Muehlenhoff)
[09:48:14] (PS8) Brouberol: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[09:50:11] !log esanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156247|Support placeholders mangled by MF's HtmlFormatter (T396695)]] (duration: 10m 37s)
[09:50:15] T396695: DiscussionTools features aren't working on mobile web in June 9, 2025 MediaWiki deployment - https://phabricator.wikimedia.org/T396695
[09:53:49] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2049.codfw.wmnet with OS bookworm
[09:56:21] (PS1) Jgiannelos: WIP - wikifeeds pcs request template [deployment-charts] - https://gerrit.wikimedia.org/r/1156289
[09:56:45] SRE, Ganeti, Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10908475 (MoritzMuehlenhoff)
[09:59:19] (Abandoned) Jgiannelos: WIP - wikifeeds pcs request template [deployment-charts] - https://gerrit.wikimedia.org/r/1156289 (owner: Jgiannelos)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1000)
[10:00:19] (PS1) Hnowlan: wikifeeds: use mobileapps via service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1156290 (https://phabricator.wikimedia.org/T367418)
[10:01:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:02:37] (PS1) Vgutierrez: hiera: Switch drmrs to unified cert issued by GTS [puppet] - https://gerrit.wikimedia.org/r/1156293 (https://phabricator.wikimedia.org/T395131)
[10:03:14] (CR) Cathal Mooney: [C:+1] Promote the TransitPeeringIn/OutSaturation alerts to p.aging (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1155620 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[10:03:18] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1156293 (https://phabricator.wikimedia.org/T395131) (owner: Vgutierrez)
[10:03:33] (PS3) Muehlenhoff: memcached: Switch to profile::memcached::firewall_src_sets [puppet] - https://gerrit.wikimedia.org/r/1156269
[10:05:20] jmm@cumin1003 reimage (PID 1268022) is awaiting input
[10:05:55] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1156269 (owner: Muehlenhoff)
[10:06:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:06:50] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti2050.codfw.wmnet with OS bookworm
[10:07:08] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7004.magru.wmnet
[10:07:10] !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[10:10:01] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1156293 (https://phabricator.wikimedia.org/T395131) (owner: Vgutierrez)
[10:11:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[10:11:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P77813 and previous config saved to /var/cache/conftool/dbconfig/20250612-101123-marostegui.json
[10:12:45] jmm@cumin1003 makevm (PID 1268168) is awaiting input
[10:14:01] !log installing Kerberos security updates
[10:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:18] jmm@cumin1003 makevm (PID 1268168) is awaiting input
[10:16:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[10:16:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T395241)', diff saved to https://phabricator.wikimedia.org/P77814 and previous config saved to /var/cache/conftool/dbconfig/20250612-101655-fceratto.json
[10:20:45] (Abandoned) Slyngshede: P:idp update regex for zarcillo [puppet] - https://gerrit.wikimedia.org/r/1156260 (owner: Slyngshede)
[10:21:51] (CR) Slyngshede: [V:+1 C:+2] P:idm Enable API [puppet] - https://gerrit.wikimedia.org/r/1154262 (https://phabricator.wikimedia.org/T364605) (owner: Slyngshede)
[10:23:02] (CR) Fabfur: [C:+2] hiera: Switch drmrs to unified cert issued by GTS [puppet] - https://gerrit.wikimedia.org/r/1156293 (https://phabricator.wikimedia.org/T395131) (owner: Vgutierrez)
[10:23:35] (CR) Fabfur: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1156293 (https://phabricator.wikimedia.org/T395131) (owner: Vgutierrez)
[10:23:46] !log jmm@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti2050.codfw.wmnet with OS bookworm
[10:23:58] !log jmm@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:24:04] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir7004.magru.wmnet
[10:25:09] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti2050.codfw.wmnet with OS bookworm
[10:26:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T396130)', diff saved to https://phabricator.wikimedia.org/P77815 and previous config saved to /var/cache/conftool/dbconfig/20250612-102630-marostegui.json
[10:26:36] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[10:26:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[10:26:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[10:27:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T396130)', diff saved to https://phabricator.wikimedia.org/P77816 and previous config saved to /var/cache/conftool/dbconfig/20250612-102700-marostegui.json
[10:28:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T395241)', diff saved to https://phabricator.wikimedia.org/P77817 and previous config saved to /var/cache/conftool/dbconfig/20250612-102834-fceratto.json
[10:28:53] (CR) Hnowlan: [C:+1] mediawiki: Add job history limit control [deployment-charts] - https://gerrit.wikimedia.org/r/1156288 (https://phabricator.wikimedia.org/T395885) (owner: Clément Goubert)
[10:29:24] (CR) Clément Goubert: [C:+2] mediawiki: Add job history limit control [deployment-charts] - https://gerrit.wikimedia.org/r/1156288 (https://phabricator.wikimedia.org/T395885) (owner: Clément Goubert)
[10:30:29] (PS3) Bartosz Dziewoński: tables-catalog: Make test_validation.py actually validate when executed [puppet] - https://gerrit.wikimedia.org/r/1155885
[10:30:31] (CR) Ladsgroup: [C:+2] tables-catalog: Make test_validation.py actually validate when executed [puppet] - https://gerrit.wikimedia.org/r/1155885 (owner: Bartosz Dziewoński)
[10:30:33] (CR) Ladsgroup: [V:+2 C:+2] tables-catalog: Make test_validation.py actually validate when executed [puppet] - https://gerrit.wikimedia.org/r/1155885 (owner: Bartosz Dziewoński)
[10:31:52] (CR) Btullis: [V:+1 C:+2] Add a prometheus connector for thanos in the test presto cluster [puppet] - https://gerrit.wikimedia.org/r/1155278 (https://phabricator.wikimedia.org/T347430) (owner: Btullis)
[10:31:58] (Merged) jenkins-bot: mediawiki: Add job history limit control [deployment-charts] - https://gerrit.wikimedia.org/r/1156288 (https://phabricator.wikimedia.org/T395885) (owner: Clément Goubert)
[10:32:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T396130)', diff saved to https://phabricator.wikimedia.org/P77818 and previous config saved to /var/cache/conftool/dbconfig/20250612-103159-marostegui.json
[10:32:05] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[10:32:32] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1049.eqiad.wmnet
[10:33:55] !log cgoubert@deploy1003 Started scap sync-world: 1156288: mediawiki: Add job history limit control - T395885
[10:33:59] T395885: Improve alerting for flaky mw-cron jobs - https://phabricator.wikimedia.org/T395885
[10:34:31] (PS1) Majavah: hieradata: Update striker-toolsbeta to 2025-06-12-103158-production [puppet] - https://gerrit.wikimedia.org/r/1156294 (https://phabricator.wikimedia.org/T364605)
[10:35:01] (CR) David Caro: [C:+1] "LGTM /me loves types \o/" [puppet] - https://gerrit.wikimedia.org/r/1155602 (https://phabricator.wikimedia.org/T396448) (owner: Majavah)
[10:36:43] !log cgoubert@deploy1003 Finished scap sync-world: 1156288: mediawiki: Add job history limit control - T395885 (duration: 02m 48s)
[10:37:50] jmm@cumin1003 drain-node (PID 1271159) is awaiting input
[10:37:59] (CR) Majavah: [C:+2] hieradata: Update striker-toolsbeta to 2025-06-12-103158-production [puppet] - https://gerrit.wikimedia.org/r/1156294 (https://phabricator.wikimedia.org/T364605) (owner: Majavah)
[10:38:13] (CR) Majavah: [C:+2] P:openstack: pdns: Add type definition for host config [puppet] - https://gerrit.wikimedia.org/r/1155602 (https://phabricator.wikimedia.org/T396448) (owner: Majavah)
[10:38:32] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1049.eqiad.wmnet
[10:42:10] (PS1) Clément Goubert: mw::periodic_job: Add job history limit control [puppet] - https://gerrit.wikimedia.org/r/1156295 (https://phabricator.wikimedia.org/T395885)
[10:42:26] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1156295 (https://phabricator.wikimedia.org/T395885) (owner: Clément Goubert)
[10:42:44] !log jmm@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti2050.codfw.wmnet with OS bookworm
[10:43:24] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7004.magru.wmnet
[10:43:25] !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[10:43:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P77819 and previous config saved to /var/cache/conftool/dbconfig/20250612-104341-fceratto.json
[10:43:48] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1049.eqiad.wmnet
[10:44:22] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1049.eqiad.wmnet
[10:47:06] (PS4) Majavah: P:openstack: pdns: auth: Bind the API on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448)
[10:47:06] (PS5) Majavah: P:openstack: pdns: auth: Support query_local_address for IPv6 [puppet] - https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448)
[10:47:06] (PS2) Majavah: P:openstack: pdns: auth: Explicitely configure IPs to bind on [puppet] - https://gerrit.wikimedia.org/r/1155603 (https://phabricator.wikimedia.org/T396448)
[10:47:07] (PS4) Majavah: P:openstack: pdns: recursor: Support binding on multiple addresses [puppet] - https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448)
[10:47:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P77820 and previous config saved to /var/cache/conftool/dbconfig/20250612-104706-marostegui.json
[10:47:08] (PS2) Majavah: hieradata: Add codfw1dev v6 auth DNS IPs [puppet] - https://gerrit.wikimedia.org/r/1155613 (https://phabricator.wikimedia.org/T396448)
[10:47:09] (PS2) Majavah: hieradata: Add codfw1dev v6 recursive DNS IPs [puppet] - https://gerrit.wikimedia.org/r/1155614 (https://phabricator.wikimedia.org/T396448)
[10:47:15] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1050.eqiad.wmnet
[10:49:04] jmm@cumin1003 makevm (PID 1271566) is awaiting input
[10:49:20] (PS2) Clément Goubert: mw::periodic_job: Add job history limit control [puppet] - https://gerrit.wikimedia.org/r/1156295 (https://phabricator.wikimedia.org/T395885)
[10:49:57] (PS3) Clément Goubert: mw::periodic_job: Add job history limit control [puppet] - https://gerrit.wikimedia.org/r/1156295 (https://phabricator.wikimedia.org/T395885)
[10:50:28] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:50:50] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7004.magru.wmnet - jmm@cumin1003"
[10:50:58] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1050.eqiad.wmnet
[10:51:15] (CR) Ladsgroup: "I wrote a script to double check nothing is removed or added or changed, it confirms this is just re-ordering." [puppet] - https://gerrit.wikimedia.org/r/1155886 (owner: Bartosz Dziewoński)
[10:51:16] (CR) Jgiannelos: [C:+1] "Overall looks OK. Do we need any networking changes to allow connecting to PCS via service mesh ?" [deployment-charts] - https://gerrit.wikimedia.org/r/1156290 (https://phabricator.wikimedia.org/T367418) (owner: Hnowlan)
[10:51:22] (PS4) Bartosz Dziewoński: tables-catalog: Alphabetize, enforce this in test_validation.py [puppet] - https://gerrit.wikimedia.org/r/1155886
[10:51:40] (PS5) Majavah: P:openstack: pdns: auth: Bind the API on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448)
[10:51:40] (PS6) Majavah: P:openstack: pdns: auth: Support query_local_address for IPv6 [puppet] - https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448)
[10:51:40] (PS3) Majavah: P:openstack: pdns: auth: Explicitely configure IPs to bind on [puppet] - https://gerrit.wikimedia.org/r/1155603 (https://phabricator.wikimedia.org/T396448)
[10:51:41] (PS5) Majavah: P:openstack: pdns: recursor: Support binding on multiple addresses [puppet] - https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448)
[10:51:42] (PS3) Majavah: hieradata: Add codfw1dev v6 auth DNS IPs [puppet] - https://gerrit.wikimedia.org/r/1155613 (https://phabricator.wikimedia.org/T396448)
[10:51:46] (PS3) Majavah: hieradata: Add codfw1dev v6 recursive DNS IPs [puppet] - https://gerrit.wikimedia.org/r/1155614 (https://phabricator.wikimedia.org/T396448)
[10:51:50] (PS1) Majavah: P:openstack: pdns: recursor: Fix type of bgp_vip [puppet] - https://gerrit.wikimedia.org/r/1156298
[10:51:58] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1156295 (https://phabricator.wikimedia.org/T395885) (owner: Clément Goubert)
[10:53:55] jmm@cumin1003 makevm (PID 1271566) is awaiting input
[10:55:26] (CR) Ladsgroup: [C:+2] tables-catalog: Alphabetize, enforce this in test_validation.py [puppet] - https://gerrit.wikimedia.org/r/1155886 (owner: Bartosz Dziewoński)
[10:56:14] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1050.eqiad.wmnet
[10:56:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1050.eqiad.wmnet
[10:57:03] (PS5) Bartosz Dziewoński: tables-catalog: Add Flow tables [puppet] - https://gerrit.wikimedia.org/r/1155887 (https://phabricator.wikimedia.org/T363581)
[10:57:21] (CR) Ladsgroup: [C:+2] tables-catalog: Add Flow tables [puppet] - https://gerrit.wikimedia.org/r/1155887 (https://phabricator.wikimedia.org/T363581) (owner: Bartosz Dziewoński)
[10:57:23] (CR) Ladsgroup: [V:+2 C:+2] tables-catalog: Add Flow tables [puppet] - https://gerrit.wikimedia.org/r/1155887 (https://phabricator.wikimedia.org/T363581) (owner: Bartosz Dziewoński)
[10:57:38] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1051.eqiad.wmnet
[10:57:51] (CR) Majavah: [C:+2] P:openstack: pdns: recursor: Fix type of bgp_vip [puppet] - https://gerrit.wikimedia.org/r/1156298 (owner: Majavah)
[10:58:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P77821 and previous config saved to /var/cache/conftool/dbconfig/20250612-105848-fceratto.json
[10:59:20] (PS6) Bartosz Dziewoński: tables-catalog: Add LiquidThreads tables [puppet] - https://gerrit.wikimedia.org/r/1155888 (https://phabricator.wikimedia.org/T363581)
[10:59:39] (CR) Ladsgroup: [C:+2] tables-catalog: Add LiquidThreads tables [puppet] - https://gerrit.wikimedia.org/r/1155888 (https://phabricator.wikimedia.org/T363581) (owner: Bartosz Dziewoński)
[10:59:41] (CR) Ladsgroup: [V:+2 C:+2] tables-catalog: Add LiquidThreads tables [puppet] - https://gerrit.wikimedia.org/r/1155888 (https://phabricator.wikimedia.org/T363581) (owner: Bartosz Dziewoński)
[10:59:53] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7004.magru.wmnet - jmm@cumin1003"
[10:59:53] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:59:53] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7004.magru.wmnet on all recursors
[10:59:57] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7004.magru.wmnet on all recursors
[11:00:26] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7004.magru.wmnet - jmm@cumin1003"
[11:00:28] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:00:29] (CR) Muehlenhoff: [C:+2] Apply the ganeti role to ganeti2045/ganeti2046 [puppet] - https://gerrit.wikimedia.org/r/1156243 (owner: Muehlenhoff)
[11:00:30] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7004.magru.wmnet - jmm@cumin1003"
[11:01:09] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ncredir7004.magru.wmnet with OS bookworm
[11:01:14] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5950/co" [puppet] - https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448) (owner: Majavah)
[11:02:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P77822 and previous config saved to /var/cache/conftool/dbconfig/20250612-110213-marostegui.json
[11:03:23] jmm@cumin1003 drain-node (PID 1274007) is awaiting input
[11:03:56] (PS6) Majavah: P:openstack: pdns: auth: Bind the API on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448)
[11:03:56] (PS7) Majavah: P:openstack: pdns: auth: Support query_local_address for IPv6 [puppet] - https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448)
[11:03:56] (PS4) Majavah: P:openstack: pdns: auth: Explicitely configure IPs to bind on [puppet] - https://gerrit.wikimedia.org/r/1155603 (https://phabricator.wikimedia.org/T396448)
[11:03:57] (PS6) Majavah: P:openstack: pdns: recursor: Support binding on multiple addresses [puppet] - https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448)
[11:03:58] (PS4) Majavah: hieradata: Add codfw1dev v6 auth DNS IPs [puppet] - https://gerrit.wikimedia.org/r/1155613 (https://phabricator.wikimedia.org/T396448)
[11:04:00] (PS4) Majavah: hieradata: Add codfw1dev v6 recursive DNS IPs [puppet] - https://gerrit.wikimedia.org/r/1155614 (https://phabricator.wikimedia.org/T396448)
[11:04:05] (PS5) Bartosz Dziewoński: tables-catalog: Add OATHAuth tables [puppet] - https://gerrit.wikimedia.org/r/1155889 (https://phabricator.wikimedia.org/T391490)
[11:04:11] (CR) Ladsgroup: [C:+2] tables-catalog: Add OATHAuth tables [puppet] - https://gerrit.wikimedia.org/r/1155889 (https://phabricator.wikimedia.org/T391490) (owner: Bartosz Dziewoński)
[11:04:13] (CR) Ladsgroup: [V:+2 C:+2] tables-catalog: Add OATHAuth tables [puppet] - https://gerrit.wikimedia.org/r/1155889 (https://phabricator.wikimedia.org/T391490) (owner: Bartosz Dziewoński)
[11:05:05] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:05:14] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5951/co" [puppet] - https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448) (owner: Majavah)
[11:05:17] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1051.eqiad.wmnet
[11:05:27] (CR) Vgutierrez: [C:+2] hiera: Switch drmrs to unified cert issued by GTS [puppet] - https://gerrit.wikimedia.org/r/1156293 (https://phabricator.wikimedia.org/T395131) (owner: Vgutierrez)
[11:06:28] (PS2) Andrew Bogott: Add radosgw access for members of the new 'object_storage' role. [puppet] - https://gerrit.wikimedia.org/r/1155775 (https://phabricator.wikimedia.org/T396594)
[11:06:28] (PS1) Andrew Bogott: cloudcephosd1014 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156301
[11:06:35] (PS2) Andrew Bogott: cloudcephosd1014 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156301
[11:06:51] andrew@cumin1002 reimage (PID 1530576) is awaiting input
[11:07:07] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1014.eqiad.wmnet with OS bullseye
[11:07:08] (PS5) Bartosz Dziewoński: tables-catalog: Add OAuth tables [puppet] - https://gerrit.wikimedia.org/r/1155890 (https://phabricator.wikimedia.org/T391490)
[11:07:18] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5952/co" [puppet] - https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448) (owner: Majavah)
[11:07:25] (CR) Ladsgroup: [C:+2] tables-catalog: Add OAuth tables [puppet] - https://gerrit.wikimedia.org/r/1155890 (https://phabricator.wikimedia.org/T391490) (owner: Bartosz Dziewoński)
[11:07:29] (CR) Ladsgroup: [V:+2 C:+2] tables-catalog: Add OAuth tables [puppet] - https://gerrit.wikimedia.org/r/1155890 (https://phabricator.wikimedia.org/T391490) (owner: Bartosz Dziewoński)
[11:07:30] !log use Google Trust Services (GTS) unified TLS certificate on drmrs - T395131
[11:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:34] T395131: Replace Digicert TLS certs with Google Trust Services ones - https://phabricator.wikimedia.org/T395131
[11:07:47] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:08:33] (CR) Andrew Bogott: [C:+2] cloudcephosd1014 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156301 (owner: Andrew Bogott)
[11:09:39] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5953/co" [puppet] - https://gerrit.wikimedia.org/r/1155613 (https://phabricator.wikimedia.org/T396448) (owner: Majavah)
[11:10:10] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:10:29] (PS5) Bartosz Dziewoński: tables-catalog: Add CentralAuth tables [puppet] - https://gerrit.wikimedia.org/r/1155891 (https://phabricator.wikimedia.org/T391490)
[11:10:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1051.eqiad.wmnet
[11:10:32] (CR) Ladsgroup: [C:+2] tables-catalog: Add CentralAuth tables [puppet] - https://gerrit.wikimedia.org/r/1155891 (https://phabricator.wikimedia.org/T391490) (owner: Bartosz Dziewoński)
[11:10:34] (CR) Ladsgroup: [V:+2 C:+2] tables-catalog: Add CentralAuth tables [puppet] - https://gerrit.wikimedia.org/r/1155891 (https://phabricator.wikimedia.org/T391490) (owner: Bartosz Dziewoński)
[11:10:38] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1051.eqiad.wmnet
[11:10:45] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5954/co" [puppet] - https://gerrit.wikimedia.org/r/1155614 (https://phabricator.wikimedia.org/T396448) (owner: Majavah)
[11:12:58] (CR) Ladsgroup: "I think it'd be easier to dump the file into the noc host(s) and then we can build a php output similar to what happens with db config (ht" [puppet] - https://gerrit.wikimedia.org/r/1155892 (owner: Bartosz Dziewoński)
[11:13:20] SRE, SRE-Access-Requests: apine is a member of wmf and deployers but not spider pig - https://phabricator.wikimedia.org/T396669#10908681 (Aklapper) Open→Invalid
[11:13:46] (PS1) Andrew Bogott: cloudcephosd1015 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156302 (https://phabricator.wikimedia.org/T309789)
[11:13:47] (PS1) Andrew Bogott: cloudcephosd1016 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156303 (https://phabricator.wikimedia.org/T309789)
[11:13:49] (PS1) Andrew Bogott: cloudcephosd1017 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156304 (https://phabricator.wikimedia.org/T309789)
[11:13:50] (PS1) Andrew Bogott: cloudcephosd1018 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156305 (https://phabricator.wikimedia.org/T309789)
[11:13:53] (PS1) Andrew Bogott: cloudcephosd1019 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156306 (https://phabricator.wikimedia.org/T309789)
[11:13:55] (PS1) Andrew Bogott: cloudcephosd1020 -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1156307 (https://phabricator.wikimedia.org/T309789)
[11:13:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T395241)', diff saved to https://phabricator.wikimedia.org/P77823 and previous config saved to /var/cache/conftool/dbconfig/20250612-111357-fceratto.json
[11:14:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance
[11:14:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T395241)', diff saved to https://phabricator.wikimedia.org/P77824 and previous config saved to /var/cache/conftool/dbconfig/20250612-111423-fceratto.json
[11:16:16] (PS1) Muehlenhoff: Add ganeti204[56] to firewall list [puppet] - https://gerrit.wikimedia.org/r/1156308 (https://phabricator.wikimedia.org/T396590)
[11:17:06] (CR) Majavah: [C:+1] Add radosgw access for members of the new 'object_storage' role. [puppet] - https://gerrit.wikimedia.org/r/1155775 (https://phabricator.wikimedia.org/T396594) (owner: Andrew Bogott)
[11:17:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T396130)', diff saved to https://phabricator.wikimedia.org/P77825 and previous config saved to /var/cache/conftool/dbconfig/20250612-111722-marostegui.json
[11:17:27] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[11:17:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[11:18:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:19:06] (CR) D3r1ck01: multivesion: Remove unused newFromDBName() (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1154139 (owner: Krinkle)
[11:19:58] (PS4) Krinkle: multivesion: Remove unused newFromDBName() [mediawiki-config] - https://gerrit.wikimedia.org/r/1154139
[11:20:28] (CR) D3r1ck01: [C:+1] "I've not looked in the private places but in a public CS search: https://codesearch.wmcloud.org/search/?q=MWMultiversion%3A%3AnewFromDBNam" [mediawiki-config] - https://gerrit.wikimedia.org/r/1154139 (owner: Krinkle)
[11:22:25] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:22:57] ops-codfw, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10908710 (Volans) I think that this has been a case of serial number swap and it probably "worked" because both hosts were setup at the same time maybe. ### Current status ####...
[11:23:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:23:41] (CR) Muehlenhoff: [C:+2] Add ganeti204[56] to firewall list [puppet] - https://gerrit.wikimedia.org/r/1156308 (https://phabricator.wikimedia.org/T396590) (owner: Muehlenhoff)
[11:23:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[11:24:11] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
[11:26:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T395241)', diff saved to https://phabricator.wikimedia.org/P77826 and previous config saved to /var/cache/conftool/dbconfig/20250612-112602-fceratto.json
[11:28:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:28:47] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#10908726 (Andrew)
[11:30:54] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1052.eqiad.wmnet
[11:31:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
[11:31:57] (PS3) Clément Goubert: mw::maintenance::wikidate: Job history for alerting [puppet] - https://gerrit.wikimedia.org/r/1156296 (https://phabricator.wikimedia.org/T395814)
[11:33:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:34:18] ml-etcd1003 will go down for a Ganeti reboot
[11:34:23] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1052.eqiad.wmnet
[11:35:20] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1014.eqiad.wmnet with OS bullseye
[11:35:53] (PS1) Clément Goubert: team-sre/mw-cron: wikidata-updatequeryservicelag alert [alerts] - https://gerrit.wikimedia.org/r/1156310 (https://phabricator.wikimedia.org/T395814)
[11:35:59] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1014.eqiad.wmnet with OS bullseye
[11:38:35] (CR) Alexandros Kosiaris: [C:+1] wikifeeds: use mobileapps via service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1156290 (https://phabricator.wikimedia.org/T367418) (owner: Hnowlan)
[11:40:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1052.eqiad.wmnet
[11:40:36] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1052.eqiad.wmnet
[11:41:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P77828 and previous config saved to /var/cache/conftool/dbconfig/20250612-114110-fceratto.json
[11:49:40] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
[11:55:00] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncredir7004.magru.wmnet with OS bookworm
[11:55:00] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir7004.magru.wmnet
[11:56:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P77829 and previous config
saved to /var/cache/conftool/dbconfig/20250612-115618-fceratto.json [11:56:49] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet [11:58:09] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti2045.codfw.wmnet to cluster codfw and group A [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1200) [12:00:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2045.codfw.wmnet to cluster codfw and group A [12:01:12] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti2046.codfw.wmnet to cluster codfw and group A [12:03:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2046.codfw.wmnet to cluster codfw and group A [12:03:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:06:32] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10908834 (10Andrew) [12:06:34] 10ops-codfw, 06SRE, 06DC-Ops: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363#10908835 (10Andrew) [12:08:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:08:30] 10ops-codfw, 06SRE, 06DC-Ops: Moving extra 1G port to make 10G space on cloud rack. 
- https://phabricator.wikimedia.org/T396363#10908837 (10cmooney) @Jhancock.wm yep should be no problem. We can do it easily enough by just going to the switch in Netbox, clicking on the port, and then renaming it in Netbox t... [12:10:04] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [12:11:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T395241)', diff saved to https://phabricator.wikimedia.org/P77830 and previous config saved to /var/cache/conftool/dbconfig/20250612-121125-fceratto.json [12:11:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [12:11:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T395241)', diff saved to https://phabricator.wikimedia.org/P77831 and previous config saved to /var/cache/conftool/dbconfig/20250612-121141-fceratto.json [12:13:09] jmm@cumin1003 drain-node (PID 1280824) is awaiting input [12:19:54] (03PS1) 10Andrew Bogott: cloud.yaml: Allow overlayfs in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1156321 [12:20:05] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [12:20:33] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156321 (owner: 10Andrew Bogott) [12:21:45] PROBLEM - Host logstash1023 is DOWN: PING CRITICAL - Packet loss = 100% [12:22:18] ^ logstash1023 should be back shortly [12:23:59] (03PS4) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) [12:24:05] FIRING: [2x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1023:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:57] RECOVERY - Host logstash1023 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [12:27:01] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [12:27:08] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet [12:27:26] RESOLVED: [3x] ProbeDown: Service ganeti1031:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:27:34] (03CR) 10CI reject: [V:04-1] admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [12:28:31] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156323 [12:29:32] (03CR) 10Majavah: [C:03+1] cloud.yaml: Allow overlayfs in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1156321 (owner: 10Andrew Bogott) [12:29:41] (03PS1) 10Btullis: Increase thresholds for run_podsandbox and stop_podsandbox in dse-k8s [alerts] - 10https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) [12:33:52] (03CR) 10CI reject: [V:04-1] Increase thresholds for run_podsandbox and stop_podsandbox in dse-k8s [alerts] - 10https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) (owner: 10Btullis) [12:34:29] (03CR) 10Andrew Bogott: [C:03+2] cloud.yaml: Allow overlayfs in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1156321 (owner: 10Andrew Bogott) [12:34:36] (03CR) 10Jgiannelos: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1156323 (owner: 10PipelineBot) [12:36:09] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156323 (owner: 10PipelineBot) [12:36:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for an-coord100[3-4] - https://phabricator.wikimedia.org/T394499#10908935 (10BTullis) Hi @RobH - Sorry, I'm not 100% clear on which host you would like me to proceed. The description at the top says an-coord100... [12:36:43] PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:37:08] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1014.eqiad.wmnet with OS bullseye [12:37:13] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156325 [12:38:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T395241)', diff saved to https://phabricator.wikimedia.org/P77833 and previous config saved to /var/cache/conftool/dbconfig/20250612-123806-fceratto.json [12:41:42] jouncebot: now and next [12:41:42] For the next 0 hour(s) and 18 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1200) [12:41:52] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: enable tracing for store [puppet] - 10https://gerrit.wikimedia.org/r/1155153 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [12:42:35] PROBLEM - Hadoop NodeManager on an-worker1207 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:43:16] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#10908949 (10Jgreen) >>! In T396649#10906049, @RobH wrote: > @Jgreen: Should this turf to you or should I assign it over to Greg for allocation? > > Basically we need to u... [12:43:31] jmm@cumin1003 makevm (PID 1283679) is awaiting input [12:43:41] (03CR) 10Ayounsi: [C:03+2] Promote the TransitPeeringIn/OutSaturation alerts to p.aging [alerts] - 10https://gerrit.wikimedia.org/r/1155620 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:44:55] (03Merged) 10jenkins-bot: Promote the TransitPeeringIn/OutSaturation alerts to p.aging [alerts] - 10https://gerrit.wikimedia.org/r/1155620 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:45:53] PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:47:25] PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:48:09] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1015.eqiad.wmnet [12:48:11] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cloudcephosd1015.eqiad.wmnet [12:49:04] !log depooling lvs7002 before migrating to katran - T396561 [12:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:07] T396561: Switch to katran as forwarding plane on 
non-core DCs - https://phabricator.wikimedia.org/T396561 [12:49:45] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7004.magru.wmnet [12:49:46] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [12:50:06] jouncebot: nowandnext [12:50:06] For the next 0 hour(s) and 9 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1200) [12:50:06] In 0 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1300) [12:50:10] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [12:50:28] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [12:50:35] RECOVERY - Hadoop NodeManager on an-worker1207 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:50:55] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [12:50:55] (03CR) 10Tiziano Fogli: [C:03+1] reimage: check for Monitoring::Host in puppetdb [cookbooks] - 10https://gerrit.wikimedia.org/r/1156264 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [12:51:05] edsanders: hey, around? 
i'm about to ship a train blocker of my own, and i saw you have one scheduled too [12:51:07] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [12:51:18] train blocker _fix_ of course [12:51:40] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [12:51:46] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin depooling P{lvs7002.magru.wmnet} and A:liberica (T396561) [12:51:52] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [12:51:57] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7002.magru.wmnet} and A:liberica (T396561) [12:52:21] (03CR) 10Tiziano Fogli: [C:03+1] monitoring: add note about reimage cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1156265 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [12:52:26] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:52:26] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7004.magru.wmnet on all recursors [12:52:30] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7004.magru.wmnet on all recursors [12:52:39] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T309012 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1155143 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [12:52:58] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7004.magru.wmnet - jmm@cumin1003" [12:53:03] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7004.magru.wmnet - 
jmm@cumin1003" [12:53:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P77834 and previous config saved to /var/cache/conftool/dbconfig/20250612-125314-fceratto.json [12:54:02] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [12:54:55] jmm@cumin1003 drain-node (PID 1286004) is awaiting input [12:55:16] (03CR) 10Brouberol: [C:03+2] Airflow: Increase k8s check frequency in analytics_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152681 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [12:56:04] jmm@cumin1003 makevm (PID 1283679) is awaiting input [12:59:03] (03PS1) 10Vgutierrez: hiera: Depool lvs7002 [puppet] - 10https://gerrit.wikimedia.org/r/1156334 (https://phabricator.wikimedia.org/T396561) [12:59:21] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ncredir7004.magru.wmnet with OS bookworm [12:59:46] (03PS5) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) [12:59:48] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1300) [13:00:05] georgekyz, edsanders, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] hey! 
[13:00:12] o/ [13:00:14] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1014.eqiad.wmnet with OS bullseye [13:00:17] i can deploy today [13:00:21] Hey folks [13:00:43] RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:00:56] Shall I move to the first deployment with spider pig? [13:00:57] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1155604 [13:01:11] !log jmm@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7004.magru.wmnet with OS bookworm [13:01:11] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host ncredir7004.magru.wmnet [13:01:26] !log vgutierrez@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs7002.magru.wmnet with reason: switching to katran [13:01:56] georgekyz: please wait, we have a train blocker scheduled [13:02:07] alright thnx for letting me know [13:02:19] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156334 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:02:19] can you ping me when ready for deployment ? [13:02:22] aha, it was already deployed [13:02:26] a smooth [13:02:31] shall I go then ? [13:02:34] georgekyz: in that case, go ahead! [13:02:39] perfect thnx ! 
[13:02:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [13:03:09] (03PS6) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) [13:03:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to 17.10 [13:03:42] (03Merged) 10jenkins-bot: ores-extension: enable oresUI for the second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155604 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [13:04:01] (03CR) 10Vgutierrez: [C:03+2] hiera: Depool lvs7002 [puppet] - 10https://gerrit.wikimedia.org/r/1156334 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:04:06] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1155604|ores-extension: enable oresUI for the second batch of wikis (T395823 T395668)]] [13:04:11] T395823: [batch #2] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395823 [13:04:11] T395668: [batch #1] Enable revertrisk filters in simplewiki & trwiki - https://phabricator.wikimedia.org/T395668 [13:04:57] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet [13:05:25] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1014.eqiad.wmnet with OS bullseye [13:05:30] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:05:38] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts 
['cloudcephosd1014.eqiad.wmnet'] [13:06:02] (03CR) 10Brouberol: [C:03+2] Configure dse-k8s-worker100[2-3] with the dse_k8s::worker role [puppet] - 10https://gerrit.wikimedia.org/r/1155120 (https://phabricator.wikimedia.org/T395557) (owner: 10Brouberol) [13:06:16] !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1155604|ores-extension: enable oresUI for the second batch of wikis (T395823 T395668)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:06:44] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [13:06:44] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet [13:06:45] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:06:54] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:07:36] (03PS2) 10Vgutierrez: hiera: Switch lvs7002 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1155610 (https://phabricator.wikimedia.org/T396561) [13:07:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-magru:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:08:15] (03PS2) 10Hnowlan: wikifeeds: use mobileapps via service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156290 (https://phabricator.wikimedia.org/T367418) [13:08:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P77835 and previous config saved to 
/var/cache/conftool/dbconfig/20250612-130822-fceratto.json [13:08:53] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:09:49] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs7002 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1155610 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [13:10:03] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:10:57] (03CR) 10Scott French: [C:03+1] mw::periodic_job: Add job history limit control [puppet] - 10https://gerrit.wikimedia.org/r/1156295 (https://phabricator.wikimedia.org/T395885) (owner: 10Clément Goubert) [13:13:25] RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:13:31] (03CR) 10JMeybohm: admin_ng: define a priority class optional environment feature (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [13:14:55] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: pdns: auth: Bind the API on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155219 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [13:14:56] (03CR) 10Scott French: [C:03+1] team-sre/mw-cron: wikidata-updatequeryservicelag alert [alerts] - 10https://gerrit.wikimedia.org/r/1156310 (https://phabricator.wikimedia.org/T395814) (owner: 10Clément Goubert) [13:15:07] (03CR) 10Scott French: [C:03+1] mw::maintenance::wikidate: Job history for alerting [puppet] - 10https://gerrit.wikimedia.org/r/1156296 (https://phabricator.wikimedia.org/T395814) 
(owner: 10Clément Goubert) [13:16:10] (03CR) 10Hnowlan: [C:03+2] wikifeeds: use mobileapps via service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156290 (https://phabricator.wikimedia.org/T367418) (owner: 10Hnowlan) [13:16:12] (03CR) 10Clément Goubert: [C:03+2] mw::periodic_job: Add job history limit control [puppet] - 10https://gerrit.wikimedia.org/r/1156295 (https://phabricator.wikimedia.org/T395885) (owner: 10Clément Goubert) [13:16:20] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp3081*} and A:cp - 9.2.10 upgrade (T390912) [13:16:21] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance::wikidate: Job history for alerting [puppet] - 10https://gerrit.wikimedia.org/r/1156296 (https://phabricator.wikimedia.org/T395814) (owner: 10Clément Goubert) [13:16:24] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [13:16:29] o/ I missed the beginning of the window and also have a meeting soon ^^ [13:16:35] thanks urbanecm for deploying! [13:17:20] (03PS1) 10Filippo Giunchedi: thanos: add memcached-based index caching to store [puppet] - 10https://gerrit.wikimedia.org/r/1156341 (https://phabricator.wikimedia.org/T394319) [13:17:21] (03PS1) 10Filippo Giunchedi: thanos: trial store memcache on titan[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/1156342 (https://phabricator.wikimedia.org/T394319) [13:17:23] (03PS1) 10Filippo Giunchedi: thanos: activate store memcached across the board [puppet] - 10https://gerrit.wikimedia.org/r/1156343 (https://phabricator.wikimedia.org/T394319) [13:17:46] (03Merged) 10jenkins-bot: wikifeeds: use mobileapps via service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156290 (https://phabricator.wikimedia.org/T367418) (owner: 10Hnowlan) [13:17:47] Folks we found some issues on some specific wikis so I will cancel the deployment, and I will deploy a revert. 
[13:17:57] (03CR) 10Scott French: [C:03+1] x-wikimedia-debug-routing: add mw-experimental hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154069 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [13:18:06] !log gkyziridis@deploy1003 Sync cancelled. [13:18:32] FIRING: KubernetesCalicoDown: dse-k8s-worker1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1012.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:18:48] (03PS1) 10Gkyziridis: Revert "ores-extension: enable oresUI for the second batch of wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156344 [13:19:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156344 (owner: 10Gkyziridis) [13:19:43] (03PS1) 10JMeybohm: CI: Ensure fixtures are available during tasklist creation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156346 (https://phabricator.wikimedia.org/T396234) [13:20:04] (03Merged) 10jenkins-bot: Revert "ores-extension: enable oresUI for the second batch of wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156344 (owner: 10Gkyziridis) [13:20:25] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1156344|Revert "ores-extension: enable oresUI for the second batch of wikis"]] [13:20:32] !log depooling wdqs1022, it seems to not be updated [13:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:45] !log depooling wdqs1022, it seems to not be updated - T396577 [13:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:49] T396577: wdqs1022 has not been updating since June 8th at around 7 UTC - https://phabricator.wikimedia.org/T396577 [13:20:59] PROBLEM - Hadoop NodeManager on 
an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:20:59] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp3081*} and A:cp - 9.2.10 upgrade (T390912) [13:21:21] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on P{cp7002*} and A:cp - 9.2.10 upgrade (T390912) [13:21:25] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [13:22:24] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:22:26] (03PS8) 10Majavah: P:openstack: pdns: auth: Support query_local_address for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448) [13:22:26] (03PS5) 10Majavah: P:openstack: pdns: auth: Explicitely configure IPs to bind on [puppet] - 10https://gerrit.wikimedia.org/r/1155603 (https://phabricator.wikimedia.org/T396448) [13:22:26] (03PS7) 10Majavah: P:openstack: pdns: recursor: Support binding on multiple addresses [puppet] - 10https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448) [13:22:27] (03PS5) 10Majavah: hieradata: Add codfw1dev v6 auth DNS IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155613 (https://phabricator.wikimedia.org/T396448) [13:22:28] (03PS5) 10Majavah: hieradata: Add codfw1dev v6 recursive DNS IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155614 (https://phabricator.wikimedia.org/T396448) [13:22:36] !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1156344|Revert "ores-extension: enable oresUI for the second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:22:49] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:23:03] (03PS6) 10Ladsgroup: mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) [13:23:03] (03PS2) 10Ladsgroup: tables-catalog: Fix visibility of four tables based on maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1155336 (https://phabricator.wikimedia.org/T363581) [13:23:28] !log gkyziridis@deploy1003 gkyziridis: Continuing with sync [13:23:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T395241)', diff saved to https://phabricator.wikimedia.org/P77836 and previous config saved to /var/cache/conftool/dbconfig/20250612-132329-fceratto.json [13:23:32] RESOLVED: KubernetesCalicoDown: dse-k8s-worker1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1012.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:23:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [13:23:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T395241)', diff saved to https://phabricator.wikimedia.org/P77837 and previous config saved to /var/cache/conftool/dbconfig/20250612-132356-fceratto.json [13:24:04] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5955/console" [puppet] - 10https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [13:24:05] I will create a new patch without the problematic wikis (simplewiki, trwiki) and if there is time I will deploy [13:25:10] (03CR) 10Scott French: 
"Thanks, all!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [13:25:16] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: pdns: auth: Support query_local_address for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1155220 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [13:25:22] (03CR) 10Filippo Giunchedi: "To be trialed next week" [puppet] - 10https://gerrit.wikimedia.org/r/1156342 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [13:25:48] (03CR) 10Filippo Giunchedi: "To be merged next week assuming the trial in I04910bd32 goes well" [puppet] - 10https://gerrit.wikimedia.org/r/1156343 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [13:26:11] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5957/console" [puppet] - 10https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [13:26:16] (03CR) 10Scott French: [V:03+2] "Verified to build successfully with local docker-pkg." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [13:26:18] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5956/console" [puppet] - 10https://gerrit.wikimedia.org/r/1155603 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [13:26:30] (03PS1) 10Ilias Sarantopoulos: ores-extension: enable ores extension UI for second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156349 (https://phabricator.wikimedia.org/T395823) [13:26:32] (03CR) 10Clément Goubert: [C:03+2] team-sre/mw-cron: wikidata-updatequeryservicelag alert [alerts] - 10https://gerrit.wikimedia.org/r/1156310 (https://phabricator.wikimedia.org/T395814) (owner: 10Clément Goubert) [13:26:45] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on P{cp7002*} and A:cp - 9.2.10 upgrade (T390912) [13:26:49] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [13:26:52] (03CR) 10Scott French: [V:03+2 C:03+2] httpd: introduce -bookworm track and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [13:26:55] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: pdns: auth: Explicitely configure IPs to bind on [puppet] - 10https://gerrit.wikimedia.org/r/1155603 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [13:27:00] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: pdns: recursor: Support binding on multiple addresses [puppet] - 10https://gerrit.wikimedia.org/r/1155228 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [13:27:46] (03Merged) 10jenkins-bot: team-sre/mw-cron: wikidata-updatequeryservicelag alert [alerts] - 10https://gerrit.wikimedia.org/r/1156310 
(https://phabricator.wikimedia.org/T395814) (owner: 10Clément Goubert) [13:27:51] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:28:18] 06SRE, 06Infrastructure-Foundations, 10netops: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10909127 (10cmooney) 05Open→03Resolved [13:28:30] (03CR) 10Gkyziridis: [C:03+1] "LGTM! Thnx." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156349 (https://phabricator.wikimedia.org/T395823) (owner: 10Ilias Sarantopoulos) [13:28:47] (03PS16) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) [13:28:54] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:29:04] 10ops-eqiad, 06SRE, 06DC-Ops: Decom eqiad row B <-> cloudsw links - https://phabricator.wikimedia.org/T391489#10909130 (10cmooney) [13:29:11] (03PS2) 10Ilias Sarantopoulos: ores-extension: enable ores extension UI for second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156349 (https://phabricator.wikimedia.org/T395823) [13:29:37] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5958/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155613 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [13:30:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [13:30:27] !log gkyziridis@deploy1003 Finished scap sync-world: Backport for 
[[gerrit:1156344|Revert "ores-extension: enable oresUI for the second batch of wikis"]] (duration: 10m 01s) [13:30:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156349 (https://phabricator.wikimedia.org/T395823) (owner: 10Ilias Sarantopoulos) [13:30:42] PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:30:53] (03CR) 10Ladsgroup: "After this change and above, these tables will be added to list of private tables:" [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:31:07] The revert patch is being merged right now. I would like to deploy this: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1156349 , after you finish with the rest of the deployments. Is it ok if I extend the deployment window a little bit?
[13:31:10] PROBLEM - Hadoop NodeManager on an-worker1205 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:31:22] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [13:31:25] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5959/co" [puppet] - 10https://gerrit.wikimedia.org/r/1155614 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [13:32:00] I am finished with the revert deployment. [13:32:43] urbanecm: Revert patch is already deployed. If anybody else wants they can proceed, otherwise my second patch is ready for deployment. [13:32:52] 10ops-codfw, 06SRE, 06DC-Ops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10909157 (10cmooney) [13:33:14] georgekyz: feel free to continue [13:33:19] thnx so much [13:33:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10909159 (10cmooney) Just a note to say I have moved 2047-2050 to the private1-b-codfw vlan rather than the per-rack ones there were on. I've... 
[13:33:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156349 (https://phabricator.wikimedia.org/T395823) (owner: 10Ilias Sarantopoulos) [13:34:26] (03PS1) 10Jforrester: diffConfig: Add a quick list of affected wikis to the end of the output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156351 [13:34:35] (03Merged) 10jenkins-bot: ores-extension: enable ores extension UI for second batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156349 (https://phabricator.wikimedia.org/T395823) (owner: 10Ilias Sarantopoulos) [13:34:40] (03PS1) 10Jgreen: Change DMARC aggregate report address for donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1156352 (https://phabricator.wikimedia.org/T394788) [13:34:59] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1156349|ores-extension: enable ores extension UI for second batch of wikis (T395823)]] [13:35:03] T395823: [batch #2] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395823 [13:35:45] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:35:48] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [13:35:48] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet [13:36:10] RECOVERY - Hadoop NodeManager on an-worker1205 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:36:13] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [13:36:25] !log andrew@cumin1002 START - Cookbook 
sre.hosts.reimage for host cloudcephosd1014.eqiad.wmnet with OS bullseye [13:36:49] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2034.codfw.wmnet [13:37:13] !log gkyziridis@deploy1003 gkyziridis, isaranto: Backport for [[gerrit:1156349|ores-extension: enable ores extension UI for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:37:38] !log failover Ganeti master in eqiad to ganeti1046 [13:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:48] (03CR) 10Brouberol: admin_ng: define a priority class optional environment feature (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [13:39:02] !log gkyziridis@deploy1003 gkyziridis, isaranto: Continuing with sync [13:39:08] (03PS1) 10Majavah: hieradata: Update Striker to 2025-06-12-103158-production [puppet] - 10https://gerrit.wikimedia.org/r/1156353 (https://phabricator.wikimedia.org/T364605) [13:39:57] (03CR) 10Majavah: [C:03+2] hieradata: Update Striker to 2025-06-12-103158-production [puppet] - 10https://gerrit.wikimedia.org/r/1156353 (https://phabricator.wikimedia.org/T364605) (owner: 10Majavah) [13:40:08] PROBLEM - ganeti-wconfd running on ganeti1048 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:40:52] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:41:24] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1014.eqiad.wmnet with OS bullseye [13:41:28] !log andrew@cumin1002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:41:38] (03CR) 10JMeybohm: CI: Ensure fixtures are available during tasklist creation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156346 (https://phabricator.wikimedia.org/T396234) (owner: 10JMeybohm) [13:41:45] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:41:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T395241)', diff saved to https://phabricator.wikimedia.org/P77838 and previous config saved to /var/cache/conftool/dbconfig/20250612-134149-fceratto.json [13:42:21] (03PS8) 10JMeybohm: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) [13:42:51] (03PS6) 10JMeybohm: calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) [13:43:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#10909206 (10MoritzMuehlenhoff) [13:43:07] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10909207 (10cmooney) p:05Medium→03Low > We could (ab)use the FHRP group feature to group members of a MC-LAG and add common variables... 
[13:43:51] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:44:02] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:44:08] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:44:13] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:44:51] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2034.codfw.wmnet [13:44:58] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:45:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2034.codfw.wmnet [13:45:13] 06SRE, 06Infrastructure-Foundations, 10netops: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635#10909214 (10cmooney) 05Open→03Resolved [13:45:14] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:45:35] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:45:59] !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156349|ores-extension: enable ores extension UI for second batch of wikis (T395823)]] (duration: 11m 00s) [13:46:03] !log installing mariadb security updates (as shipped in Debian, not the wmf-mariadb packages we use for the main mariadb clusters) [13:46:03] T395823: [batch #2] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395823 [13:46:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:44] deployment finished. Thank you urbanecm [13:47:14] (03CR) 10Alexandros Kosiaris: [C:03+1] "1 pedantic comment, otherwise LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156346 (https://phabricator.wikimedia.org/T396234) (owner: 10JMeybohm) [13:47:41] 10ops-eqiad, 06SRE, 06SRE-OnFire, 10Cassandra, and 4 others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10909228 (10Eevans) >>! In T395954#10884228, @Jhancock.wm wrote: > i have 12 x 480GB drives readily available on site >>! In T395955#10906453, @VRiley-WMF wrot... [13:47:42] (03PS7) 10Brouberol: admin_ng: define a priority class optional environment feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) [13:47:49] 10ops-codfw, 06SRE, 06SRE-OnFire, 10Cassandra, and 3 others: additional sessionstore expansion — codfw - https://phabricator.wikimedia.org/T395954#10909230 (10Eevans) >>! In T395954#10884228, @Jhancock.wm wrote: > i have 12 x 480GB drives readily available on site >>! In T395955#10906453, @VRiley-WMF wrot... [13:47:57] (03CR) 10Brouberol: admin_ng: define a priority class optional environment feature (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156319 (https://phabricator.wikimedia.org/T395107) (owner: 10Brouberol) [13:48:36] (03PS1) 10Ssingh: hiera: durum: revert ECH experiment [puppet] - 10https://gerrit.wikimedia.org/r/1156355 (https://phabricator.wikimedia.org/T205378) [13:48:53] 06SRE, 06Infrastructure-Foundations, 10netops: Sub-optimal cloud routing for WMCS in eqiad when link fails - https://phabricator.wikimedia.org/T367203#10909233 (10cmooney) 05Open→03Resolved This problem is now resolved as we are using IBGP with next-hops announced as the loopbacks of each switch, and... 
[13:50:28] (03PS2) 10JMeybohm: CI: Ensure fixtures are available during tasklist creation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156346 (https://phabricator.wikimedia.org/T396234) [13:50:50] (03CR) 10JMeybohm: CI: Ensure fixtures are available during tasklist creation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156346 (https://phabricator.wikimedia.org/T396234) (owner: 10JMeybohm) [13:51:22] (03PS1) 10Jelto: gitlab: add bwlimit to backup rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1156356 [13:51:23] (03CR) 10JMeybohm: [V:03+2 C:03+2] "Skipping full CI run for a comment typo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156346 (https://phabricator.wikimedia.org/T396234) (owner: 10JMeybohm) [13:51:45] (03CR) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [13:51:56] 06SRE, 06Infrastructure-Foundations, 10netops: Create cookbook to set up ganeti host network - https://phabricator.wikimedia.org/T378346#10909245 (10cmooney) 05Open→03Declined [13:53:29] (03PS2) 10Scott French: shellbox: define httpd image name in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156354 (https://phabricator.wikimedia.org/T378128) [13:53:40] (03PS6) 10Majavah: hieradata: Add codfw1dev v6 auth DNS IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155613 (https://phabricator.wikimedia.org/T396448) [13:53:40] (03PS6) 10Majavah: hieradata: Add codfw1dev v6 recursive DNS IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155614 (https://phabricator.wikimedia.org/T396448) [13:54:12] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5961/console" [puppet] - 10https://gerrit.wikimedia.org/r/1156356 (owner: 10Jelto) [13:54:12] (03CR) 10Ssingh: [V:03+1] "PCC 
SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5960/c" [puppet] - 10https://gerrit.wikimedia.org/r/1156355 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:54:46] (03CR) 10CI reject: [V:04-1] coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [13:54:50] (03CR) 10Arnaudb: [C:03+1] gitlab: add bwlimit to backup rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1156356 (owner: 10Jelto) [13:54:58] (03CR) 10CI reject: [V:04-1] calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [13:55:24] PROBLEM - ganeti-wconfd running on ganeti2033 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:55:40] (03PS9) 10JMeybohm: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) [13:55:43] (03PS2) 10Jelto: gitlab: add bwlimit to backup rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1156356 [13:55:46] (03PS7) 10JMeybohm: calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) [13:55:56] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:56:17] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [13:56:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to 
https://phabricator.wikimedia.org/P77839 and previous config saved to /var/cache/conftool/dbconfig/20250612-135657-fceratto.json [13:57:06] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:57:18] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:57:18] (03CR) 10Arnaudb: [C:03+1] "as long as we know where to find the value to edit, it makes sense to hardcode a limit for all backups, lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1156356 (owner: 10Jelto) [13:57:25] !log upload liberica 0.18 to apt.wm.o (bookworm-wikimedia) - T396751 [13:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:28] T396751: liberica forwarding plane fails to start on systems with 48 CPUs using katran - https://phabricator.wikimedia.org/T396751 [13:57:30] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:57:57] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:58:36] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:58:51] (03CR) 10Jelto: [V:03+1] "I'd prefer a module variable but the rsync logic is inside a shell script which should not use templates. So I'll hardcode it for now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1156356 (owner: 10Jelto) [13:59:04] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [14:01:18] urbanecm: should I schedule my patches to next window [14:02:33] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1014.eqiad.wmnet'] [14:03:41] RECOVERY - Hadoop NodeManager on an-worker1079 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:04:55] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1014.eqiad.wmnet with OS bullseye [14:06:21] (03PS17) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) [14:06:59] (03CR) 10CI reject: [V:04-1] pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [14:07:12] (03PS18) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) [14:09:23] (03PS1) 10Vgutierrez: Revert "hiera: Depool lvs7002" [puppet] - 10https://gerrit.wikimedia.org/r/1156361 (https://phabricator.wikimedia.org/T396561) [14:09:44] (03CR) 10CI reject: [V:04-1] Revert "hiera: Depool lvs7002" [puppet] - 10https://gerrit.wikimedia.org/r/1156361 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:11:50] (03PS2) 10Vgutierrez: Revert "hiera: Depool lvs7002" [puppet] - 10https://gerrit.wikimedia.org/r/1156361 (https://phabricator.wikimedia.org/T396561) [14:12:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2170', diff saved to https://phabricator.wikimedia.org/P77840 and previous config saved to /var/cache/conftool/dbconfig/20250612-141205-fceratto.json [14:12:14] (03CR) 10CI reject: [V:04-1] Revert "hiera: Depool lvs7002" [puppet] - 10https://gerrit.wikimedia.org/r/1156361 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:12:33] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet [14:12:38] (03PS3) 10Vgutierrez: Revert "hiera: Depool lvs7002" [puppet] - 10https://gerrit.wikimedia.org/r/1156361 (https://phabricator.wikimedia.org/T396561) [14:13:08] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10909311 (10ifried) approved [14:13:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156361 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:18:05] (03CR) 10Kamila Součková: [C:03+1] shellbox: define httpd image name in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156354 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [14:18:21] (03PS19) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) [14:19:31] (03CR) 10Hnowlan: [C:03+1] shellbox: define httpd image name in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156354 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [14:20:15] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti2033.codfw.wmnet [14:21:11] (03PS1) 10Jforrester: tables-catalogue: List wikifunctionsclient_usage [puppet] - 10https://gerrit.wikimedia.org/r/1156365 [14:21:11] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1014.eqiad.wmnet with reason: host reimage [14:22:05] 
(03CR) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [14:23:16] (03CR) 10CI reject: [V:04-1] tables-catalogue: List wikifunctionsclient_usage [puppet] - 10https://gerrit.wikimedia.org/r/1156365 (owner: 10Jforrester) [14:24:34] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1014.eqiad.wmnet with reason: host reimage [14:27:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T395241)', diff saved to https://phabricator.wikimedia.org/P77841 and previous config saved to /var/cache/conftool/dbconfig/20250612-142712-fceratto.json [14:27:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [14:27:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T395241)', diff saved to https://phabricator.wikimedia.org/P77842 and previous config saved to /var/cache/conftool/dbconfig/20250612-142738-fceratto.json [14:28:17] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2033.codfw.wmnet [14:28:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2033.codfw.wmnet [14:28:32] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@cb6b18b]: hotfix-bump SEAL to v0.8.0 [14:30:25] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@cb6b18b]: hotfix-bump SEAL to v0.8.0 (duration: 02m 24s) [14:31:46] (03PS2) 10Jforrester: tables-catalogue: List wikifunctionsclient_usage [puppet] - 10https://gerrit.wikimedia.org/r/1156365 [14:34:13] (03CR) 10Tiziano Fogli: [C:03+1] thanos: activate store memcached across the board [puppet] - 10https://gerrit.wikimedia.org/r/1156343 
(https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [14:35:34] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:36:37] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:36:57] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:37:01] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:37:06] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:37:14] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:37:19] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:38:16] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:38:24] 10ops-eqiad, 06SRE, 06SRE-OnFire, 10Cassandra, and 4 others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10909404 (10VRiley-WMF) Sure thing! 
[14:39:45] (03PS1) 10Ssingh: wikimedia-dns.org: remove TYPE65 record [dns] - 10https://gerrit.wikimedia.org/r/1156373 (https://phabricator.wikimedia.org/T205378) [14:40:10] (03PS2) 10Ssingh: wikimedia-dns.org: remove TYPE65 record for check [dns] - 10https://gerrit.wikimedia.org/r/1156373 (https://phabricator.wikimedia.org/T205378) [14:43:07] 10ops-eqiad, 06SRE, 06SRE-OnFire, 10Cassandra, and 4 others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10909421 (10VRiley-WMF) 05Open→03In progress [14:44:14] (03PS1) 10Andrew Bogott: cloudcephosd1014: update nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156375 (https://phabricator.wikimedia.org/T309789) [14:44:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T395241)', diff saved to https://phabricator.wikimedia.org/P77843 and previous config saved to /var/cache/conftool/dbconfig/20250612-144419-fceratto.json [14:44:56] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1014: update nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156375 (https://phabricator.wikimedia.org/T309789) (owner: 10Andrew Bogott) [14:47:04] (03PS1) 10Btullis: Bump up the CPU and RAM resources for airflow related namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156377 (https://phabricator.wikimedia.org/T388378) [14:47:45] 10ops-eqiad, 06SRE, 06SRE-OnFire, 10Cassandra, and 4 others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10909436 (10VRiley-WMF) 05In progress→03Open The drives have been added. Due to these being reused, please let us know if one of them is bad or throws issue... 
[14:48:36] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs7002.magru.wmnet [14:48:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs7002.magru.wmnet [14:49:03] (03PS1) 10Urbanecm: LinkRecommendationStore: Query templatelinks on the main DB [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156378 (https://phabricator.wikimedia.org/T396680) [14:49:17] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1014.eqiad.wmnet with OS bullseye [14:49:43] jouncebot: nowandnext [14:49:43] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [14:49:43] In 1 hour(s) and 10 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1600) [14:49:52] (03CR) 10Urbanecm: [C:03+2] LinkRecommendationStore: Query templatelinks on the main DB [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156378 (https://phabricator.wikimedia.org/T396680) (owner: 10Urbanecm) [14:50:36] (03CR) 10Scott French: "Thanks for the reviews!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1156354 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [14:50:56] (03CR) 10Scott French: [C:03+2] shellbox: define httpd image name in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156354 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [14:52:43] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:52:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10909469 (10VRiley-WMF) 05Open→03In progress adding drives now [14:53:20] (03Merged) 10jenkins-bot: shellbox: define httpd image name in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156354 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [14:54:55] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1015.eqiad.wmnet'] [14:55:15] (03PS1) 10Brouberol: conftool-data: add dse-k8s-worker101[23] to the dse-k8s-eqiad ingress backends [puppet] - 10https://gerrit.wikimedia.org/r/1156379 (https://phabricator.wikimedia.org/T395557) [14:56:23] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update for codfw - jhancock@cumin2002" [14:56:28] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1156379 (https://phabricator.wikimedia.org/T395557) (owner: 10Brouberol) [14:56:37] (03CR) 10Brouberol: [C:03+1] "With the obvious observation that we're technically opening ourselves to overcommitting resources, I think that this is worthwhile, because" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156377 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [14:56:55] (03CR) 10Brouberol: [C:03+2] conftool-data: add dse-k8s-worker101[23] to the dse-k8s-eqiad ingress backends [puppet] - 10https://gerrit.wikimedia.org/r/1156379 (https://phabricator.wikimedia.org/T395557) (owner: 10Brouberol) [14:56:57] jouncebot: nowandnext [14:56:57] No deployments scheduled for the next 1 hour(s) and 3 minute(s) [14:56:57] In 1 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1600) [14:57:09] (03CR) 10CI reject: [V:04-1] LinkRecommendationStore: Query templatelinks on the main DB [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156378 (https://phabricator.wikimedia.org/T396680) (owner: 10Urbanecm) [14:57:09] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [14:57:11] I’d like to do a small deployment, lmk if that’s a bad idea right now [14:57:13] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [14:57:17] (mediawiki) [14:57:24] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [14:57:28] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [14:57:39] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [14:57:43] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [14:57:54] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply 
[14:57:58] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [14:58:00] (03PS1) 10Alexandros Kosiaris: pontoon: Clarify push branch, add how to mess with private repo [puppet] - 10https://gerrit.wikimedia.org/r/1156380 [14:58:09] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [14:58:13] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [14:58:23] Lucas_WMDE: you might see me deploying shellbox in the background, but it's functionally a noop :) [14:58:25] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [14:58:26] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM! Neat" [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [14:58:28] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [14:58:29] ok :) [14:58:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update for codfw - jhancock@cumin2002" [14:58:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:59:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P77845 and previous config saved to /var/cache/conftool/dbconfig/20250612-145927-fceratto.json [14:59:54] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Depool lvs7002" [puppet] - 10https://gerrit.wikimedia.org/r/1156361 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [15:01:07] (03PS2) 10Alexandros Kosiaris: pontoon: Clarify push branch, add how to mess with private repo [puppet] - 10https://gerrit.wikimedia.org/r/1156380 [15:01:11] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for 
hosts ['cloudcephosd1015.eqiad.wmnet'] [15:02:17] (03CR) 10Urbanecm: [C:03+2] "retrying, seems irrelevant" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156378 (https://phabricator.wikimedia.org/T396680) (owner: 10Urbanecm) [15:02:36] Lucas_WMDE: i'm trying to get a GE backport out (train blocker) [15:02:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10909544 (10VRiley-WMF) an-worker1157 has been upgraded, moving on to the other two [15:02:45] if you have a config one, you'll be faster probably [15:03:14] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [15:03:20] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [15:03:21] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [15:03:27] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [15:03:28] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [15:03:34] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [15:03:35] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:03:41] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:03:42] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [15:03:48] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [15:03:49] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [15:03:55] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [15:04:17] urbanecm: ok, I see yours in zuul… [15:04:21] probably better to let you 
go first then [15:04:24] yep, almost there [15:04:32] yup, just reached PostBuildScript [15:04:42] i'll ping you once done then! [15:04:48] great, thanks! [15:05:17] (03Merged) 10jenkins-bot: LinkRecommendationStore: Query templatelinks on the main DB [extensions/GrowthExperiments] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156378 (https://phabricator.wikimedia.org/T396680) (owner: 10Urbanecm) [15:05:30] FIRING: LibericaStaleConfig: Liberica instance lvs7002 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=magru&var-instance=lvs7002 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [15:05:49] ^^ that's me [15:05:52] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1156378|LinkRecommendationStore: Query templatelinks on the main DB (T396680)]] [15:05:57] T396680: Table 'hewiki.templatelinks' doesn't exist - https://phabricator.wikimedia.org/T396680 [15:07:34] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:08:02] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1156378|LinkRecommendationStore: Query templatelinks on the main DB (T396680)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:08:24] !log re-pooling lvs7002 using katran - T396561 [15:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:28] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [15:08:30] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs7002.magru.wmnet} and A:liberica [15:08:41] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [15:08:44] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [15:08:45] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [15:08:48] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [15:08:48] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs7002.magru.wmnet} and A:liberica [15:08:49] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [15:08:52] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [15:08:54] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:08:57] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:08:58] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [15:09:01] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [15:09:02] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [15:09:06] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [15:09:14] hmm, how do i run a maintenance script on a debug host? [15:09:27] (03CR) 10Aleksandar Mastilovic: "LGTM! 
A somewhat educational (for me) question: Does the absence of" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156377 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [15:09:57] ok, mwscript works on mwdebug*, but not on mwmaint* [15:10:30] RESOLVED: LibericaStaleConfig: Liberica instance lvs7002 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=magru&var-instance=lvs7002 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [15:11:24] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cp2045-57 to codfw - jhancock@cumin2002" [15:11:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cp2045-57 to codfw - jhancock@cumin2002" [15:11:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:11:37] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2045 [15:11:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2045 [15:11:47] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2046 [15:11:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2046 [15:11:57] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2047 [15:12:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2047 [15:12:07] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2048 [15:12:15] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host cp2048 [15:12:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2049 [15:12:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2049 [15:12:30] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2050 [15:12:31] !log jmm@cumin1003 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet [15:12:41] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1015.eqiad.wmnet with OS bullseye [15:12:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2050 [15:12:47] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2051 [15:12:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2051 [15:13:07] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2052 [15:13:08] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1015 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156302 (https://phabricator.wikimedia.org/T309789) (owner: 10Andrew Bogott) [15:13:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2052 [15:13:18] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2053 [15:13:19] (03CR) 10Aleksandar Mastilovic: [C:03+1] "Forgot to vote in the previous comment." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1156377 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [15:13:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2053 [15:13:27] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2054 [15:13:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2054 [15:13:38] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2055 [15:13:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2055 [15:13:48] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2056 [15:13:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2056 [15:13:58] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2057 [15:14:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10909608 (10VRiley-WMF) [15:14:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P77846 and previous config saved to /var/cache/conftool/dbconfig/20250612-151434-fceratto.json [15:14:46] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-c5-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396506#10909612 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm provisioned server, ran netbox and switch interface scripts. 
[15:14:50] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1253.eqiad.wmnet [15:15:10] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-b3-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396635#10909624 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm provisioned server, ran netbox and switch interface scripts. [15:15:24] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1253.eqiad.wmnet [15:15:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10909629 (10VRiley-WMF) @Stevemunene replaced drives, this is good to go [15:15:30] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-b5-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396638#10909631 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm provisioned server, ran netbox and switch interface scripts. [15:15:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10909639 (10VRiley-WMF) 05In progress→03Open [15:15:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-c1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396639#10909643 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm provisioned server, ran netbox and switch interface scripts. [15:16:01] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10909646 (10RobH) ` Management Password: db1253.eqiad.wmnet (Gen 15): starting db1253.eqiad.wmnet (SSD): update db1253.eqiad.wmnet (SSD): current version:... 
[15:16:59] !log urbanecm@deploy1003 urbanecm: Continuing with sync [15:17:01] jhancock@cumin2002 configure-switch-interfaces (PID 411771) is awaiting input [15:17:05] unable to break stuff, proceeding... [15:17:17] ...but the ability to run maint scripts outside and inside debug would help [15:19:07] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1015.eqiad.wmnet with OS bullseye [15:19:55] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1015.eqiad.wmnet with OS bullseye [15:20:17] FIRING: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:22] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-d1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396641#10909691 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm provisioned server, ran netbox and switch interface scripts. [15:20:39] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-d5-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396642#10909696 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm provisioned server, ran netbox and switch interface scripts. [15:21:08] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-d6-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396643#10909700 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm provisioned server, ran netbox and switch interface scripts. 
[15:21:31] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-e1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396657#10909705 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm provisioned server, ran netbox and switch interface scripts. [15:21:44] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-e3-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396658#10909709 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm provisioned server, ran netbox and switch interface scripts. [15:21:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10909713 (10RobH) Ok, after I ran into the same issue and bugged Riccardo, it turns out it was an easy fix and I just didn't quite grep it and now do. I thought each of... [15:23:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2057 [15:23:58] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156378|LinkRecommendationStore: Query templatelinks on the main DB (T396680)]] (duration: 18m 06s) [15:24:02] T396680: Table 'hewiki.templatelinks' doesn't exist - https://phabricator.wikimedia.org/T396680 [15:24:04] done [15:24:06] Lucas_WMDE: over to you [15:24:16] thanks! [15:24:32] 10ops-eqiad, 06SRE, 06SRE-OnFire, 10Cassandra, and 4 others: additional sessionstore expansion — eqiad - https://phabricator.wikimedia.org/T395955#10909729 (10Eevans) 05Open→03Resolved Thanks @VRiley-WMF; All four seem to be present on each! This time around I'll be reimaging to put them first, so... 
[15:24:44] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1015.eqiad.wmnet with OS bullseye [15:24:54] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1015.eqiad.wmnet'] [15:25:10] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:25:13] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2057 [15:25:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2057 [15:25:38] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:25:39] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1253.eqiad.wmnet [15:25:49] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1253.eqiad.wmnet [15:25:58] * Lucas_WMDE deploying [15:28:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:40] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1253.eqiad.wmnet [15:29:20] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-f1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T396659#10909784 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm provisioned server, ran dns and switch interface cookbook [15:29:21] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1253.eqiad.wmnet [15:29:30] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1253.eqiad.wmnet [15:29:39] !log robh@cumin2002 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1253.eqiad.wmnet [15:29:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T395241)', diff saved to https://phabricator.wikimedia.org/P77847 and previous config saved to /var/cache/conftool/dbconfig/20250612-152942-fceratto.json [15:30:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [15:30:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T395241)', diff saved to https://phabricator.wikimedia.org/P77848 and previous config saved to /var/cache/conftool/dbconfig/20250612-153008-fceratto.json [15:30:39] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#10909796 (10Jhancock.wm) I can do the ones in codfw for sure. do these servers need to be depooled before we start on this list? [15:30:57] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10909802 (10RobH) Not sure what I'm doing wrong: > robh@cumin2002:~$ sudo cookbook sre.hardware.upgrade-firmware -c ssd "db1253.*" > Acquired lock for... [15:31:32] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1015.eqiad.wmnet'] [15:31:53] (03PS3) 10Filippo Giunchedi: pontoon: Clarify push branch, add how to mess with private repo [puppet] - 10https://gerrit.wikimedia.org/r/1156380 (owner: 10Alexandros Kosiaris) [15:32:08] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1015.eqiad.wmnet with OS bullseye [15:32:47] (03CR) 10Filippo Giunchedi: [C:03+2] "Thank you for the patch, I've tweaked a couple of things and LGTM! 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/1156380 (owner: 10Alexandros Kosiaris) [15:32:58] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1253.eqiad.wmnet [15:34:38] !log robh@cumin2002 START - Cookbook sre.hosts.reboot-single for host db1253.eqiad.wmnet [15:35:07] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10909840 (10RobH) Bah, fixed, was SSD directory not STORAGE, thanks Riccardo! [15:35:09] !log lucaswerkmeister-wmde Deployed security patch for T396685 [15:37:30] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10909862 (10ayounsi) LGTM! [15:37:54] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:39:08] (03PS5) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 [15:40:26] andrew@cumin1002 reimage (PID 1615705) is awaiting input [15:40:48] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1015.eqiad.wmnet with OS bullseye [15:40:52] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1015.eqiad.wmnet'] [15:41:30] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#10909875 (10ayounsi) In theory not, but could be worth checking with the service owners just in case [15:42:24] 10SRE-tools, 06Infrastructure-Foundations, 10netops: Evaluate automatic MAC-based DHCP for production servers - https://phabricator.wikimedia.org/T396712#10909876 (10ayounsi) [15:42:25] 10ops-codfw, 06SRE, 06DC-Ops: cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#10909877 (10ayounsi) [15:42:26] 10ops-codfw, 10ops-eqiad, 06SRE, 
06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#10909878 (10ayounsi) [15:44:15] !log lucaswerkmeister-wmde Deployed security patch for T396685 [15:44:40] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#10909891 (10Volans) Running the provision cookbook (with the appropriate options for an existing host) might or might not trigger a host reboot based on what configurations are changed. So it... [15:45:17] * Lucas_WMDE done deploying [15:45:49] !log removed python3-conftool-dbctl package from puppetmaster[12]001 - T395696 [15:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:52] T395696: Move ExternalStore config out of mediawiki config - https://phabricator.wikimedia.org/T395696 [15:48:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10909897 (10RobH) [15:49:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10909902 (10RobH) a:05RobH→03Ladsgroup Amir, Apologies, when I tested this on other hosts, they were a different chassis model so the firmware file lives in a diffe... 
[15:49:44] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host db1253.eqiad.wmnet [15:49:47] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1253.eqiad.wmnet [15:49:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T395241)', diff saved to https://phabricator.wikimedia.org/P77850 and previous config saved to /var/cache/conftool/dbconfig/20250612-154947-fceratto.json [15:52:20] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#10909951 (10RobH) [15:55:18] (03CR) 10Btullis: [C:03+2] Bump up the CPU and RAM resources for airflow related namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156377 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [15:58:26] (03CR) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1154085 (owner: 10CDobbins) [15:58:31] (03CR) 10Elukey: [C:03+2] admin: move cmelo to ssh user [puppet] - 10https://gerrit.wikimedia.org/r/1155717 (https://phabricator.wikimedia.org/T395966) (owner: 10Elukey) [15:58:46] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1015.eqiad.wmnet'] [15:59:25] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10909994 (10elukey) [15:59:41] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#10909995 (10RobH) So I went to test out my ssh config and since I've reimaged onto a new laptop I just now realize I failed to migrate my frack settings properly. @jgreen:... [16:00:05] jhathaway and moritzm: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:21] (03PS1) 10Jgiannelos: changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 [16:00:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10909997 (10elukey) Credentials are being propagated as we speak, should be ready and usable during the next hour. @cmelo Please test an ssh connection to a host like stat1... [16:00:33] PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:02:00] 06SRE, 06Data-Engineering: WE 5.4 FY 25/26: Improve automata detection at the edge and pass it to the refinery pipeline - https://phabricator.wikimedia.org/T396562#10910014 (10CDanis) >>! In T396562#10907844, @JAllemandou wrote: > This idea is great, thank you @Joe for filling this. > > I concur with the idea... 
[16:02:09] (03PS8) 10Cwhite: logstash: add filter_on_template_v2 [puppet] - 10https://gerrit.wikimedia.org/r/1154348 (https://phabricator.wikimedia.org/T234565) [16:02:10] (03PS2) 10Jgiannelos: changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) [16:02:13] (03Merged) 10jenkins-bot: Bump up the CPU and RAM resources for airflow related namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156377 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [16:02:44] (03CR) 10Eevans: [C:03+2] cassandra-dev2001: configure for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1155756 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [16:04:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P77851 and previous config saved to /var/cache/conftool/dbconfig/20250612-160454-fceratto.json [16:05:03] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1015.eqiad.wmnet with OS bullseye [16:05:42] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10910054 (10Volans) Thanks, I'm checking with @wiki_willy too for the accounting side before proceeding to be sure. 
[16:10:01] (03PS1) 10Effie Mouzeli: kubernetes: create mediawiki_experimental profile [puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) [16:10:22] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host cassandra-dev2001.codfw.wmnet with OS bullseye [16:10:32] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10910083 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2001.... [16:10:56] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1015.eqiad.wmnet with OS bullseye [16:11:04] 10ops-codfw, 06SRE, 06SRE-OnFire, 10Cassandra, and 3 others: additional sessionstore expansion — codfw - https://phabricator.wikimedia.org/T395954#10910092 (10Jhancock.wm) @Eevans 4 drives installed in the servers. [16:12:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-magru:xe-0/1/2 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:14:02] 10ops-codfw, 06SRE, 06DC-Ops: Moving extra 1G port to make 10G space on cloud rack. 
- https://phabricator.wikimedia.org/T396363#10910100 (10Jhancock.wm) @cmooney do you want to do it friday morning (for me) say 1400 UTC [16:14:21] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1015.eqiad.wmnet with OS bullseye [16:14:39] PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:15:35] (03PS10) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [16:15:39] RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:15:53] (03PS1) 10Jgiannelos: changeprop: Remove rules related to page/title (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156396 [16:18:50] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1015.eqiad.wmnet with OS bullseye [16:19:07] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1015.eqiad.wmnet with OS bullseye [16:19:14] 10ops-codfw, 06SRE, 06SRE-OnFire, 10Cassandra, and 3 others: additional sessionstore expansion — codfw - https://phabricator.wikimedia.org/T395954#10910124 (10Eevans) 05Open→03Resolved a:03Jhancock.wm Thanks @Jhancock.wm ; All four seem to be present on each! I'll be reimaging to put them first,... 
[16:20:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P77852 and previous config saved to /var/cache/conftool/dbconfig/20250612-162002-fceratto.json [16:21:01] jouncebot nowandnext [16:21:02] For the next 0 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1600) [16:21:02] In 0 hour(s) and 38 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1700) [16:21:02] In 0 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1700) [16:21:38] any objections to a backport for a train blocker? [16:23:00] (pending code review.) [16:23:44] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:25:29] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [16:25:33] RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:26:02] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1015.eqiad.wmnet with OS bullseye [16:26:09] (03PS2) 10Jgiannelos: changeprop: Remove rules related to page/title (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156396 [16:26:15] !log volans@cumin1003 START - Cookbook sre.dns.netbox [16:29:01] brennen: we have some work planned for the infra window at 17:00, but as long as it's fine if we slide into the train window a bit at 18:00 (in the event we need to hold for a little while for your backport to complete), no objections :) [16:29:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:29:59] !log volans@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Invert db2241 and db2242 DNS T379757#10908710 - volans@cumin1003" [16:30:02] T379757: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757 [16:30:03] !log volans@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Invert db2241 and db2242 DNS T379757#10908710 - volans@cumin1003" [16:30:03] !log volans@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:30:10] swfrench-wmf: ack, thanks. i'll update once https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1156390 is through review. 
[16:30:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10910160 (10VRiley-WMF) 05Open→03In progress [16:30:44] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:30:47] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:31:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:31:34] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [16:35:06] !log volans@cumin1003 START - Cookbook sre.dns.wipe-cache db2241.mgmt.codfw.wmnet db2242.mgmt.codfw.wmnet on all recursors [16:35:09] !log volans@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db2241.mgmt.codfw.wmnet db2242.mgmt.codfw.wmnet on all recursors [16:35:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T395241)', diff saved to https://phabricator.wikimedia.org/P77853 and previous config saved to /var/cache/conftool/dbconfig/20250612-163509-fceratto.json [16:35:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [16:35:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T395241)', diff saved to https://phabricator.wikimedia.org/P77854 and previous config saved to /var/cache/conftool/dbconfig/20250612-163536-fceratto.json [16:35:56] (03CR) 10Majavah: [C:03+2] hieradata: Add codfw1dev v6 auth DNS IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155613 (https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [16:36:04] (03CR) 10Majavah: [C:03+2] hieradata: Add codfw1dev v6 recursive DNS IPs [puppet] - 10https://gerrit.wikimedia.org/r/1155614 
(https://phabricator.wikimedia.org/T396448) (owner: 10Majavah) [16:38:18] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#10910227 (10Jgreen) >>! In T396649#10909995, @RobH wrote: > So I went to test out my ssh config and since I've reimaged onto a new laptop I just now realize I failed to mig... [16:39:53] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10910243 (10phaultfinder) [16:40:47] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:41:11] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396323#10910259 (10Jhancock.wm) @Marostegui disk has been replaced. Looks physical alerts have cleared. lmk if you need anything else or if we can close the ticket. [16:41:19] (03PS8) 10Btullis: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) [16:42:49] (03CR) 10Dzahn: [C:03+1] gitlab: add bwlimit to backup rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1156356 (owner: 10Jelto) [16:43:32] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:44:32] (03CR) 10Aleksandar Mastilovic: Removing WM Enterprise downloader Puppet configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [16:44:47] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:46:02] (03PS1) 10Federico Ceratto: zarcillo: Allow egress to idp.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156401 (https://phabricator.wikimedia.org/T384810) [16:46:02] (03CR) 10Federico Ceratto: "A tiny change, already tested on the preprod service." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156401 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [16:46:46] (03CR) 10Btullis: Airflow: Add local settings to enable the xcom_sidecar functionality (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [16:48:06] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1015.eqiad.wmnet with OS bookworm [16:48:28] (03PS3) 10Dzahn: Add tj to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1152136 (https://phabricator.wikimedia.org/T393803) (owner: 10Jasmine) [16:49:23] (03CR) 10Dzahn: [C:03+2] gitlab: add bwlimit to backup rsync job [puppet] - 10https://gerrit.wikimedia.org/r/1156356 (owner: 10Jelto) [16:49:39] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2001.codfw.wmnet with OS bullseye [16:49:51] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10910356 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2001.codf... 
[16:49:59] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:53:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T395241)', diff saved to https://phabricator.wikimedia.org/P77856 and previous config saved to /var/cache/conftool/dbconfig/20250612-165320-fceratto.json [16:53:54] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db224[12] - https://phabricator.wikimedia.org/T379757#10910370 (10Volans) 05Open→03Resolved Changes applied, I had to also run the `sudo cookbook sre.dns.wipe-cache db2241.mgmt.codfw.wmnet db2242.mgmt.codfw.wmnet` to make sure... [16:56:10] (03PS1) 10Brennen Bearnes: ParserOutput::collectMetadata: Cast array keys to string [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156404 (https://phabricator.wikimedia.org/T396656) [16:56:14] robh@cumin2002 reimage (PID 479670) is awaiting input [16:56:26] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1015.eqiad.wmnet with OS bookworm [16:56:29] (03CR) 10Brennen Bearnes: [C:03+2] ParserOutput::collectMetadata: Cast array keys to string [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156404 (https://phabricator.wikimedia.org/T396656) (owner: 10Brennen Bearnes) [16:57:29] (03PS1) 10BryanDavis: developer-portal: Bump to 2025-06-12-124643-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156405 [17:00:04] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1700). [17:00:05] swfrench-wmf and jasmine_: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1700). 
[17:01:53] swfrench-wmf: got a rough idea of how long your work will take? i'm inclined to say you should go ahead [17:02:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10910394 (10VRiley-WMF) [17:04:16] jasmine_: you got the green light [17:04:33] ty! [17:04:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10910405 (10VRiley-WMF) 05In progress→03Open @Stevemunene this has been completed [17:05:54] brennen: mutante: I think we're all set up over here, I'd anticipate the mediawiki deployment part of this to take 20+ minutes, but that assumes everything goes to plan :) [17:06:54] k, please go ahead and give me a ping when you're finished. :) [17:06:58] ack [17:07:09] thanks, sounds good [17:08:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P77857 and previous config saved to /var/cache/conftool/dbconfig/20250612-170828-fceratto.json [17:09:01] (03CR) 10Jasmine: [C:03+2] mediawiki/apache: redirect tj.*.org to tg.*.org for all projects [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [17:09:17] PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:09:41] (03CR) 10CI reject: [V:04-1] ParserOutput::collectMetadata: Cast array keys to string [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156404 (https://phabricator.wikimedia.org/T396656) (owner: 10Brennen Bearnes) [17:13:18] (03CR) 10Brennen Bearnes: 
[C:04-1] "I suspect/assume this will pass on a recheck; however holding for the moment while MW infrastructure window work proceeds." [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156404 (https://phabricator.wikimedia.org/T396656) (owner: 10Brennen Bearnes) [17:13:36] !log cmooney@cumin1003 START - Cookbook sre.hosts.dhcp for host cloudcephosd1015.eqiad.wmnet [17:14:05] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:15:05] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:16:39] cmooney@cumin1003 dhcp (PID 1312093) is awaiting input [17:17:54] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-06-12-124643-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156405 (owner: 10BryanDavis) [17:18:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudcephosd1015.eqiad.wmnet [17:19:31] (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-06-12-124643-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156405 (owner: 10BryanDavis) [17:20:28] (03PS1) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly determine latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [17:20:57] (03CR) 10CI reject: [V:04-1] profile::kubernetes::mediawiki_experimental: properly determine latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [17:21:44] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox 
[17:21:54] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:22:10] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:22:32] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:22:50] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:23:03] (03CR) 10Vgutierrez: [C:03+1] varnish: Replace analytics fake headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147912 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:23:04] (03PS2) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly determine latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [17:23:32] (03CR) 10CI reject: [V:04-1] profile::kubernetes::mediawiki_experimental: properly determine latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [17:23:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P77859 and previous config saved to /var/cache/conftool/dbconfig/20250612-172335-fceratto.json [17:23:47] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#10910478 (10RobH) Chatting with Jeff I'm having issues getting my ssh proxy onto frack working. Rather than troubleshoot that when it works for him, I've typed up the foll... 
[17:23:47] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:24:02] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:25:06] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2006 to codfw - jhancock@cumin2002" [17:25:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:25:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2006 to codfw - jhancock@cumin2002" [17:25:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:25:16] (03CR) 10Vgutierrez: "are varnishtests happy?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:25:19] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on db2226 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [17:25:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host sretest2006 [17:25:37] (03CR) 10Arlolra: [C:03+1] "Confirmed that looks like CI flakiness" [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156404 (https://phabricator.wikimedia.org/T396656) (owner: 10Brennen Bearnes) [17:25:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2006 [17:27:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:29:25] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for nokiatest2001.mgmt:22 - https://phabricator.wikimedia.org/T396547#10910488 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm offlined server [17:29:36] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#10910491 (10ayounsi) [17:29:43] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for nokiatest2002.mgmt:22 - https://phabricator.wikimedia.org/T396546#10910493 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm offlined server [17:31:14] 10ops-codfw, 06SRE, 06DC-Ops: cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#10910502 (10Jhancock.wm) fyi, nokiatest2001-2 were decommed. 
[17:31:17] RECOVERY - Hadoop NodeManager on an-worker1146 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:32:47] (03PS2) 10Btullis: Increase thresholds for run_podsandbox and stop_podsandbox in dse-k8s [alerts] - 10https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) [17:34:04] (03CR) 10CI reject: [V:04-1] Increase thresholds for run_podsandbox and stop_podsandbox in dse-k8s [alerts] - 10https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) (owner: 10Btullis) [17:35:12] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-ulsfo and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [17:35:16] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [17:35:25] !log jasmine@deploy1003 Started scap sync-world: Deploying apache2 configuration change for T393803 [17:35:29] T393803: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803 [17:36:20] !log jasmine@deploy1003 jasmine: Deploying apache2 configuration change for T393803 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:36:45] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [17:36:50] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [17:37:03] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#10910508 (10ayounsi) [17:37:52] (03CR) 10BCornwall: [C:03+2] Revert^2 "acmechief: Add pywikipedia.org to the cert list" [puppet] - 10https://gerrit.wikimedia.org/r/1154855 (owner: 10BCornwall) [17:38:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T395241)', diff saved to https://phabricator.wikimedia.org/P77860 and previous config saved to /var/cache/conftool/dbconfig/20250612-173843-fceratto.json [17:39:02] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [17:39:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T395241)', diff saved to https://phabricator.wikimedia.org/P77861 and previous config saved to /var/cache/conftool/dbconfig/20250612-173909-fceratto.json [17:39:39] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:41:19] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:42:33] 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10910517 (10BCornwall) 05Open→03Resolved Thanks, everyone, for getting this through the finish line! 
[17:44:27] !log jasmine@deploy1003 jasmine: Continuing with sync [17:44:39] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:46:12] (03PS6) 10BCornwall: hiera: Add lvs1016 to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) [17:46:12] (03PS2) 10BCornwall: Promote lvs1016 over lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1154905 (https://phabricator.wikimedia.org/T387145) [17:49:09] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:50:44] !log jasmine@deploy1003 Finished scap sync-world: Deploying apache2 configuration change for T393803 (duration: 20m 58s) [17:50:48] T393803: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803 [17:52:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T395241)', diff saved to https://phabricator.wikimedia.org/P77862 and previous config saved to /var/cache/conftool/dbconfig/20250612-175226-fceratto.json [17:53:09] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:53:37] (03PS1) 10Hnowlan: wikifeeds: remove rest-gateway references [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156416 (https://phabricator.wikimedia.org/T367418) [17:53:39] brennen: we're done if you'd like to proceed [17:53:46] jasmine_: right on, ty [17:53:54] (03CR) 10VadymTS1: 
"recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [17:54:04] (03CR) 10Brennen Bearnes: [C:03+2] ParserOutput::collectMetadata: Cast array keys to string [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156404 (https://phabricator.wikimedia.org/T396656) (owner: 10Brennen Bearnes) [17:55:51] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#10910534 (10Jgreen) @RobH Done! ` Disk 12 in Backplane 1 of RAID Controller in Slot 4 DL7C Disk 13 in Backplane 1 of RAID Controller in Slot 4 DL7C ` [17:56:05] (03CR) 10Jgiannelos: [C:04-1] "We still use rest gateway to talk to parsoid. That needs to be changed first before we remove rest gateway" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156416 (https://phabricator.wikimedia.org/T367418) (owner: 10Hnowlan) [17:56:06] (03CR) 10Jasmine: [C:03+2] Add tj to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1152136 (https://phabricator.wikimedia.org/T393803) (owner: 10Jasmine) [17:56:34] (03CR) 10Anzx: [C:03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [17:56:34] (03CR) 10Anzx: [C:03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [17:58:07] (03Merged) 10jenkins-bot: ParserOutput::collectMetadata: Cast array keys to string [core] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156404 (https://phabricator.wikimedia.org/T396656) (owner: 10Brennen Bearnes) [17:59:27] !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1156404|ParserOutput::collectMetadata: Cast array keys to string (T396656)]] [17:59:31] T396656: TypeError: MediaWiki\Parser\ParserOutput::appendJsConfigVar(): Argument #2 ($value) must be of type string, int 
given - https://phabricator.wikimedia.org/T396656 [18:00:06] brennen and dduvall: That opportune time for a MediaWiki train - Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T1800). [18:01:37] (03PS1) 10Bvibber: Fix for multiple charts on same page using mix of transforms [extensions/JsonConfig] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1156420 (https://phabricator.wikimedia.org/T396512) [18:01:52] !log brennen@deploy1003 brennen: Backport for [[gerrit:1156404|ParserOutput::collectMetadata: Cast array keys to string (T396656)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:01:53] (03PS1) 10Bvibber: Fix for multiple charts on same page using mix of transforms [extensions/JsonConfig] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156421 (https://phabricator.wikimedia.org/T396512) [18:02:31] 10ops-magru: Power Supply - Status - issue on ganeti7004:9290 - https://phabricator.wikimedia.org/T394601#10910563 (10RobH) 05Open→03Resolved a:03RobH no longer in alarm [18:02:43] 10ops-magru: Power Supply - Status - issue on cp7003:9290 - https://phabricator.wikimedia.org/T394599#10910567 (10RobH) 05Open→03Resolved a:03RobH See T395830 [18:02:49] 10ops-magru: Power Supply - PS Redundancy - issue on cp7010:9290 - https://phabricator.wikimedia.org/T394597#10910573 (10RobH) 05Open→03Resolved a:03RobH T395830 [18:03:28] !log brennen@deploy1003 brennen: Continuing with sync [18:03:50] (tested above, fixes a known-broken page.) 
[18:04:22] !log jasmine@dns1004 START - running authdns-update [18:04:51] 10ops-magru: Power Supply - Status - issue on cp7004:9290 - https://phabricator.wikimedia.org/T394600#10910582 (10RobH) 05Open→03Resolved a:03RobH T395830 [18:05:33] !log jasmine@dns1004 START - running authdns-update [18:06:32] !log jasmine@dns1004 END - running authdns-update [18:06:36] 10ops-magru: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T390258#10910595 (10RobH) @ayounsi or @cmooney, When I check the link above it shows 'no data' for interface errors, so not sure what we should be checking here? [18:07:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P77863 and previous config saved to /var/cache/conftool/dbconfig/20250612-180733-fceratto.json [18:08:11] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: SSD firmware update for frbackup2002 - https://phabricator.wikimedia.org/T396649#10910608 (10RobH) 05Open→03Resolved a:03RobH Thanks for handling this! 
[18:08:19] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:09:45] (03CR) 10CI reject: [V:04-1] Fix for multiple charts on same page using mix of transforms [extensions/JsonConfig] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156421 (https://phabricator.wikimedia.org/T396512) (owner: 10Bvibber) [18:10:19] !log brennen@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156404|ParserOutput::collectMetadata: Cast array keys to string (T396656)]] (duration: 10m 51s) [18:10:22] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [18:10:23] T396656: TypeError: MediaWiki\Parser\ParserOutput::appendJsConfigVar(): Argument #2 ($value) must be of type string, int given - https://phabricator.wikimedia.org/T396656 [18:10:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10910618 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... 
[18:14:21] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156422 (https://phabricator.wikimedia.org/T392175) [18:14:22] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156422 (https://phabricator.wikimedia.org/T392175) (owner: 10TrainBranchBot) [18:15:08] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156422 (https://phabricator.wikimedia.org/T392175) (owner: 10TrainBranchBot) [18:22:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P77864 and previous config saved to /var/cache/conftool/dbconfig/20250612-182241-fceratto.json [18:24:39] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.5 refs T392175 [18:24:43] T392175: 1.45.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T392175 [18:25:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:25:35] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [18:25:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10910668 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... [18:26:31] rolling back. 
[18:27:00] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156425 (https://phabricator.wikimedia.org/T392175) [18:27:01] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156425 (https://phabricator.wikimedia.org/T392175) (owner: 10TrainBranchBot) [18:27:53] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156425 (https://phabricator.wikimedia.org/T392175) (owner: 10TrainBranchBot) [18:28:45] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:30:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:31:55] jasmine_: thank you. tj.wikipedia.org -> tg.wikipedia.org works for me. gj! [18:32:56] nicely done, jasmine_! :) [18:33:44] 06SRE, 10DNS, 06serviceops, 06Traffic: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803#10910716 (10Dzahn) works for me now. thanks @jasmine_ for deploying my patches!:) ` curl -vv https://tj.wikipedia.org ..

The document has moved (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156421 (https://phabricator.wikimedia.org/T396512) (owner: 10Bvibber) [19:07:23] RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:08:51] (03CR) 10Bvibber: "recheck" [extensions/JsonConfig] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156421 (https://phabricator.wikimedia.org/T396512) (owner: 10Bvibber) [19:10:14] (03CR) 10Bvibber: "recheck" [extensions/JsonConfig] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156421 (https://phabricator.wikimedia.org/T396512) (owner: 10Bvibber) [19:10:36] (03CR) 10CI reject: [V:04-1] Fix for multiple charts on same page using mix of transforms [extensions/JsonConfig] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1156420 (https://phabricator.wikimedia.org/T396512) (owner: 10Bvibber) [19:12:50] (03CR) 10Bvibber: "recheck" [extensions/JsonConfig] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156421 (https://phabricator.wikimedia.org/T396512) (owner: 10Bvibber) [19:18:09] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1015.eqiad.wmnet'] [19:18:10] (03CR) 10CI reject: [V:04-1] Fix for multiple charts on same page using mix of transforms [extensions/JsonConfig] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156421 (https://phabricator.wikimedia.org/T396512) (owner: 10Bvibber) [19:18:29] vriley@cumin1002 reimage (PID 1640111) is awaiting input [19:18:34] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1015.eqiad.wmnet'] [19:18:46] !log vriley@cumin1002 END 
(FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [19:18:58] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1186.eqiad.wmnet with OS bullseye [19:18:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10910818 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... [19:19:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10910820 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1186.eqiad.wmnet with OS bulls... [19:19:29] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1015.eqiad.wmnet'] [19:20:32] FIRING: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:22:49] (03CR) 10Bvibber: "recheck" [extensions/JsonConfig] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156421 (https://phabricator.wikimedia.org/T396512) (owner: 10Bvibber) [19:27:05] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1015.eqiad.wmnet'] [19:31:37] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1015.eqiad.wmnet with OS bullseye [19:44:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - 
https://phabricator.wikimedia.org/T377878#10910933 (10VRiley-WMF) @ayounsi Decomming now [19:47:38] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1015.eqiad.wmnet with reason: host reimage [19:51:02] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1015.eqiad.wmnet with reason: host reimage [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T2000). [20:00:04] EggRoll97 and anzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:03:27] o/ [20:05:12] 10SRE-tools, 06Infrastructure-Foundations: decommission cookbook: add support for decom spreadsheet - https://phabricator.wikimedia.org/T244315#10910987 (10wiki_willy) Hey @Volans - I think we've come up with a couple solutions since this task was created. One is providing a monthly Netbox dump to the Account... 
[20:08:34] (03PS2) 10Andrew Bogott: cloudcephosd1016 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156303 (https://phabricator.wikimedia.org/T309789) [20:08:34] (03PS2) 10Andrew Bogott: cloudcephosd1017 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156304 (https://phabricator.wikimedia.org/T309789) [20:08:34] (03PS2) 10Andrew Bogott: cloudcephosd1018 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156305 (https://phabricator.wikimedia.org/T309789) [20:08:35] (03PS2) 10Andrew Bogott: cloudcephosd1019 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156306 (https://phabricator.wikimedia.org/T309789) [20:08:36] (03PS2) 10Andrew Bogott: cloudcephosd1020 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156307 (https://phabricator.wikimedia.org/T309789) [20:08:37] (03PS1) 10Andrew Bogott: Update nic names for cloudcephosd1015/Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1156437 [20:10:43] (03CR) 10Andrew Bogott: [C:03+2] Update nic names for cloudcephosd1015/Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1156437 (owner: 10Andrew Bogott) [20:10:48] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#10911002 (10Jhancock.wm) i got some new servers i can test it out on real quick. i think pxe order might be one of those things that needs a reboot. I'll report back. [20:13:49] (03CR) 10BryanDavis: "Cause of https://phabricator.wikimedia.org/T396732. Cloud VPS doesn't pick up hieradata/role content." 
[puppet] - 10https://gerrit.wikimedia.org/r/1155609 (owner: 10Muehlenhoff) [20:15:38] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1015.eqiad.wmnet with OS bullseye [20:18:22] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1016.eqiad.wmnet'] [20:21:00] (03CR) 10Bvibber: "recheck" [extensions/JsonConfig] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156421 (https://phabricator.wikimedia.org/T396512) (owner: 10Bvibber) [20:21:04] anzx: I can deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1155930 for you since it looks straightforward to me [20:21:36] dancy: ok [20:22:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155930 (https://phabricator.wikimedia.org/T396128) (owner: 10Anzx) [20:23:15] (03Merged) 10jenkins-bot: enwiki: temporary lift of IP cap for event on 16 June 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155930 (https://phabricator.wikimedia.org/T396128) (owner: 10Anzx) [20:23:41] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1155930|enwiki: temporary lift of IP cap for event on 16 June 2025 (T396128)]] [20:23:45] T396128: Requesting temporary lift of IP cap for 16 June 2025 - https://phabricator.wikimedia.org/T396128 [20:25:53] !log dancy@deploy1003 dancy, anzx: Backport for [[gerrit:1155930|enwiki: temporary lift of IP cap for event on 16 June 2025 (T396128)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:25:56] dancy: nothing to test , please continue with sync [20:26:02] ok [20:26:35] !log dancy@deploy1003 dancy, anzx: Continuing with sync [20:31:38] (03CR) 10Bvibber: "recheck" [extensions/JsonConfig] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1156420 (https://phabricator.wikimedia.org/T396512) (owner: 10Bvibber) [20:33:36] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155930|enwiki: temporary lift of IP cap for event on 16 June 2025 (T396128)]] (duration: 09m 54s) [20:33:40] T396128: Requesting temporary lift of IP cap for 16 June 2025 - https://phabricator.wikimedia.org/T396128 [20:33:52] dancy: thank you for deploying [20:33:59] (03CR) 10Jgreen: Change DMARC aggregate report address for donate.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1156352 (https://phabricator.wikimedia.org/T394788) (owner: 10Jgreen) [20:34:15] You're welcome. I'll need to leave the other changes to someone who understands those bits better. [20:34:46] yeah sure [20:35:01] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1016.eqiad.wmnet'] [20:36:20] !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1016.eqiad.wmnet'] [20:36:24] ok i got my backports through ci :D [20:37:04] anybody else deploying or cool to run a couple mediawiki bugfixes? 
:D [20:37:55] i'll sneak em in real quick :D [20:38:30] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1156420|Fix for multiple charts on same page using mix of transforms (T396512)]], [[gerrit:1156421|Fix for multiple charts on same page using mix of transforms (T396512)]] [20:38:34] T396512: Modifications made to tabular data in Lua transforms erroneously carry over to other charts on the same page - https://phabricator.wikimedia.org/T396512 [20:40:39] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1156420|Fix for multiple charts on same page using mix of transforms (T396512)]], [[gerrit:1156421|Fix for multiple charts on same page using mix of transforms (T396512)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:41:22] !log bvibber@deploy1003 bvibber: Continuing with sync [20:41:39] !log andrew@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1016.eqiad.wmnet'] [20:42:04] 10ops-codfw, 06SRE, 06DC-Ops: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363#10911071 (10Andrew) cloudcephosd2001-dev is now drained and you can unplug things whenever. 
[20:43:13] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1016.eqiad.wmnet with OS bullseye [20:44:45] (03PS3) 10Andrew Bogott: cloudcephosd1017 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156304 (https://phabricator.wikimedia.org/T309789) [20:44:45] (03PS3) 10Andrew Bogott: cloudcephosd1018 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156305 (https://phabricator.wikimedia.org/T309789) [20:44:45] (03PS3) 10Andrew Bogott: cloudcephosd1019 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156306 (https://phabricator.wikimedia.org/T309789) [20:44:46] (03PS3) 10Andrew Bogott: cloudcephosd1020 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156307 (https://phabricator.wikimedia.org/T309789) [20:44:47] (03PS1) 10Andrew Bogott: Update cloudcephosd1016 with probably new nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156441 [20:46:06] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd1016 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156303 (https://phabricator.wikimedia.org/T309789) (owner: 10Andrew Bogott) [20:46:42] (03CR) 10Andrew Bogott: [C:03+2] Update cloudcephosd1016 with probably new nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156441 (owner: 10Andrew Bogott) [20:48:20] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156420|Fix for multiple charts on same page using mix of transforms (T396512)]], [[gerrit:1156421|Fix for multiple charts on same page using mix of transforms (T396512)]] (duration: 09m 50s) [20:48:25] T396512: Modifications made to tabular data in Lua transforms erroneously carry over to other charts on the same page - https://phabricator.wikimedia.org/T396512 [20:48:40] whee done [20:51:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:51:47] (03PS4) 10Andrew Bogott: cloudcephosd1018 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156305 (https://phabricator.wikimedia.org/T309789) [20:51:47] (03PS4) 10Andrew Bogott: cloudcephosd1019 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156306 (https://phabricator.wikimedia.org/T309789) [20:51:47] (03PS4) 10Andrew Bogott: cloudcephosd1020 -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1156307 (https://phabricator.wikimedia.org/T309789) [20:51:48] (03PS1) 10Andrew Bogott: Update cloudcephosd1017 with probably new nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156444 [20:51:49] (03PS1) 10Andrew Bogott: Update cloudcephosd1018 with probably new nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156445 [20:51:50] (03PS1) 10Andrew Bogott: Update cloudcephosd1019 with probably new nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156446 [20:51:54] (03PS1) 10Andrew Bogott: Update cloudcephosd1020 with probably new nic names [puppet] - 10https://gerrit.wikimedia.org/r/1156447 [20:58:11] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: host reimage [21:00:07] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250612T2100) [21:01:34] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1016.eqiad.wmnet with reason: host reimage [21:11:00] Hey all, mstyles and I would like to deploy a few sec patches now, unless we should hold off... [21:15:05] sbassett: from train conductor end, seems fine. no sign of a patch for the train blocker. 
[21:17:22] tx, brennen [21:18:30] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1016.eqiad.wmnet with OS bullseye [21:24:58] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host cassandra-dev2001.codfw.wmnet with OS bullseye [21:25:11] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10911155 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2001.... [21:30:10] FIRING: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:31:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.753s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:34:25] preparing to run scap for a security deploy [21:37:38] (03PS1) 10Bvibber: Specify Lua transform arguments on {{#chart:}} invocations [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1156454 (https://phabricator.wikimedia.org/T395610) [21:37:54] (03PS1) 10Bvibber: Specify Lua transform arguments on {{#chart:}} invocations [extensions/Chart] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156455 (https://phabricator.wikimedia.org/T395610) [21:39:39] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [21:39:53] i'll wait :D 
[21:41:54] should be wrapped up in 10 bvibber [21:43:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.185s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:43:19] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2001.codfw.wmnet with reason: host reimage [21:45:10] running scap currently [21:48:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.946s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:53:56] !log Deploy security fix for T396524 [21:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:09] two more scaps! 
[22:00:47] !log Deployed security fix for T396413 [22:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:21] !log Deploy security fix for T394863 [22:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:52] finished with the security deploy [22:08:24] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:08:34] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:09:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-drmrs (185.15.58.139) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr1-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:10:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:11:22] thx maryum [22:11:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1156454 (https://phabricator.wikimedia.org/T395610) (owner: 10Bvibber) [22:11:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156455 (https://phabricator.wikimedia.org/T395610) (owner: 10Bvibber) [22:12:58] (03Merged) 10jenkins-bot: Specify Lua transform arguments on {{#chart:}} 
invocations [extensions/Chart] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1156454 (https://phabricator.wikimedia.org/T395610) (owner: 10Bvibber) [22:13:04] (03Merged) 10jenkins-bot: Specify Lua transform arguments on {{#chart:}} invocations [extensions/Chart] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1156455 (https://phabricator.wikimedia.org/T395610) (owner: 10Bvibber) [22:13:30] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1156454|Specify Lua transform arguments on {{#chart:}} invocations (T395610)]], [[gerrit:1156455|Specify Lua transform arguments on {{#chart:}} invocations (T395610)]] [22:13:34] T395610: Spike: expose {{#chart:}} invocation parameters to Lua transforms - https://phabricator.wikimedia.org/T395610 [22:14:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:14:50] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Thu 10 Jul 2025 09:40:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [22:18:55] (03PS1) 10Zabe: maintain-views: Add categorylinks to linktarget table filter [puppet] - 10https://gerrit.wikimedia.org/r/1156466 (https://phabricator.wikimedia.org/T352879) [22:23:51] (03PS2) 10Zabe: maintain-views: Add categorylinks to linktarget table filter [puppet] - 10https://gerrit.wikimedia.org/r/1156466 (https://phabricator.wikimedia.org/T299951) [22:25:55] (03CR) 10Ladsgroup: "LGTM, while we are here, can you add exitencelinks too? It's fine if not possible. 
We forgot about it 😄" [puppet] - 10https://gerrit.wikimedia.org/r/1156466 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [22:27:53] (03CR) 10Ladsgroup: "I5ede34d53ba" [puppet] - 10https://gerrit.wikimedia.org/r/1156466 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [22:29:31] (03PS3) 10Zabe: maintain-views: Update linktarget table filter [puppet] - 10https://gerrit.wikimedia.org/r/1156466 (https://phabricator.wikimedia.org/T14019) [22:29:48] (03CR) 10Zabe: "Sure" [puppet] - 10https://gerrit.wikimedia.org/r/1156466 (https://phabricator.wikimedia.org/T14019) (owner: 10Zabe) [22:30:41] (03PS3) 10Jforrester: tables-catalogue: List wikifunctionsclient_usage [puppet] - 10https://gerrit.wikimedia.org/r/1156365 [22:30:43] (03CR) 10Ladsgroup: [C:03+2] tables-catalogue: List wikifunctionsclient_usage [puppet] - 10https://gerrit.wikimedia.org/r/1156365 (owner: 10Jforrester) [22:30:45] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalogue: List wikifunctionsclient_usage [puppet] - 10https://gerrit.wikimedia.org/r/1156365 (owner: 10Jforrester) [22:32:36] (03PS4) 10Zabe: maintain-views: Update linktarget table filter [puppet] - 10https://gerrit.wikimedia.org/r/1156466 (https://phabricator.wikimedia.org/T14019) [22:32:42] (03CR) 10Ladsgroup: [V:03+2 C:03+2] maintain-views: Update linktarget table filter [puppet] - 10https://gerrit.wikimedia.org/r/1156466 (https://phabricator.wikimedia.org/T14019) (owner: 10Zabe) [22:33:07] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1156466 (https://phabricator.wikimedia.org/T14019) (owner: 10Zabe) [22:37:55] (03CR) 10Dwisehaupt: [C:03+1] Change DMARC aggregate report address for donate.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1156352 (https://phabricator.wikimedia.org/T394788) (owner: 10Jgreen) [22:38:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911367 (10Ladsgroup) Thanks! I've started mariadb and now it's catching up, I'll repool it once it's caught up. [22:38:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1254 (T396648)', diff saved to https://phabricator.wikimedia.org/P77867 and previous config saved to /var/cache/conftool/dbconfig/20250612-223834-ladsgroup.json [22:38:40] T396648: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648 [22:39:54] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1254.eqiad.wmnet with reason: Firmware upgrade (T396648) [22:42:42] !log ladsgroup@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1254.eqiad.wmnet [22:43:07] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1254.eqiad.wmnet [22:43:48] !log ladsgroup@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1254.eqiad.wmnet with reason: Firmware upgrade (T396648) [22:43:52] T396648: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648 [22:44:11] !log ladsgroup@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1254.eqiad.wmnet [22:45:30] !log ladsgroup@cumin2002 START - Cookbook sre.hosts.reboot-single for host db1254.eqiad.wmnet [22:46:28] hm this sync is taking a long time compared to others from today [22:46:41] it's stuck on syncing one of the 
baremetal testservers? [22:51:00] bvibber: if the patch touches i18n cache, the rebuild might take a very long time. That could be it but it's not the usual i18n files. So I might be very wrong [22:52:04] yeah that'll do it [22:52:13] adds a magic word that probably messes it ;) [22:58:51] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2001.codfw.wmnet with OS bullseye [22:59:06] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10911383 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2001.codf... [22:59:49] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1156454|Specify Lua transform arguments on {{#chart:}} invocations (T395610)]], [[gerrit:1156455|Specify Lua transform arguments on {{#chart:}} invocations (T395610)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:59:53] T395610: Spike: expose {{#chart:}} invocation parameters to Lua transforms - https://phabricator.wikimedia.org/T395610 [23:00:15] confirmed works [23:00:38] !log bvibber@deploy1003 bvibber: Continuing with sync [23:00:56] !log ladsgroup@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db1254.eqiad.wmnet [23:01:25] !log ladsgroup@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts db1254.eqiad.wmnet [23:06:15] !log ladsgroup@cumin2002 START - Cookbook sre.mysql.pool db1254 gradually with 4 steps - Firmware update done [23:06:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911397 (10ops-monitoring-bot) Start pool of db1254 gradually with 4 steps - Firmware update done - ladsgroup@cumin2002 [23:07:17] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911398 (10Ladsgroup) db1254 should be done now. I'm repooling it. 
The failure of the firmware cookbook was because of replag not being happy (caught up by the time) [23:07:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911400 (10Ladsgroup) [23:13:02] !log ladsgroup@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1252.eqiad.wmnet with reason: Firmware update [23:14:48] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156454|Specify Lua transform arguments on {{#chart:}} invocations (T395610)]], [[gerrit:1156455|Specify Lua transform arguments on {{#chart:}} invocations (T395610)]] (duration: 61m 18s) [23:14:52] T395610: Spike: expose {{#chart:}} invocation parameters to Lua transforms - https://phabricator.wikimedia.org/T395610 [23:20:32] FIRING: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:20:46] !log ladsgroup@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1252.eqiad.wmnet [23:21:31] !log ladsgroup@cumin2002 START - Cookbook sre.hosts.reboot-single for host db1252.eqiad.wmnet [23:36:22] !log ladsgroup@cumin1002 START - Cookbook sre.wikireplicas.update-views [23:36:46] !log ladsgroup@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db1252.eqiad.wmnet [23:37:02] !log ladsgroup@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts db1252.eqiad.wmnet [23:37:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911467 (10Ladsgroup) [23:38:47] (03PS1) 10TrainBranchBot: Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1156489 [23:38:47] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1156489 (owner: 10TrainBranchBot) [23:43:15] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [23:43:25] !log ladsgroup@cumin2002 START - Cookbook sre.mysql.pool db1253 gradually with 4 steps - Firmware updated [23:43:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911481 (10ops-monitoring-bot) Start pool of db1253 gradually with 4 steps - Firmware updated - ladsgroup@cumin2002 [23:46:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911485 (10Ladsgroup) [23:47:47] !log ladsgroup@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1251.eqiad.wmnet with reason: Firmware update [23:48:57] !log ladsgroup@cumin2002 dbctl commit (dc=all): 'Depool db1251 for firmware update (T396648)', diff saved to https://phabricator.wikimedia.org/P77872 and previous config saved to /var/cache/conftool/dbconfig/20250612-234855-ladsgroup.json [23:49:00] T396648: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648 [23:51:56] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1156489 (owner: 10TrainBranchBot) [23:53:04] !log ladsgroup@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1251.eqiad.wmnet [23:53:41] !log ladsgroup@cumin2002 START - Cookbook sre.hosts.reboot-single for host db1251.eqiad.wmnet [23:54:27] !log ladsgroup@cumin2002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1254 gradually with 4 steps - Firmware update done 
[23:54:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: SSD firmware update for db125[0-4] - https://phabricator.wikimedia.org/T396648#10911495 (10ops-monitoring-bot) Completed pool of db1254 gradually with 4 steps - Firmware update done - ladsgroup@cumin2002