[00:00:04] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:00:52] (CR) Raymond Ndibe: "This has already been tested on the toolsbeta cluster node. The reason it's still up was because we had a discussion and decided to snoop " [puppet] - https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T374193) (owner: Raymond Ndibe) [00:09:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:10:04] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:10:28] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:10:34] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:25:04] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:25:28] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:25:44] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:30:44] RECOVERY - Router interfaces on 
cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:31:04] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:31:28] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:38:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1115140 [00:38:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1115140 (owner: TrainBranchBot) [00:50:32] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1115140 (owner: TrainBranchBot) [01:08:21] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1115143 [01:08:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1115143 (owner: TrainBranchBot) [01:13:29] SRE, Infrastructure-Foundations, Traffic: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10507045 (Scott_French) Thank you both! Great, the list in T381904#10502098 is consistent with what I have from when I did the correlation with Search Console properties i... 
[01:30:23] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1115143 (owner: TrainBranchBot) [01:54:45] (PS1) Zabe: Restrict editing on ruwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1115146 (https://phabricator.wikimedia.org/T382805) [01:55:28] (PS2) Zabe: Restrict editing on ruwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1115146 (https://phabricator.wikimedia.org/T382805) [01:56:08] (PS3) Zabe: Restrict editing on ruwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1115146 (https://phabricator.wikimedia.org/T382805) [01:56:11] (CR) CI reject: [V:-1] Restrict editing on ruwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1115146 (https://phabricator.wikimedia.org/T382805) (owner: Zabe) [02:10:34] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 4.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:16:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [02:16:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T384592)', diff saved to https://phabricator.wikimedia.org/P72822 and previous config saved to /var/cache/conftool/dbconfig/20250130-021645-marostegui.json [02:16:51] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:44:52] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [02:44:58] !log swfrench@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/shellbox-video: apply [02:45:51] !log scaled down shellbox-video/migration after switch to PHP 8.1 - T377038 [02:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:57] T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038 [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:05:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T384592)', diff saved to https://phabricator.wikimedia.org/P72823 and previous config saved to /var/cache/conftool/dbconfig/20250130-040548-marostegui.json [04:05:54] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [04:20:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P72824 and previous config saved to /var/cache/conftool/dbconfig/20250130-042054-marostegui.json [04:36:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P72825 and previous config saved to /var/cache/conftool/dbconfig/20250130-043601-marostegui.json [04:38:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:39:04] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:51:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T384592)', diff saved to https://phabricator.wikimedia.org/P72826 and previous config saved to /var/cache/conftool/dbconfig/20250130-045108-marostegui.json [04:51:14] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [04:51:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [04:51:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [04:51:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T384592)', diff saved to https://phabricator.wikimedia.org/P72827 and previous config saved to /var/cache/conftool/dbconfig/20250130-045147-marostegui.json [04:55:04] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:55:48] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:19:54] !log depooling lvs4009 prior to reimaging as a liberica load balancer - T384477 [05:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:59] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [05:20:30] 
PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:20:57] ^^ that's a side effect of depooling lvs4009, 100% expected [05:21:00] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:21:21] (CR) Vgutierrez: [C:+2] site,hiera: Reimage lvs4009 as role(liberica) [puppet] - https://gerrit.wikimedia.org/r/1115075 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [05:22:26] PROBLEM - pybal on lvs4009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [05:22:58] PROBLEM - PyBal backends health check on lvs4009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [05:23:18] PROBLEM - PyBal connections to etcd on lvs4009 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [05:44:18] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100% [05:46:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:48:10] (PS1) Vgutierrez: hiera,lvs4009: Add bullseye PNI name to unblock reimage [puppet] - https://gerrit.wikimedia.org/r/1115159 (https://phabricator.wikimedia.org/T384477) [05:48:55] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115159 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [05:49:20] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 37.51 ms [05:50:03] (CR) Marostegui: [C:+1] hiera,lvs4009: Add 
bullseye PNI name to unblock reimage [puppet] - https://gerrit.wikimedia.org/r/1115159 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [05:50:34] PROBLEM - MariaDB Replica SQL: s3 on db2139 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table page_props is corrupt: try to repair it on query. Default database: aswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:51:08] ^ that host isn't in production [05:51:48] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:51:52] It will be decommissioned anyway [05:52:17] (PS2) Vgutierrez: hiera,lvs4009: Add bullseye PNI name to unblock reimage [puppet] - https://gerrit.wikimedia.org/r/1115159 (https://phabricator.wikimedia.org/T384477) [05:52:33] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115159 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [05:54:03] (CR) Vgutierrez: [C:+2] hiera,lvs4009: Add bullseye PNI name to unblock reimage [puppet] - https://gerrit.wikimedia.org/r/1115159 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [05:57:07] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs4009.ulsfo.wmnet with OS bookworm [06:05:41] FIRING: JobUnavailable: Reduced availability for job pybal in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:11:19] (PS1) Vgutierrez: Revert "hiera,lvs4009: Add bullseye PNI name to unblock reimage" [puppet] - https://gerrit.wikimedia.org/r/1115161 [06:14:26] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime 
for 2:00:00 on lvs4009.ulsfo.wmnet with reason: host reimage [06:17:54] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4009.ulsfo.wmnet with reason: host reimage [06:20:23] (CR) Vgutierrez: [C:+2] Revert "hiera,lvs4009: Add bullseye PNI name to unblock reimage" [puppet] - https://gerrit.wikimedia.org/r/1115161 (owner: Vgutierrez) [06:29:53] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:30:41] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:35:01] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:35:28] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [06:35:29] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1250.eqiad.wmnet with OS bookworm [06:35:30] !log vriley@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [06:35:31] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:35:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1251.eqiad.wmnet with OS bookworm [06:35:39] ops-eqiad, SRE, Data-Persistence, DC-Ops: 
Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10507249 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host db1250.eqiad.wmnet with OS bookworm completed: - db1250 (**PASS**) - Rem... [06:35:40] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10507250 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host db1251.eqiad.wmnet with OS bookworm completed: - db1251 (**WARN**) - Rem... [06:36:18] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10507251 (VRiley-WMF) [06:38:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T384592)', diff saved to https://phabricator.wikimedia.org/P72828 and previous config saved to /var/cache/conftool/dbconfig/20250130-063849-marostegui.json [06:38:55] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [06:41:55] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:41:59] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4009.ulsfo.wmnet with OS bookworm [06:50:53] (PS1) Vgutierrez: hiera,lvs4009: Restore BGP priority [puppet] - https://gerrit.wikimedia.org/r/1115232 (https://phabricator.wikimedia.org/T384477) [06:51:15] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115232 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [06:51:47] (PS2) Vgutierrez: hiera,lvs4009: Restore BGP priority [puppet] - https://gerrit.wikimedia.org/r/1115232 (https://phabricator.wikimedia.org/T384477) [06:51:56] (CR) 
Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115232 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [06:53:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P72829 and previous config saved to /var/cache/conftool/dbconfig/20250130-065356-marostegui.json [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T0700) [07:00:05] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T0700). [07:03:59] (PS1) Vgutierrez: lvs: Fix puppet compiler error on missing NIC [puppet] - https://gerrit.wikimedia.org/r/1115233 (https://phabricator.wikimedia.org/T384477) [07:04:59] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115233 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [07:06:58] (PS2) Vgutierrez: lvs: Fix puppet compiler error on missing NIC [puppet] - https://gerrit.wikimedia.org/r/1115233 (https://phabricator.wikimedia.org/T384477) [07:09:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P72830 and previous config saved to /var/cache/conftool/dbconfig/20250130-070903-marostegui.json [07:11:03] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115232 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [07:11:11] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115233 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [07:12:47] (CR) Vgutierrez: [C:+2] hiera,lvs4009: Restore BGP priority [puppet] - https://gerrit.wikimedia.org/r/1115232 
(https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [07:13:20] !log repooling lvs4009 after reimaging as a liberica load balancer - T384477 [07:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:24] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [07:23:58] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115233 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez) [07:24:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T384592)', diff saved to https://phabricator.wikimedia.org/P72831 and previous config saved to /var/cache/conftool/dbconfig/20250130-072410-marostegui.json [07:24:16] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [07:24:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [07:24:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T384592)', diff saved to https://phabricator.wikimedia.org/P72832 and previous config saved to /var/cache/conftool/dbconfig/20250130-072432-marostegui.json [07:30:54] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2030.codfw.wmnet with reason: remove from cluster for reimage [07:31:06] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10507313 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=83262e5b-e9b2-4d97-bd96-7e9d851edd21) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [07:35:20] ACKNOWLEDGEMENT - MariaDB Replica SQL: s3 on db2139 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table page_props is corrupt: try to repair it on query. 
Default database: aswiki. [Query snipped] Marostegui Host will be decommissioned https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:36:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2031.codfw.wmnet to cluster codfw and group B [07:38:23] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2031.codfw.wmnet to cluster codfw and group B [07:42:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1161 T384994', diff saved to https://phabricator.wikimedia.org/P72833 and previous config saved to /var/cache/conftool/dbconfig/20250130-074228-marostegui.json [07:42:33] T384994: Upgrade and rebuild s5 - https://phabricator.wikimedia.org/T384994 [07:42:39] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti2030.codfw.wmnet [07:42:42] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1161.eqiad.wmnet [07:45:17] PROBLEM - MariaDB Replica IO: s5 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1161.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1161.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:45:24] ^ me, fixing [07:45:52] !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1154.eqiad.wmnet with reason: Rebuild and upgrade db1166 [07:46:40] !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020].eqiad.wmnet with reason: Rebuild and upgrade db1166 [07:47:26] !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Rebuild and upgrade db1166 [07:48:47] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade 
(exit_code=0) for db1161.eqiad.wmnet [07:49:17] RECOVERY - MariaDB Replica IO: s5 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:49:44] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Index rebuild [07:51:41] (PS1) Marostegui: db1161: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1115313 [07:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:53:17] (CR) Marostegui: [C:+2] db1161: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1115313 (owner: Marostegui) [08:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T0800) [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:51] ah cool [08:09:42] !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Rebuild and upgrade dbstore1007:s4 [08:14:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [08:17:54] (CR) Volans: Class-of-service: don't insert comment with host name under cos/ints (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1115134 (https://phabricator.wikimedia.org/T379549) (owner: Cathal Mooney) [08:18:51] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10507353 (MoritzMuehlenhoff) [08:24:27] (CR) Volans: [C:+1] "LGTM, PCC happy:" [puppet] - https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T374193) (owner: Raymond Ndibe) [08:24:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [08:24:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti2030.codfw.wmnet [08:25:39] (CR) Muehlenhoff: [C:+2] Switch ganeti2030 to nftables [puppet] - https://gerrit.wikimedia.org/r/1115057 (owner: Muehlenhoff) [08:30:53] (CR) Volans: Network: add qos and sflow config for configure-switch-interfaces (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) (owner: Cathal Mooney) [08:32:10] (CR) Volans: [C:+1] "Sure, I'm not familiar with toolforge updates but I guess there is a standard practice workflow to follow :)" [puppet] - https://gerrit.wikimedia.org/r/1114007 (https://phabricator.wikimedia.org/T384720) (owner: Raymond Ndibe) [08:33:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2030.codfw.wmnet with OS bookworm [08:33:52] SRE, Ganeti, Infrastructure-Foundations: 
Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10507374 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bookworm [08:40:57] (PS4) Fabfur: benthos: send data to eventgate too [puppet] - https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) [08:44:15] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur) [08:44:28] (CR) Muehlenhoff: [C:+2] postgresql::dirs: Use wmflib::debian_postgresql_version() [puppet] - https://gerrit.wikimedia.org/r/1108710 (owner: Muehlenhoff) [08:45:32] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2095,2175,2186].codfw.wmnet [08:45:35] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker[2095,2175,2186].codfw.wmnet [08:45:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker[2095,2175,2186].codfw.wmnet [08:45:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2095,2175,2186].codfw.wmnet [08:45:42] (PS4) Jcrespo: backup: Temporary setup of backup101[34], backup201[34] [puppet] - https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) [08:45:47] ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10507417 (ops-monitoring-bot) pool host wikikube-worker[2095,2175,2186].codfw.wmnet by jayme@cumin1002 with re... 
[08:45:49] ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10507421 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 pool for... [08:45:51] (PS1) KartikMistry: Update MinT to 2025-01-30-080456-production [deployment-charts] - https://gerrit.wikimedia.org/r/1115314 (https://phabricator.wikimedia.org/T383750) [08:48:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2030.codfw.wmnet with OS bookworm [08:48:24] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10507436 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bookworm executed with errors:... [08:49:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2030.codfw.wmnet with OS bookworm [08:49:10] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10507437 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bookworm [08:52:09] (PS1) Muehlenhoff: wmcs::services::postgres::primary: Use wmflib::debian_postgresql_version() [puppet] - https://gerrit.wikimedia.org/r/1115316 [08:52:29] (CR) Jcrespo: [C:+2] backup: Temporary setup of backup101[34], backup201[34] [puppet] - https://gerrit.wikimedia.org/r/1115020 (https://phabricator.wikimedia.org/T384977) (owner: Jcrespo) [08:55:00] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1115316 (owner: Muehlenhoff) [08:55:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 10%: Repooling after rebuild index', 
diff saved to https://phabricator.wikimedia.org/P72834 and previous config saved to /var/cache/conftool/dbconfig/20250130-085502-root.json [08:58:26] ops-eqiad, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10507448 (jcrespo) [09:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T0900) [09:01:53] I need to check whether it is fine to promote group 1 wikis [09:02:07] hashar: we didn't deploy yesterday because of a train blocker so if you wanna roll forward that would be great! [09:02:20] jeena: oh I would! :) [09:02:30] Hehe thanks! [09:02:34] jeena: and I am quite happy you are still awake to confirm, that is a time saver! :b [09:02:42] I'll handle it! [09:02:51] Lol I'm having a midnight snack [09:04:36] (PS1) Elukey: services: update cpu requirements for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1115320 (https://phabricator.wikimedia.org/T216826) [09:04:51] (PS1) TrainBranchBot: group1 to 1.44.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/1115321 (https://phabricator.wikimedia.org/T382365) [09:04:52] (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/1115321 (https://phabricator.wikimedia.org/T382365) (owner: TrainBranchBot) [09:04:58] jeena: happy snack! [09:05:21] Thanks for taking the train! 
[09:05:32] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115321 (https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot) [09:06:45] (03CR) 10Elukey: [C:03+2] services: update cpu requirements for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115320 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [09:07:39] 09:07:07 K8s deployment progress: 91% (ok: 11; fail: 0; left: 1) / [09:07:58] stalled on a last one. Scap really looks like those MicroSoft Windows progress bar from the 90's [09:08:09] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [09:08:24] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [09:10:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72835 and previous config saved to /var/cache/conftool/dbconfig/20250130-091007-root.json [09:10:50] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10507490 (10jcrespo) a:05jcrespo→03None [09:11:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T384592)', diff saved to https://phabricator.wikimedia.org/P72836 and previous config saved to /var/cache/conftool/dbconfig/20250130-091123-marostegui.json [09:11:29] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [09:12:22] (03PS1) 10Elukey: admin_ng: enforce restricted PSS on ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115322 (https://phabricator.wikimedia.org/T369493) [09:12:23] (03PS1) 10Elukey: admin_ng: disable PSP binding for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115323 (https://phabricator.wikimedia.org/T369493) [09:16:19] !log hashar@deploy2002 
rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.14 refs T382365 [09:16:24] T382365: 1.44.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T382365 [09:17:32] 09:16:18 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.14 refs T382365 [09:17:32] 09:16:19 deploy-promote failed: Command '['/usr/bin/scap', 'sync-wikiversions', 'group1 to 1.44.0-wmf.14 refs T382365']' returned non-zero exit status 1. (scap version: 4.134.0) [09:17:34] bummmer [09:17:53] ah the actual error is above: 1 proxies had sync errors [09:18:02] ssh: Could not resolve hostname mw2410.codfw.wmnet [09:18:17] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2030.codfw.wmnet with reason: host reimage [09:18:49] * hashar points at DNS [09:20:20] (03PS2) 10Arthur taylor: Remove `tmpAlwaysShowMulLanguageCode` temporary setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115006 (https://phabricator.wikimedia.org/T330217) [09:20:20] (03PS2) 10Arthur taylor: Add `enableMulLanguageCode` to replace `tmpEnableMulLanguageCode` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115013 (https://phabricator.wikimedia.org/T330217) [09:20:20] (03PS2) 10Arthur taylor: Remove `tmpEnableMulLanguageCode` setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217) [09:20:48] (03CR) 10Arthur taylor: Remove `tmpAlwaysShowMulLanguageCode` temporary setting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115006 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor) [09:21:08] (03CR) 10Arthur taylor: "agreed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115013 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor) [09:21:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2030.codfw.wmnet with reason: host reimage [09:22:40] <_joe_> hashar: uhm I guess someone forgot to remove that host from the scap proxies when it was converted [09:22:47] <_joe_> hnowlan: ^^ any idea about that? [09:24:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10507507 (10phaultfinder) [09:25:05] I think they went to decom mw2410.codfw.wmnet host before it got unconfigured :) [09:25:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72837 and previous config saved to /var/cache/conftool/dbconfig/20250130-092513-root.json [09:25:19] anyway that is harmless to the train [09:25:32] 16:47 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2410 to wikikube-worker2242 [09:25:33] :b [09:25:41] yeah they were too fast! 
[09:26:25] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [09:26:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P72838 and previous config saved to /var/cache/conftool/dbconfig/20250130-092630-marostegui.json [09:26:36] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [09:27:04] and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112714 empty up the list [09:28:11] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [09:28:21] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [09:28:29] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1230 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1115325 (https://phabricator.wikimedia.org/T385147) [09:29:58] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10507525 (10MatthewVernon) [09:30:56] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10507528 (10MatthewVernon) >>! In T385049#10505664, @Papaul wrote: > @MatthewVernon these are ms-be105[1-9].eqiad.wmnet or ms-fe105[1-9].eqiad.wmnet Oh, b... [09:31:27] (03CR) 10Hashar: [C:03+1] "Given mw2410 has already been renamed, that causes scap to fail when it tries to sync to the no more existent scap proxy. I imagine this " [puppet] - 10https://gerrit.wikimedia.org/r/1112714 (https://phabricator.wikimedia.org/T384196) (owner: 10Hnowlan) [09:31:54] _joe_: yeah your assumption was correct. 
The host got renamed but is still listed as a scap_proxy [09:32:06] and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112714 by hnowlan and effie would do it :) [09:32:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1230 with weight 0 T385147', diff saved to https://phabricator.wikimedia.org/P72839 and previous config saved to /var/cache/conftool/dbconfig/20250130-093221-root.json [09:32:28] T385147: Switchover s5 master (db1183 -> db1230) - https://phabricator.wikimedia.org/T385147 [09:32:41] <_joe_> hashar: yeah I don't have time to follow up on that, I'm super busy [09:32:44] (03CR) 10Ilias Sarantopoulos: [C:03+1] admin_ng: enforce restricted PSS on ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115322 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:32:49] !log marostegui@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T385147 [09:33:00] _joe_: no worries, it is harmless. Thanks for the hint ! 
[09:33:25] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1230 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1115325 (https://phabricator.wikimedia.org/T385147) (owner: 10Gerrit maintenance bot) [09:34:50] (03CR) 10Elukey: [C:03+2] admin_ng: enforce restricted PSS on ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115322 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:38:06] !log Starting s5 eqiad failover from db1183 to db1230 - T385147 [09:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:12] T385147: Switchover s5 master (db1183 -> db1230) - https://phabricator.wikimedia.org/T385147 [09:38:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115062 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [09:38:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115059 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [09:38:46] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[09:38:46] !log marostegui@cumin2002 dbctl commit (dc=all): 'Promote db1230 to s5 primary T385147', diff saved to https://phabricator.wikimedia.org/P72840 and previous config saved to /var/cache/conftool/dbconfig/20250130-093845-marostegui.json [09:39:16] PHP Deprecated: strtr(): Passing null to parameter #1 ($string) of type string is deprecated [09:39:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1183 T385147', diff saved to https://phabricator.wikimedia.org/P72841 and previous config saved to /var/cache/conftool/dbconfig/20250130-093927-marostegui.json [09:39:27] makes me wonder how that passes :b [09:39:40] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:40:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72842 and previous config saved to /var/cache/conftool/dbconfig/20250130-094018-root.json [09:40:53] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1183.eqiad.wmnet [09:41:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P72843 and previous config saved to /var/cache/conftool/dbconfig/20250130-094137-marostegui.json [09:41:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2030.codfw.wmnet with OS bookworm [09:41:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10507558 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bookworm completed: - ganeti203... 
[09:43:28] !log removed profile::ssh::server::disable_nist_kex: false from Toolforge Hiera settings (so that the defaults apply like for the rest of Cloud VPS) [09:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:43] (03PS1) 10Marostegui: Revert "db1161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1115327 [09:44:59] (03PS2) 10Marostegui: Revert "db1161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1115327 [09:45:22] (03CR) 10Marostegui: [C:03+2] Revert "db1161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1115327 (owner: 10Marostegui) [09:46:13] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1183.eqiad.wmnet [09:46:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [09:46:43] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1183.eqiad.wmnet with reason: Index rebuild [09:46:47] (03PS4) 10Arnaudb: nftables: add docker profile and forward chain [puppet] - 10https://gerrit.wikimedia.org/r/1114716 (https://phabricator.wikimedia.org/T370677) [09:46:59] (03PS6) 10Arnaudb: nftables: add nftable docker manifest [puppet] - 10https://gerrit.wikimedia.org/r/1114718 (https://phabricator.wikimedia.org/T370677) [09:47:03] (03PS4) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) [09:47:35] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1115328 (https://phabricator.wikimedia.org/T385148) [09:47:40] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1115329 (https://phabricator.wikimedia.org/T385148) [09:49:32] hmm and somehow we don't have the PHP version showing up in the logs [09:49:53] ah no it is there [09:50:56] (03PS7) 10Arnaudb: nftables: 
add nftable docker manifest [puppet] - 10https://gerrit.wikimedia.org/r/1114718 (https://phabricator.wikimedia.org/T370677) [09:50:58] (03PS5) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) [09:51:24] * hashar ah that is T384858 [09:52:31] (03CR) 10Jelto: [C:03+1] "pcc diff and change looks reasonable to me" [puppet] - 10https://gerrit.wikimedia.org/r/1114970 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:54:19] (03PS8) 10Arnaudb: nftables: add nftable docker manifest [puppet] - 10https://gerrit.wikimedia.org/r/1114718 (https://phabricator.wikimedia.org/T370677) [09:54:22] (03PS6) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) [09:54:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [09:55:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72844 and previous config saved to /var/cache/conftool/dbconfig/20250130-095524-root.json [09:55:27] (03PS7) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) [09:56:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2039.codfw.wmnet to cluster codfw and group A [09:56:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2039.codfw.wmnet to cluster codfw and group A [09:56:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to cluster codfw and group A [09:56:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T384592)', diff saved to https://phabricator.wikimedia.org/P72845 and previous config saved to 
/var/cache/conftool/dbconfig/20250130-095644-marostegui.json [09:56:50] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [09:57:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [09:57:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T384592)', diff saved to https://phabricator.wikimedia.org/P72846 and previous config saved to /var/cache/conftool/dbconfig/20250130-095706-marostegui.json [09:57:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2030.codfw.wmnet to cluster codfw and group A [09:58:28] (03CR) 10Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) (owner: 10Cathal Mooney) [10:04:33] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host frbast1002 [10:04:40] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host frbast1002 [10:05:58] (03PS1) 10Muehlenhoff: Switch ganeti2029 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1115332 [10:08:21] (03PS1) 10Hnowlan: scap: remove mw2410 from proxy list [puppet] - 10https://gerrit.wikimedia.org/r/1115333 (https://phabricator.wikimedia.org/T384196) [10:08:24] (03PS1) 10Slyngshede: Release version 0.1.1 [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1115334 [10:09:28] (03PS1) 10Filippo Giunchedi: pontoon: quote install instructions [puppet] - 10https://gerrit.wikimedia.org/r/1115335 [10:09:33] (03CR) 10Hnowlan: "I've filed a CR to *just* remove mw2410 from the list for now - we're not 100% certain about the impact of removing all proxies (but I'll " [puppet] - 10https://gerrit.wikimedia.org/r/1112714 (https://phabricator.wikimedia.org/T384196) (owner: 
10Hnowlan) [10:12:01] (03PS1) 10Urbanecm: migrateConfigToCommunity: Include an edit summary [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115336 (https://phabricator.wikimedia.org/T385024) [10:12:01] (03PS7) 10Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) [10:12:16] (03PS1) 10Urbanecm: migrateConfigToCommunity: Include an edit summary [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115337 (https://phabricator.wikimedia.org/T385024) [10:12:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115336 (https://phabricator.wikimedia.org/T385024) (owner: 10Urbanecm) [10:12:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115337 (https://phabricator.wikimedia.org/T385024) (owner: 10Urbanecm) [10:13:00] (03PS2) 10Hnowlan: scap: remove mw2410, mw1407 from proxy list [puppet] - 10https://gerrit.wikimedia.org/r/1115333 (https://phabricator.wikimedia.org/T384196) [10:14:01] (03PS3) 10Jforrester: dumps: Update legal.html file to list different licences for Wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1072268 [10:14:17] (03PS2) 10Cathal Mooney: Class-of-service: don't insert comment with host name under cos/ints [homer/public] - 10https://gerrit.wikimedia.org/r/1115134 (https://phabricator.wikimedia.org/T379549) [10:14:26] (03CR) 10Ladsgroup: [C:03+2] dumps: Update legal.html file to list different licences for Wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1072268 (owner: 
10Jforrester) [10:15:14] (03CR) 10Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) (owner: 10Cathal Mooney) [10:15:59] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host frbast1002 [10:16:01] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host frbast1002 [10:16:15] (03CR) 10JMeybohm: [V:03+1 C:03+2] Add restricted users to deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1114963 (https://phabricator.wikimedia.org/T378429) (owner: 10JMeybohm) [10:16:56] (03PS8) 10Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) [10:17:20] (03PS1) 10Elukey: Revert "admin_ng: enforce restricted PSS on ml-staging-codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115339 [10:19:24] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host frbast1002 [10:19:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host frbast1002 [10:19:47] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: quote install instructions [puppet] - 10https://gerrit.wikimedia.org/r/1115335 (owner: 10Filippo Giunchedi) [10:19:56] (03PS3) 10Ladsgroup: mediawiki: Add Uncategorizedpages cron for commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1109526 (https://phabricator.wikimedia.org/T369024) [10:19:56] (03PS1) 10Elukey: services: bump kartotherian's pod cpu spec to 4 CPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115340 (https://phabricator.wikimedia.org/T216826) [10:20:03] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109526 (https://phabricator.wikimedia.org/T369024) (owner: 
10Ladsgroup) [10:21:45] (03CR) 10Elukey: [C:03+2] Revert "admin_ng: enforce restricted PSS on ml-staging-codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115339 (owner: 10Elukey) [10:23:05] (03CR) 10CI reject: [V:04-1] Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) (owner: 10Cathal Mooney) [10:25:36] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10507699 (10MoritzMuehlenhoff) [10:26:28] (03CR) 10Ladsgroup: [C:03+2] mediawiki: Add Uncategorizedpages cron for commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1109526 (https://phabricator.wikimedia.org/T369024) (owner: 10Ladsgroup) [10:26:43] (03PS2) 10Ladsgroup: mediawiki: Remove special-case wikitech update query page runs [puppet] - 10https://gerrit.wikimedia.org/r/1109527 [10:27:02] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Remove special-case wikitech update query page runs [puppet] - 10https://gerrit.wikimedia.org/r/1109527 (owner: 10Ladsgroup) [10:31:31] (03PS9) 10Cathal Mooney: Network: add qos and sflow config for configure-switch-interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) [10:36:56] (03CR) 10Cathal Mooney: Class-of-service: don't insert comment with host name under cos/ints (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1115134 (https://phabricator.wikimedia.org/T379549) (owner: 10Cathal Mooney) [10:37:07] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: test new s4 backups [10:37:39] oh, I downtimed the wrong host [10:38:04] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: prepare for decom [10:39:10] !log jynus@cumin1002 START 
- Cookbook sre.hosts.remove-downtime for db2201.codfw.wmnet [10:39:11] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2201.codfw.wmnet [10:44:32] !log installing util-linux bugfix updates from bookworm point release [10:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:49:01] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384281#10507819 (10Peachey88) →14Duplicate dup:03T382984 [10:49:03] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10507822 (10Peachey88) [10:49:14] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384415#10507824 (10Peachey88) →14Duplicate dup:03T382984 [10:49:15] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10507826 (10Peachey88) [10:53:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72847 and previous config saved to /var/cache/conftool/dbconfig/20250130-105327-root.json [10:56:08] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:57:04] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:58:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Okay to deploy once the Wikibase change is fully rolled out. 
(If it’s accidentally deployed earlier, or if the train unexpectedly gets rol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115006 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor) [10:58:46] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Okay to deploy at any time (even before the parent change, I think)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115013 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor) [10:59:52] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Okay to deploy once the Wikibase change is fully deployed. (Unlike I27cd944c5f, I think this change *would* start to cause errors in produ" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor) [10:59:54] Lucas_WMDE: thanks for the +2 :) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1100) [11:01:34] (03PS1) 10Hashar: Fix response error handling in FlickrBlacklist [extensions/UploadWizard] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115344 (https://phabricator.wikimedia.org/T385143) [11:02:43] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1151.eqiad.wmnet [11:03:06] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2144.codfw.wmnet [11:08:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72848 and previous config saved to /var/cache/conftool/dbconfig/20250130-110832-root.json [11:09:00] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1151.eqiad.wmnet [11:09:38] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2144.codfw.wmnet [11:10:05] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1153.eqiad.wmnet [11:10:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment 
in the [Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/UploadWizard] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115344 (https://phabricator.wikimedia.org/T385143) (owner: 10Hashar) [11:11:42] (03PS1) 10Ladsgroup: Bump portal to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115349 (https://phabricator.wikimedia.org/T368221) [11:12:13] jouncebot: nowandnext [11:12:13] For the next 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1100) [11:12:14] In 1 hour(s) and 47 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1300) [11:12:35] (03CR) 10Ladsgroup: [C:03+2] Bump portal to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115349 (https://phabricator.wikimedia.org/T368221) (owner: 10Ladsgroup) [11:12:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2172', diff saved to https://phabricator.wikimedia.org/P72849 and previous config saved to /var/cache/conftool/dbconfig/20250130-111244-marostegui.json [11:13:00] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2172.codfw.wmnet [11:13:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115349 (https://phabricator.wikimedia.org/T368221) (owner: 10Ladsgroup) [11:13:20] (03Merged) 10jenkins-bot: Bump portal to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115349 (https://phabricator.wikimedia.org/T368221) (owner: 10Ladsgroup) [11:13:40] (03PS1) 10Marostegui: db2172: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1115350 [11:14:21] (03CR) 10Marostegui: [C:03+2] db2172: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1115350 (owner: 10Marostegui) [11:14:57] sigh, I think I need to sync again 
[11:16:40] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1153.eqiad.wmnet [11:16:59] !log ladsgroup@deploy2002 Synchronized portals/wikipedia.org/assets: Bump portals (T368221 and T373204) (duration: 02m 57s) [11:17:05] T368221: Dark mode for Wikimedia portals (e.g. www.wikipedia.org) - https://phabricator.wikimedia.org/T368221 [11:17:05] T373204: Wikimedia.org page redesign - https://phabricator.wikimedia.org/T373204 [11:20:02] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1223 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1115352 (https://phabricator.wikimedia.org/T385160) [11:21:38] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2172.codfw.wmnet [11:22:21] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Index rebuild [11:22:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2191', diff saved to https://phabricator.wikimedia.org/P72850 and previous config saved to /var/cache/conftool/dbconfig/20250130-112252-marostegui.json [11:22:59] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2191.codfw.wmnet [11:23:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72851 and previous config saved to /var/cache/conftool/dbconfig/20250130-112337-root.json [11:24:18] (03CR) 10Clément Goubert: [C:03+1] scap: remove mw2410, mw1407 from proxy list [puppet] - 10https://gerrit.wikimedia.org/r/1115333 (https://phabricator.wikimedia.org/T384196) (owner: 10Hnowlan) [11:24:29] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10507985 (10MoritzMuehlenhoff) [11:24:44] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10507989 (10MoritzMuehlenhoff) [11:28:00] 
!log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2191.codfw.wmnet [11:28:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72852 and previous config saved to /var/cache/conftool/dbconfig/20250130-112836-root.json [11:28:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2224', diff saved to https://phabricator.wikimedia.org/P72853 and previous config saved to /var/cache/conftool/dbconfig/20250130-112853-marostegui.json [11:28:59] !log ladsgroup@deploy2002 Synchronized portals/wikipedia.org/assets: Bump portals (second try) (duration: 11m 07s) [11:29:04] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2224.codfw.wmnet [11:30:40] (03CR) 10Clément Goubert: [C:03+1] services: bump kartotherian's pod cpu spec to 4 CPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115340 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [11:34:43] (03CR) 10Elukey: [C:03+2] services: bump kartotherian's pod cpu spec to 4 CPUs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115340 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [11:34:44] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2224.codfw.wmnet [11:35:37] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2224.codfw.wmnet with reason: Index rebuild [11:38:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T384592)', diff saved to https://phabricator.wikimedia.org/P72854 and previous config saved to /var/cache/conftool/dbconfig/20250130-113833-marostegui.json [11:38:39] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [11:38:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 75%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P72855 and previous config saved to /var/cache/conftool/dbconfig/20250130-113842-root.json [11:38:58] (03Abandoned) 10Ladsgroup: mariadb: Promote db1223 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1115352 (https://phabricator.wikimedia.org/T385160) (owner: 10Gerrit maintenance bot) [11:39:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [11:39:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10508023 (10ops-monitoring-bot) Draining ganeti2029.codfw.wmnet of running VMs [11:40:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet [11:40:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet [11:41:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10508024 (10ops-monitoring-bot) Draining ganeti2029.codfw.wmnet of running VMs [11:43:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72856 and previous config saved to /var/cache/conftool/dbconfig/20250130-114341-root.json [11:45:37] (03CR) 10Tiziano Fogli: [C:03+1] vopsbot: sync db when needed [puppet] - 10https://gerrit.wikimedia.org/r/1115014 (https://phabricator.wikimedia.org/T375143) (owner: 10Filippo Giunchedi) [11:47:13] (03CR) 10Hnowlan: [C:03+2] scap: remove mw2410, mw1407 from proxy list [puppet] - 10https://gerrit.wikimedia.org/r/1115333 (https://phabricator.wikimedia.org/T384196) (owner: 10Hnowlan) [11:52:02] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[11:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:53:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P72857 and previous config saved to /var/cache/conftool/dbconfig/20250130-115340-marostegui.json [11:53:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72858 and previous config saved to /var/cache/conftool/dbconfig/20250130-115348-root.json [11:55:19] jouncebot: nowandnext [11:55:19] For the next 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1100) [11:55:19] In 1 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1300) [11:55:31] (03CR) 10Muehlenhoff: [C:03+2] Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:58:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72859 and previous config saved to /var/cache/conftool/dbconfig/20250130-115846-root.json [12:00:26] (03PS3) 10Vgutierrez: lvs: Fix puppet compiler error on missing NIC [puppet] - 10https://gerrit.wikimedia.org/r/1115233 (https://phabricator.wikimedia.org/T384477) [12:04:11] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1115233 
(https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [12:04:44] (03PS1) 10Muehlenhoff: maps: Configure master_bookworm and replica_bookworm roles for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1115355 [12:04:50] gonna test scap with no proxies on deploy2002, which means stopping puppet and doing a sync-world [12:06:53] (03PS2) 10Muehlenhoff: maps: Configure master_bookworm and replica_bookworm roles for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1115355 [12:08:41] !log hnowlan@deploy2002 Started scap sync-world: testing removal of scap proxies [12:08:42] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:08:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P72860 and previous config saved to /var/cache/conftool/dbconfig/20250130-120847-marostegui.json [12:09:28] (03CR) 10Vgutierrez: conftool: rm ats-be services cache nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [12:31:26] (03PS1) 10Clément Goubert: mediawiki: Fix fixtures files for mwcron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115364 (https://phabricator.wikimedia.org/T341555) [12:31:27] (03PS1) 10Clément Goubert: mw-cron: Create service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115365 (https://phabricator.wikimedia.org/T341555) [12:37:15] (03CR) 10Hnowlan: [C:03+1] mediawiki: Fix fixtures files for mwcron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115364 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:37:32] (03CR) 10Hnowlan: [C:03+1] mw-cron: Create service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115365 
(https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:37:45] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Fix fixtures files for mwcron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115364 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:38:37] (03CR) 10Clément Goubert: [C:03+2] mw-cron: Create service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115365 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:39:30] hashar: np ^^ [12:39:32] (03Merged) 10jenkins-bot: mediawiki: Fix fixtures files for mwcron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115364 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:40:39] (03Merged) 10jenkins-bot: mw-cron: Create service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115365 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:44:32] (03CR) 10Ladsgroup: [C:03+2] beta: Set categorylinks to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115362 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [12:45:14] (03Merged) 10jenkins-bot: beta: Set categorylinks to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115362 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [12:45:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2001.codfw.wmnet with OS bookworm [12:45:44] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10508161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm [12:50:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72865 and previous config saved to /var/cache/conftool/dbconfig/20250130-125004-root.json 
[12:51:59] (03PS1) 10Clément Goubert: mediawiki: fix cronjob schedule quoting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115367 [12:54:06] (03PS1) 10AOkoth: os-reports: increase resource requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115368 (https://phabricator.wikimedia.org/T350794) [12:55:12] (03PS1) 10Gergő Tisza: Do not disable extensions on SUL3 shared authentication domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115369 (https://phabricator.wikimedia.org/T373737) [12:55:14] (03CR) 10JMeybohm: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [12:55:35] (03CR) 10Jelto: [C:03+1] "worth a try, lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115368 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [12:56:12] (03PS1) 10Jcrespo: dbbackups: Prepare for decommission of db2139 [puppet] - 10https://gerrit.wikimedia.org/r/1115370 (https://phabricator.wikimedia.org/T383971) [12:56:19] (03CR) 10Clément Goubert: [C:03+2] mediawiki: fix cronjob schedule quoting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115367 (owner: 10Clément Goubert) [12:57:43] (03CR) 10Marostegui: [C:03+1] dbbackups: Prepare for decommission of db2139 [puppet] - 10https://gerrit.wikimedia.org/r/1115370 (https://phabricator.wikimedia.org/T383971) (owner: 10Jcrespo) [12:58:50] (03Merged) 10jenkins-bot: mediawiki: fix cronjob schedule quoting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115367 (owner: 10Clément Goubert) [12:59:40] (03CR) 10AOkoth: [C:03+2] os-reports: increase resource requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115368 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1300) [13:00:59] (03Merged) 10jenkins-bot: os-reports: increase resource requests [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1115368 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:02:24] (03CR) 10Bartosz Dziewoński: [C:03+1] "Alas." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115369 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [13:03:29] !log jayme@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 7 hosts with reason: K8s update [13:05:00] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:05:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72866 and previous config saved to /var/cache/conftool/dbconfig/20250130-130509-root.json [13:06:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [13:06:55] (03PS8) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) [13:07:07] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:43] (03CR) 10Hashar: [V:03+2 C:03+2] Do not copy Code-Review +2 (take 2) [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1115068 (owner: 10Hashar) [13:07:49] (03PS9) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) [13:07:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubestagemaster_6443: Servers kubestagemaster2004.codfw.wmnet are marked down but pooled: k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [13:08:16] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubestagemaster_6443: Servers kubestagemaster2004.codfw.wmnet are marked down but pooled: k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:09:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [13:11:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [13:11:56] FIRING: CalicoTyphaDown: Too few (0) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [13:11:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [13:12:07] RESOLVED: SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:26] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:15:48] jouncebot: now [13:15:48] For the next 0 hour(s) and 44 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1300) [13:16:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [13:16:32] I am gonna 
back port https://gerrit.wikimedia.org/r/c/1115344/ [13:16:42] in the interest of time for the backport window [13:17:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/UploadWizard] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115344 (https://phabricator.wikimedia.org/T385143) (owner: 10Hashar) [13:19:32] 👍 [13:20:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72867 and previous config saved to /var/cache/conftool/dbconfig/20250130-132014-root.json [13:22:29] CI now only runs 3 jobs: phan, npm test and Quibble tests [13:22:48] (03PS1) 10Slyngshede: Permissions LDAP group validator [software/bitu] - 10https://gerrit.wikimedia.org/r/1115375 [13:23:45] (03Merged) 10jenkins-bot: Fix response error handling in FlickrBlacklist [extensions/UploadWizard] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115344 (https://phabricator.wikimedia.org/T385143) (owner: 10Hashar) [13:24:06] (03CR) 10Gergő Tisza: Add 'auth' docroot with custom files (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [13:24:54] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1115344|Fix response error handling in FlickrBlacklist (T385143)]] [13:24:59] T385143: PHP Deprecated: json_decode(): Passing null to parameter #1 ($json) of type string is deprecated - https://phabricator.wikimedia.org/T385143 [13:25:28] hashar: :O [13:25:32] (03CR) 10Gergő Tisza: [C:03+1] Add 'auth' docroot with custom files [puppet] - 10https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [13:25:34] (re: CI only 3 jobs) [13:25:51] is that specific for backports? 
[13:26:21] UploadWizard does not trigger the shared wmf-quibble jobs [13:26:31] ah, I see [13:26:35] and yesterday we removed the Selenium jobs [13:26:44] well for patches made to wmf/ [13:26:54] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10508237 (10RobH) 05In progress→03Open [13:27:02] but most backports would still test the other PHP versions? [13:27:32] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10508239 (10RobH) It doesn't matter that much, as long as it is in the racking column on the #ops-eqiad workboard with no one assigned, the on-sites know... [13:27:49] only php7.4 [13:27:58] !log hashar@deploy2002 hashar: Backport for [[gerrit:1115344|Fix response error handling in FlickrBlacklist (T385143)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:28:03] hm [13:28:04] I think *James went to add 8.1 back in April [13:28:23] now that apparently 8.1 is reaching production, I guess yes, CI should test it [13:28:25] (just saw it on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Babel/+/1115062 too) [13:29:16] https://gerrit.wikimedia.org/r/c/integration/config/+/1115147 uploaded earlier today (UTC) ^^ [13:30:37] !log hashar@deploy2002 hashar: Continuing with sync [13:30:53] ah yeah [13:31:22] (03CR) 10CDanis: [C:03+1] gNMIc: Add BGP stats collection for network devices [puppet] - 10https://gerrit.wikimedia.org/r/1115002 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [13:34:49] (03PS1) 10Revi: kowikisource: Add Draft(_talk) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) [13:35:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P72868 and previous config saved to /var/cache/conftool/dbconfig/20250130-133519-root.json [13:36:49] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115344|Fix response error handling in FlickrBlacklist (T385143)]] (duration: 11m 54s) [13:36:54] T385143: PHP Deprecated: json_decode(): Passing null to parameter #1 ($json) of type string is deprecated - https://phabricator.wikimedia.org/T385143 [13:37:32] one less issue [13:37:37] (03CR) 10Fabfur: [C:03+1] "LGTM but not an expert on this, so I'd wait for another pair of eyes" [puppet] - 10https://gerrit.wikimedia.org/r/1115233 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [13:37:44] yay [13:38:11] hashar: should I +2 the REL backport of that btw? [13:38:24] (I thought usually it’s okay for the uploader to +2 such backports themself) [13:41:55] Lucas_WMDE: please do yes :) I am not sure what the process is for REL branches [13:42:10] short of pinging S.a.m R.e.e.d [13:42:12] :b [13:43:14] ^^ [13:43:26] but that reminds me, I should probably backport the Wikibase fixes… [13:44:21] is anyone using Wikibase besides WMF? [13:44:26] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [13:44:36] or do we need to maintain the REL branches of Wikibase? 
[13:44:47] lots of people use it [13:44:56] ah cool [13:45:00] there’s separate wikibase suite releases but I’m pretty sure they start from the REL branches [13:45:05] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [13:45:09] so I’m happy to backport there and then let the other team take it from there [13:46:53] (03PS1) 10JMeybohm: k8s.wipe-cluster: Allow to specify downtime length [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) [13:46:54] I wish I could stop gerrit from adding -branchname to the topic when cherry-picking [13:47:38] hmm [13:47:52] that sounds "new" [13:47:59] (03CR) 10Bartosz Dziewoński: Add 'auth' docroot with custom files (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [13:48:03] I mean, that might be a new behavior [13:48:14] I think I remember it from a few months ago at least [13:48:16] feel free to file it in our Phab against #gerrit so that can at least be tracked [13:48:20] should be easy enough to find evidence on older changes ^^ [13:48:21] yeah sure [13:48:39] (inb4 “a few months is pretty new” ;)) [13:49:02] (03PS1) 10Sergio Gimeno: SuggestedEditSession: remove incorrect cast to integer [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115383 (https://phabricator.wikimedia.org/T385117) [13:49:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) (owner: 10Revi) [13:49:28] ooh, “intopic” search keyword works [13:49:33] here, isn’t this topic fun https://gerrit.wikimedia.org/r/c/mediawiki/extensions/OAuth/+/1112801 [13:49:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115383 (https://phabricator.wikimedia.org/T385117) (owner: 10Sergio Gimeno) [13:49:46] (but that was only the other week) [13:50:16] heh, that's kind of my fault for cherry-picking them in a chain, so to speak. instead of all from master [13:50:18] java/com/google/gerrit/server/restapi/change/CherryPickChange.java:415 [13:50:19] newTopic = sourceChange.getTopic() + "-" + dest.shortName(); [13:50:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72869 and previous config saved to /var/cache/conftool/dbconfig/20250130-135024-root.json [13:50:28] MatmaRex: nah, that’s perfectly reasonable to do imho [13:50:34] avoid having to resolve merge conflicts twice [13:50:38] I blame gerrit ^^ [13:50:48] (03PS2) 10Sergio Gimeno: SuggestedEditSession: remove incorrect cast to integer [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115383 (https://phabricator.wikimedia.org/T385117) [13:50:50] these topic suffixes haven't always been here, but they've been added for a while. 
i want to say like a year or so [13:50:59] !log jayme@cumin1002 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster staging-codfw: Kubernetes upgrade [13:51:16] (03PS1) 10Sergio Gimeno: SuggestedEditSession: remove incorrect cast to integer [extensions/GrowthExperiments] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115384 (https://phabricator.wikimedia.org/T385117) [13:51:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115384 (https://phabricator.wikimedia.org/T385117) (owner: 10Sergio Gimeno) [13:51:53] urbanecm: I guess we can backport all your four patches at the same time? [13:52:00] hashar: that's my plan! [13:52:07] lets go for it now so [13:52:11] okay [13:52:17] want me to do it? [13:52:50] hashar: up to you! they're a prep for a release, at this point, they're all a no-op [13:53:37] so there is no need to backport them to wmf/ ? 
[13:53:39] oh no [13:53:40] ok [13:53:40] sorry [13:54:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115062 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [13:54:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115059 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [13:54:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115336 (https://phabricator.wikimedia.org/T385024) (owner: 10Urbanecm) [13:54:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115337 (https://phabricator.wikimedia.org/T385024) (owner: 10Urbanecm) [13:54:17] hashar: we need them in wmf, but they will be used once i switch the config too [13:54:21] hashar, MatmaRex: T385168 fyi [13:54:21] T385168: Gerrit adds branch name to topic when cherry-picking - https://phabricator.wikimedia.org/T385168 [13:54:34] (03CR) 10JMeybohm: [C:03+2] CI: Fix helm errors hiding behind YAML parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113980 (owner: 10JMeybohm) [13:54:34] Lucas_WMDE: thank you! 
[13:54:37] (03CR) 10JMeybohm: [C:03+2] CI: Ensure admin checks don't run unnecessary template calls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113979 (owner: 10JMeybohm) [13:54:39] (03CR) 10JMeybohm: [C:03+2] Update wikikube istio 1.24.2 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113474 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:54:42] (03CR) 10JMeybohm: [C:03+2] Create a copy of the wikikube istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113473 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:54:46] (03CR) 10JMeybohm: [C:03+2] Update staging-codfw to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:55:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72870 and previous config saved to /var/cache/conftool/dbconfig/20250130-135559-root.json [13:56:13] (03PS1) 10Filippo Giunchedi: pontoon: sort and improve list-hosts [puppet] - 10https://gerrit.wikimedia.org/r/1115385 [13:56:55] (03CR) 10Filippo Giunchedi: [C:03+2] vopsbot: sync db when needed [puppet] - 10https://gerrit.wikimedia.org/r/1115014 (https://phabricator.wikimedia.org/T375143) (owner: 10Filippo Giunchedi) [13:58:53] (03CR) 10Tiziano Fogli: [C:03+1] pontoon: sort and improve list-hosts [puppet] - 10https://gerrit.wikimedia.org/r/1115385 (owner: 10Filippo Giunchedi) [13:59:48] (03CR) 10JMeybohm: [C:03+2] Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1400). 
[14:00:05] MatmaRex, urbanecm, hashar, and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] o/ [14:00:18] hello [14:00:26] hey [14:00:29] hi [14:00:37] * urbanecm is around to deploy if needed [14:00:43] i actually need to make a change to my config patch, need a few minutes [14:01:06] hashar is deploying urbanecm’s backports right now, right? [14:02:51] (03PS2) 10Bartosz Dziewoński: Add 'auth' docroot with custom files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952) [14:03:15] (03CR) 10Bartosz Dziewoński: Add 'auth' docroot with custom files (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [14:03:22] (ready) [14:03:24] I could probably deploy afterwards [14:03:34] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: sort and improve list-hosts [puppet] - 10https://gerrit.wikimedia.org/r/1115385 (owner: 10Filippo Giunchedi) [14:03:50] (03PS1) 10Jelto: Build helm3.17 with new upstream version [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) [14:04:48] (03Merged) 10jenkins-bot: migrateConfigToCommunity: Handle false BabelMainCategory [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115062 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [14:04:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet [14:05:04] (03Merged) 10jenkins-bot: migrateConfigToCommunity: Handle false BabelMainCategory [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115059 (https://phabricator.wikimedia.org/T384941) (owner: 10Urbanecm) [14:05:53] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2194 (T384592)', diff saved to https://phabricator.wikimedia.org/P72871 and previous config saved to /var/cache/conftool/dbconfig/20250130-140553-marostegui.json [14:05:58] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:06:20] (03Merged) 10jenkins-bot: migrateConfigToCommunity: Include an edit summary [extensions/Babel] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115336 (https://phabricator.wikimedia.org/T385024) (owner: 10Urbanecm) [14:06:21] (03Merged) 10jenkins-bot: migrateConfigToCommunity: Include an edit summary [extensions/Babel] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115337 (https://phabricator.wikimedia.org/T385024) (owner: 10Urbanecm) [14:06:24] (03CR) 10Jelto: "Does this makes sense to you? Are we renaming helm311 to helm317 or add another package?" [debs/helm3] - 10https://gerrit.wikimedia.org/r/1115388 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [14:06:49] (03PS1) 10AOkoth: debug: troubleshooting deployment issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115389 (https://phabricator.wikimedia.org/T350794) [14:06:52] (i'm away for a minute, i can go last) [14:06:57] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1115062|migrateConfigToCommunity: Handle false BabelMainCategory (T384941)]], [[gerrit:1115059|migrateConfigToCommunity: Handle false BabelMainCategory (T384941)]], [[gerrit:1115336|migrateConfigToCommunity: Include an edit summary (T385024)]], [[gerrit:1115337|migrateConfigToCommunity: Include an edit summary (T385024)]] [14:07:02] T384941: Setting wgBabelCategoryNames[level] to false is not supported by the migration script - https://phabricator.wikimedia.org/T384941 [14:07:03] T385024: Babel's migration script should include an edit summary - https://phabricator.wikimedia.org/T385024 [14:08:26] (03CR) 10AOkoth: [C:03+2] debug: troubleshooting deployment issues 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1115389 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:10:00] !log hashar@deploy2002 urbanecm, hashar: Backport for [[gerrit:1115062|migrateConfigToCommunity: Handle false BabelMainCategory (T384941)]], [[gerrit:1115059|migrateConfigToCommunity: Handle false BabelMainCategory (T384941)]], [[gerrit:1115336|migrateConfigToCommunity: Include an edit summary (T385024)]], [[gerrit:1115337|migrateConfigToCommunity: Include an edit summary (T385024)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:10:08] (03PS1) 10Arturo Borrero Gonzalez: prometheus-node-kernel-messages: add logic to ignore messages [puppet] - 10https://gerrit.wikimedia.org/r/1115391 (https://phabricator.wikimedia.org/T380960) [14:10:15] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:10:41] hashar: feel free to continue [14:11:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72872 and previous config saved to /var/cache/conftool/dbconfig/20250130-141104-root.json [14:11:52] !log hashar@deploy2002 urbanecm, hashar: Continuing with sync [14:12:15] (03CR) 10Volans: k8s.wipe-cluster: Allow to specify downtime length (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:12:28] bah I should have done sergi0 patches while at it [14:14:34] No worries, I can wait [14:15:30] sergi0: i will do both of your patches next [14:15:45] great [14:16:32] (03PS1) 10JMeybohm: Explicitely cast string to integer [puppet] - 10https://gerrit.wikimedia.org/r/1115393 (https://phabricator.wikimedia.org/T341984) [14:16:53] (03CR) 10CI reject: [V:04-1] Explicitely cast string to integer [puppet] - 10https://gerrit.wikimedia.org/r/1115393 (https://phabricator.wikimedia.org/T341984) (owner: 
10JMeybohm) [14:16:54] (03PS1) 10Elukey: knative: backport patch from 1.8.x release [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1115394 (https://phabricator.wikimedia.org/T369493) [14:18:16] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115062|migrateConfigToCommunity: Handle false BabelMainCategory (T384941)]], [[gerrit:1115059|migrateConfigToCommunity: Handle false BabelMainCategory (T384941)]], [[gerrit:1115336|migrateConfigToCommunity: Include an edit summary (T385024)]], [[gerrit:1115337|migrateConfigToCommunity: Include an edit summary (T385024)]] (duration: 11m 19s) [14:18:22] T384941: Setting wgBabelCategoryNames[level] to false is not supported by the migration script - https://phabricator.wikimedia.org/T384941 [14:18:22] T385024: Babel's migration script should include an edit summary - https://phabricator.wikimedia.org/T385024 [14:19:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115383 (https://phabricator.wikimedia.org/T385117) (owner: 10Sergio Gimeno) [14:19:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115384 (https://phabricator.wikimedia.org/T385117) (owner: 10Sergio Gimeno) [14:19:35] (03CR) 10FNegri: prometheus-node-kernel-messages: add logic to ignore messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1115391 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [14:19:37] (03PS2) 10JMeybohm: Explicitely cast string to integer [puppet] - 10https://gerrit.wikimedia.org/r/1115393 (https://phabricator.wikimedia.org/T341984) [14:19:48] !log stopped puppet on all kubernetes hosts [14:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:55] (03CR) 10Klausman: [C:03+1] 
knative: backport patch from 1.8.x release (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1115394 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [14:20:03] (03CR) 10CI reject: [V:04-1] Explicitely cast string to integer [puppet] - 10https://gerrit.wikimedia.org/r/1115393 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:20:32] (i'm back, ready whenever you are) [14:20:37] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:21:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P72873 and previous config saved to /var/cache/conftool/dbconfig/20250130-142100-marostegui.json [14:21:04] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4905/console" [puppet] - 10https://gerrit.wikimedia.org/r/1115393 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:21:17] (03PS3) 10Bartosz Dziewoński: Define new 'auth' docroot with custom files for the auth domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952) [14:21:37] (03PS5) 10Bartosz Dziewoński: Use new 'auth' docroot for the auth domain [puppet] - 10https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) [14:22:04] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:22:05] (03PS2) 10Elukey: knative: backport patch from 1.8.x release [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1115394 (https://phabricator.wikimedia.org/T369493) [14:22:32] (03PS1) 10Reedy: FancyCaptcha: Return early in passCaptcha in numerous cases [extensions/ConfirmEdit] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115395 (https://phabricator.wikimedia.org/T384858) [14:22:40] (03PS1) 10Reedy: 
FancyCaptcha: Return early in passCaptcha in numerous cases [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115396 (https://phabricator.wikimedia.org/T384858) [14:22:48] (03CR) 10FNegri: "An initial list of known errors that can be ignored can be found at:" [puppet] - 10https://gerrit.wikimedia.org/r/1115391 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [14:23:29] (03PS3) 10JMeybohm: Explicitely cast string to integer [puppet] - 10https://gerrit.wikimedia.org/r/1115393 (https://phabricator.wikimedia.org/T341984) [14:24:46] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4906/console" [puppet] - 10https://gerrit.wikimedia.org/r/1115393 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:26:08] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1115393 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:26:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72875 and previous config saved to /var/cache/conftool/dbconfig/20250130-142609-root.json [14:26:43] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.01.11 - 2025.01.31), 13Patch-For-Review: Q3:rack/setup/install elastic1108-elastic1122 - https://phabricator.wikimedia.org/T384966#10508475 (10bking) [14:27:27] (03CR) 10JMeybohm: [V:03+1 C:03+2] Explicitely cast string to integer [puppet] - 10https://gerrit.wikimedia.org/r/1115393 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [14:30:47] (03CR) 10Urbanecm: "should be ready to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114002 (https://phabricator.wikimedia.org/T374348) (owner: 10Urbanecm) [14:31:07] (03PS2) 10Urbanecm: [testwiki] Babel: Enable CommunityConfiguration integration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1114002 (https://phabricator.wikimedia.org/T374348) [14:32:14] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:34:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:35:29] (03CR) 10Bking: [C:03+1] "CCing Jesse from IF for awareness, as we are creating a new EFI-based partman recipe in this CR." [puppet] - 10https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) (owner: 10Ryan Kemper) [14:35:39] (03Merged) 10jenkins-bot: SuggestedEditSession: remove incorrect cast to integer [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115383 (https://phabricator.wikimedia.org/T385117) (owner: 10Sergio Gimeno) [14:35:46] (03PS1) 10AOkoth: miscweb: image with rsync disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115402 [14:35:53] (03CR) 10CI reject: [V:04-1] miscweb: image with rsync disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115402 (owner: 10AOkoth) [14:36:00] (03PS1) 10Urbanecm: Babel: Enable CommunityConfiguration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115403 (https://phabricator.wikimedia.org/T374348) [14:36:01] (03PS1) 10Urbanecm: CommunityConfiguration: Enable on all wikis except locked down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115404 (https://phabricator.wikimedia.org/T383910) [14:36:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P72876 and previous config saved to /var/cache/conftool/dbconfig/20250130-143607-marostegui.json [14:36:24] (03Abandoned) 10AOkoth: miscweb: image with rsync disabled [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1115402 (owner: 10AOkoth) [14:36:51] (03Merged) 10jenkins-bot: SuggestedEditSession: remove incorrect cast to integer [extensions/GrowthExperiments] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115384 (https://phabricator.wikimedia.org/T385117) (owner: 10Sergio Gimeno) [14:37:05] sergi0: in progress :) [14:37:22] cool, almost there :) [14:37:23] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1115383|SuggestedEditSession: remove incorrect cast to integer (T385117)]], [[gerrit:1115384|SuggestedEditSession: remove incorrect cast to integer (T385117)]] [14:37:29] T385117: [wmf.13] - errors for mediawiki.structured_task.article.link_suggestion_interaction and eventlogging_HelpPanel - https://phabricator.wikimedia.org/T385117 [14:38:05] (03PS1) 10AOkoth: miscweb: os-reports image with rsync disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115406 [14:39:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Check link from msw1-eqiad et-0/1/0 to msw2-eqiad et-0/1/0 - https://phabricator.wikimedia.org/T384708#10508615 (10Papaul) 05Open→03Resolved a:03Papaul We are not seeing any errors for the last 24 hours resolving this task fo... 
[14:39:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:40:29] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:41:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72877 and previous config saved to /var/cache/conftool/dbconfig/20250130-144115-root.json [14:41:17] (03CR) 10AOkoth: [C:03+2] miscweb: os-reports image with rsync disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115406 (owner: 10AOkoth) [14:42:39] (03Merged) 10jenkins-bot: miscweb: os-reports image with rsync disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115406 (owner: 10AOkoth) [14:43:23] !log aokoth@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:43:38] !log aokoth@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:44:26] !log hashar@deploy2002 hashar, sgimeno: Backport for [[gerrit:1115383|SuggestedEditSession: remove incorrect cast to integer (T385117)]], [[gerrit:1115384|SuggestedEditSession: remove incorrect cast to integer (T385117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:44:30] T385117: [wmf.13] - errors for mediawiki.structured_task.article.link_suggestion_interaction and eventlogging_HelpPanel - https://phabricator.wikimedia.org/T385117 [14:44:38] !log hashar@deploy2002 hashar, sgimeno: Continuing with sync [14:44:39] * sergi0 testing [14:44:39] RECOVERY - 
mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 4.988 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:44:46] oh [14:44:57] sergi0: well I have pressed yes already ooops :\ [14:45:13] MatmaRex: patches are almost done [14:45:19] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:45:55] I'm gonna clear up some PHP 8.1 deprecated logspam after... [14:46:18] hashar: no problem, I'm pretty confident with it [14:46:32] (03CR) 10Reedy: [C:03+2] FancyCaptcha: Return early in passCaptcha in numerous cases [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115396 (https://phabricator.wikimedia.org/T384858) (owner: 10Reedy) [14:46:34] (03CR) 10Reedy: [C:03+2] FancyCaptcha: Return early in passCaptcha in numerous cases [extensions/ConfirmEdit] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115395 (https://phabricator.wikimedia.org/T384858) (owner: 10Reedy) [14:47:00] (03CR) 10Clément Goubert: k8s.pool-depool-node: Add support to downtime/remove downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [14:48:06] (03PS1) 10Gerrit maintenance bot: Add knc to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1115411 (https://phabricator.wikimedia.org/T385181) [14:49:10] (03PS1) 10Clément Goubert: mediawiki: Use quote pipeline because yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115413 [14:49:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100871 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [14:49:50] (03PS2) 10Clément Goubert: mediawiki: Use quote pipeline because yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115413 [14:50:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC late 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115074 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno) [14:50:15] (03CR) 10Vgutierrez: [C:03+1] "looks good, please check that this doesn't break deployment-prep (aka the beta cluster)" [puppet] - 10https://gerrit.wikimedia.org/r/1115086 (owner: 10BCornwall) [14:51:04] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115383|SuggestedEditSession: remove incorrect cast to integer (T385117)]], [[gerrit:1115384|SuggestedEditSession: remove incorrect cast to integer (T385117)]] (duration: 13m 41s) [14:51:10] T385117: [wmf.13] - errors for mediawiki.structured_task.article.link_suggestion_interaction and eventlogging_HelpPanel - https://phabricator.wikimedia.org/T385117 [14:51:12] 10ops-magru, 06SRE: Degraded RAID on cp7004 - https://phabricator.wikimedia.org/T380905#10508737 (10RobH) 05Open→03Resolved a:03RobH This was resolved by the reshuffle work and currently no raid errors. 
[14:51:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T384592)', diff saved to https://phabricator.wikimedia.org/P72880 and previous config saved to /var/cache/conftool/dbconfig/20250130-145114-marostegui.json [14:51:19] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:51:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2205.codfw.wmnet with reason: Maintenance [14:51:31] hashar: thanks, i'm ready whenever [14:51:36] 10ops-magru: PowerSupplyFailure - https://phabricator.wikimedia.org/T380897#10508743 (10RobH) 05Open→03Resolved a:03RobH Being worked via T381446 [14:51:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T384592)', diff saved to https://phabricator.wikimedia.org/P72881 and previous config saved to /var/cache/conftool/dbconfig/20250130-145136-marostegui.json [14:51:44] (03CR) 10Ssingh: [C:03+1] "Looks good -- initially, the details of this were a bit hazy for me but I think it makes sense and one of the reasons we probably never ca" [puppet] - 10https://gerrit.wikimedia.org/r/1115233 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:52:20] 10ops-magru, 06DC-Ops: Power supply failure (PSU) for cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10508750 (10RobH) This work window to start shortly, they have all afternoon though to show up. With it just being a PSU swap, no user impact is expected. No maint window set in icinga, since... [14:52:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115369 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [14:52:38] MatmaRex: all done! [14:52:42] sergi0: your patches are live! 
[14:52:53] (03CR) 10Vgutierrez: [C:04-1] varnish: Fix claim obj.hits isn't known in vcl_hit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113591 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [14:53:05] great, thank you! [14:53:23] MatmaRex: can you deploy yours by yourself? [14:53:29] hashar: no, i was gonna ask [14:53:36] are you able to deploy them as well? [14:53:57] (03CR) 10Vgutierrez: [C:03+2] lvs: Fix puppet compiler error on missing NIC [puppet] - 10https://gerrit.wikimedia.org/r/1115233 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:54:12] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [14:54:52] yeah [14:55:35] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [14:55:42] no hiccups about proxies so far I hope :) [14:56:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2172 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P72882 and previous config saved to /var/cache/conftool/dbconfig/20250130-145620-root.json [14:56:22] hashar: there's nothing to test on mwdebug for my patches - one is a no-op, the other's effect is cached so we'll only see the effect in 24 hours [14:56:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113476 (https://phabricator.wikimedia.org/T383916) (owner: 10Bartosz Dziewoński) [14:56:36] hnowlan: nop indeed!! [14:56:43] phew [14:56:46] hnowlan: thank you for the quick cleanup! 
[14:57:44] (03Merged) 10jenkins-bot: Use full URLs for wgUploadNavigationUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113476 (https://phabricator.wikimedia.org/T383916) (owner: 10Bartosz Dziewoński) [14:57:50] (03CR) 10Volans: k8s.pool-depool-node: Add support to downtime/remove downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [14:58:15] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1113476|Use full URLs for wgUploadNavigationUrl (T383916)]] [14:58:20] T383916: Sidebar links are broken on shared domain - https://phabricator.wikimedia.org/T383916 [15:00:10] (03CR) 10Volans: k8s.pool-depool-node: Add support to downtime/remove downtime (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [15:00:24] (03PS1) 10Brouberol: envoy: define a specific service entry for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1115416 (https://phabricator.wikimedia.org/T384329) [15:00:28] (03Merged) 10jenkins-bot: FancyCaptcha: Return early in passCaptcha in numerous cases [extensions/ConfirmEdit] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115396 (https://phabricator.wikimedia.org/T384858) (owner: 10Reedy) [15:01:26] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2029.codfw.wmnet with reason: remove from cluster for reimage [15:01:27] (03Merged) 10jenkins-bot: FancyCaptcha: Return early in passCaptcha in numerous cases [extensions/ConfirmEdit] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115395 (https://phabricator.wikimedia.org/T384858) (owner: 10Reedy) [15:01:35] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10508813 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c86b38d9-e3a1-4cba-abc9-083df51a2d3e) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[15:02:30] !log enabled puppet on all kubernetes hosts [15:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:00] !log hashar@deploy2002 hashar, matmarex: Backport for [[gerrit:1113476|Use full URLs for wgUploadNavigationUrl (T383916)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:03:03] !log hashar@deploy2002 hashar, matmarex: Continuing with sync [15:04:27] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti2029.codfw.wmnet [15:04:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:09] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317#10508847 (10Papaul) 05Open→03Resolved a:03Papaul checking the server again today all looks good. I am closing this task we can still re-open if w... 
[15:06:49] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Use quote pipeline because yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115413 (owner: 10Clément Goubert) [15:07:14] (03CR) 10Ladsgroup: [C:03+2] Add knc to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1115411 (https://phabricator.wikimedia.org/T385181) (owner: 10Gerrit maintenance bot) [15:07:46] !log ladsgroup@dns1004 START - running authdns-update [15:09:09] jouncebot: nowandnext [15:09:09] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [15:09:09] In 0 hour(s) and 50 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1600) [15:09:18] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113476|Use full URLs for wgUploadNavigationUrl (T383916)]] (duration: 11m 02s) [15:09:23] T383916: Sidebar links are broken on shared domain - https://phabricator.wikimedia.org/T383916 [15:09:31] (03Merged) 10jenkins-bot: mediawiki: Use quote pipeline because yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115413 (owner: 10Clément Goubert) [15:09:45] (03CR) 10Clément Goubert: k8s.pool-depool-node: Add support to downtime/remove downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1114000 (owner: 10JMeybohm) [15:09:55] (03PS1) 10Elukey: services: bump kartotherian's allowed millicores to 5k [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115420 (https://phabricator.wikimedia.org/T384530) [15:09:56] !log ladsgroup@dns1004 END - running authdns-update [15:10:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [15:10:45] one last change :) [15:11:03] thank you. 
i'm following along :) [15:11:23] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host maps-test2001.codfw.wmnet with OS bookworm [15:11:35] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10508947 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm executed with errors: - maps-test2001... [15:11:52] (03Merged) 10jenkins-bot: Define new 'auth' docroot with custom files for the auth domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115103 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [15:12:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.737s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:12:17] hmm [15:12:22] unexpected commits [15:12:33] so some patches got merged in confirmedit [15:12:36] grr [15:13:38] (03CR) 10Jgiannelos: [C:03+1] services: bump kartotherian's allowed millicores to 5k [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115420 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [15:14:35] ah that is from Reedy [15:14:49] Yeah, I was presuming that because you'd merged all the others and scap was in flight... 
:P [15:15:08] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1115103|Define new 'auth' docroot with custom files for the auth domain (T383952 T384137)]] [15:15:15] T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952 [15:15:15] T384137: Set up robots.txt in auth.wikimedia.org - https://phabricator.wikimedia.org/T384137 [15:15:22] we should have done them in their own batches :b [15:15:31] (03PS1) 10Andrew Bogott: Initial puppet role for new ceph nodes [puppet] - 10https://gerrit.wikimedia.org/r/1115422 (https://phabricator.wikimedia.org/T378828) [15:15:36] anyway, yesterday we dropped the selenium jobs from the wmf branches testing [15:15:45] so backports merge a bit faster now [15:15:54] quite a bit :) [15:16:05] (03CR) 10Andrew Bogott: [C:03+2] Initial puppet role for new ceph nodes [puppet] - 10https://gerrit.wikimedia.org/r/1115422 (https://phabricator.wikimedia.org/T378828) (owner: 10Andrew Bogott) [15:17:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.737s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:18:06] (03CR) 10Elukey: [C:03+2] services: bump kartotherian's allowed millicores to 5k [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115420 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [15:18:25] !log hashar@deploy2002 matmarex, hashar: Backport for [[gerrit:1115103|Define new 'auth' docroot with custom files for the auth domain (T383952 T384137)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:19:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:23:37] !log hashar@deploy2002 matmarex, hashar: Continuing with sync [15:25:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [15:25:57] (03PS1) 10Clément Goubert: mediawiki: Gate comment field for mwcron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115425 [15:26:31] (03PS1) 10Reedy: MultiUsernameFilter: Don't try to split ids if they're not a string [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115429 (https://phabricator.wikimedia.org/T385169) [15:26:36] (03CR) 10Reedy: [C:03+2] MultiUsernameFilter: Don't try to split ids if they're not a string [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115429 (https://phabricator.wikimedia.org/T385169) (owner: 10Reedy) [15:26:44] (03PS1) 10Reedy: MultiUsernameFilter: Don't try to split ids if they're not a string [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115430 (https://phabricator.wikimedia.org/T385169) [15:26:50] (03CR) 10Reedy: [C:03+2] MultiUsernameFilter: Don't try to split ids if they're not a string [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115430 (https://phabricator.wikimedia.org/T385169) (owner: 10Reedy) [15:28:26] Reedy: I will let you deploy those [15:28:33] I need a break :) [15:28:57] (03PS2) 10JMeybohm: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) [15:29:07] (03PS1) 10Reedy: Handle null option value in echomute api [extensions/Echo] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115431 (https://phabricator.wikimedia.org/T384694) [15:29:13] (03CR) 10Reedy: [C:03+2] Handle null option value in echomute api [extensions/Echo] (wmf/1.44.0-wmf.14) - 
10https://gerrit.wikimedia.org/r/1115431 (https://phabricator.wikimedia.org/T384694) (owner: 10Reedy) [15:29:47] (03CR) 10Klausman: [C:03+1] admin_ng: disable PSP binding for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115323 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [15:30:03] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115103|Define new 'auth' docroot with custom files for the auth domain (T383952 T384137)]] (duration: 14m 55s) [15:30:04] (03PS1) 10Reedy: SpecialLandingCheck: Handle $sub being null [extensions/LandingCheck] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115432 (https://phabricator.wikimedia.org/T385028) [15:30:09] T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952 [15:30:09] T384137: Set up robots.txt in auth.wikimedia.org - https://phabricator.wikimedia.org/T384137 [15:30:09] (03CR) 10Reedy: [C:03+2] SpecialLandingCheck: Handle $sub being null [extensions/LandingCheck] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115432 (https://phabricator.wikimedia.org/T385028) (owner: 10Reedy) [15:30:40] thanks for deploying hashar. 
sorry it took that long D: [15:30:55] MatmaRex: your patch is live yes :) [15:30:57] well both are [15:31:16] and I guess for the auth docroot to be live that requires some change to the Apache config / varnish / etc [15:31:19] (03PS1) 10JMeybohm: kubernetes-publish-sa-cert: Don't fail when no certs in etcd [puppet] - 10https://gerrit.wikimedia.org/r/1115433 (https://phabricator.wikimedia.org/T341984) [15:31:24] but at least you have the basics live now [15:31:33] Reedy: scap is all yours :) [15:31:35] yes, the docroot doesn't do anything by itself [15:31:41] i have a puppet patch scheduled later today [15:31:45] that will make use of it [15:31:48] * hashar rm -fR docroot [15:32:01] !og imported maps-deduped-tilelist 0.0.5+deb12u1 to apt.wikimedia.org for bookworm-wikimedia T381565 [15:32:01] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [15:32:19] (03Abandoned) 10Reedy: MemcachedBagOStuff: Null coalescing $component [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1114448 (https://phabricator.wikimedia.org/T384858) (owner: 10Jforrester) [15:33:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2001.codfw.wmnet with OS bookworm [15:34:00] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10509071 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm [15:34:06] (03CR) 10JMeybohm: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:35:07] (03PS1) 10Clément Goubert: mediawiki: Actually get values from the jobs array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115435 [15:36:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - 
https://phabricator.wikimedia.org/T380083#10509085 (10VRiley-WMF) db1250 A5 U29 CableID 1953 Port 32 db1251 B5 U10 CableID 3795 Port 11 db1252 C3 U28 CableID 4021 Port 15 db1253 D1 U30 CableID 5171 Port 32 db1254 F... [15:36:42] (03CR) 10Hnowlan: [C:03+1] mediawiki: Actually get values from the jobs array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115435 (owner: 10Clément Goubert) [15:37:07] * Lucas_WMDE wants to !bash * hashar rm -fR docroot [15:37:22] (03PS1) 10CDanis: Trace only on k8s. Not (yet?) available on bare metal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115436 (https://phabricator.wikimedia.org/T321211) [15:37:35] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Actually get values from the jobs array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115435 (owner: 10Clément Goubert) [15:38:53] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Gate comment field for mwcron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115425 (owner: 10Clément Goubert) [15:38:54] (03Merged) 10jenkins-bot: MultiUsernameFilter: Don't try to split ids if they're not a string [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1115429 (https://phabricator.wikimedia.org/T385169) (owner: 10Reedy) [15:40:48] (03Merged) 10jenkins-bot: mediawiki: Gate comment field for mwcron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115425 (owner: 10Clément Goubert) [15:40:56] (03Merged) 10jenkins-bot: MultiUsernameFilter: Don't try to split ids if they're not a string [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115430 (https://phabricator.wikimedia.org/T385169) (owner: 10Reedy) [15:40:59] (03Merged) 10jenkins-bot: mediawiki: Actually get values from the jobs array [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115435 (owner: 10Clément Goubert) [15:41:11] (03Merged) 10jenkins-bot: Handle null option value in echomute api [extensions/Echo] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115431 
(https://phabricator.wikimedia.org/T384694) (owner: 10Reedy) [15:41:13] (03Merged) 10jenkins-bot: SpecialLandingCheck: Handle $sub being null [extensions/LandingCheck] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115432 (https://phabricator.wikimedia.org/T385028) (owner: 10Reedy) [15:41:45] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2029.codfw.wmnet [15:41:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti2029.codfw.wmnet [15:42:13] PROBLEM - ganeti-confd running on ganeti2029 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [15:42:13] PROBLEM - ganeti-noded running on ganeti2029 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [15:43:22] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1115396|FancyCaptcha: Return early in passCaptcha in numerous cases (T384858)]], [[gerrit:1115395|FancyCaptcha: Return early in passCaptcha in numerous cases (T384858)]], [[gerrit:1115429|MultiUsernameFilter: Don't try to split ids if they're not a string (T385169)]], [[gerrit:1115430|MultiUsernameFilter: Don't try to split ids if they're not a string (T [15:43:22] 385169)]], [[gerrit:1115431|Handle null option value in echomute api (T384694)]], [[gerrit:1115432|SpecialLandingCheck: Handle $sub being null (T385028)]] [15:43:28] T384858: PHP Deprecated: strtr(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384858 [15:43:28] T385169: PHP Deprecated: preg_split(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T385169 [15:43:29] T384694: TypeError: Argument 1 passed to MediaWiki\Extension\Notifications\Api\ApiEchoMute::parsePref() must be of the type 
string, null given, called in /srv/mediawiki/php-1.44.0-wmf.13/extensions/Echo/includes/Api/ApiEchoMute.php on l - https://phabricator.wikimedia.org/T384694 [15:43:29] T385028: PHP Deprecated: explode(): Passing null to parameter #2 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T385028 [15:44:29] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2029 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1115332 (owner: 10Muehlenhoff) [15:46:19] !log reedy@deploy2002 reedy: Backport for [[gerrit:1115396|FancyCaptcha: Return early in passCaptcha in numerous cases (T384858)]], [[gerrit:1115395|FancyCaptcha: Return early in passCaptcha in numerous cases (T384858)]], [[gerrit:1115429|MultiUsernameFilter: Don't try to split ids if they're not a string (T385169)]], [[gerrit:1115430|MultiUsernameFilter: Don't try to split ids if they're not a string (T385169)]], [[gerri [15:46:19] t:1115431|Handle null option value in echomute api (T384694)]], [[gerrit:1115432|SpecialLandingCheck: Handle $sub being null (T385028)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:46:28] !log reedy@deploy2002 reedy: Continuing with sync [15:47:30] !log installing git security updates [15:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:33] (03CR) 10Volans: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:50:09] (03CR) 10CDanis: [C:04-1] benthos: send data to eventgate too (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [15:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:52:53] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115396|FancyCaptcha: Return early in passCaptcha in numerous cases (T384858)]], [[gerrit:1115395|FancyCaptcha: Return early in passCaptcha in numerous cases (T384858)]], [[gerrit:1115429|MultiUsernameFilter: Don't try to split ids if they're not a string (T385169)]], [[gerrit:1115430|MultiUsernameFilter: Don't try to split ids if they're not a string ( [15:52:53] T385169)]], [[gerrit:1115431|Handle null option value in echomute api (T384694)]], [[gerrit:1115432|SpecialLandingCheck: Handle $sub being null (T385028)]] (duration: 09m 31s) [15:52:59] T384858: PHP Deprecated: strtr(): Passing null to parameter #1 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T384858 [15:53:00] T385169: PHP Deprecated: preg_split(): Passing null to parameter #2 ($subject) of type string is deprecated - https://phabricator.wikimedia.org/T385169 [15:53:00] T384694: TypeError: Argument 1 passed to MediaWiki\Extension\Notifications\Api\ApiEchoMute::parsePref() must be of the type string, null given, called in /srv/mediawiki/php-1.44.0-wmf.13/extensions/Echo/includes/Api/ApiEchoMute.php on l - https://phabricator.wikimedia.org/T384694 [15:53:00] T385028: PHP Deprecated: explode(): Passing null to parameter #2 ($string) of type string is deprecated - https://phabricator.wikimedia.org/T385028 [15:53:12] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@c85b504]: pin confluent kafka to avoid certificate errors [15:53:45] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@c85b504]: pin confluent kafka to avoid certificate errors (duration: 00m 52s) [15:54:30] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [15:55:00] (03CR) 10Vgutierrez: [C:04-1] 
varnish: Upgrade VCL for Varnish 7.0+/modules 0.20 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:56:52] (03CR) 10Arturo Borrero Gonzalez: prometheus-node-kernel-messages: add logic to ignore messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1115391 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [15:57:44] (03CR) 10Kamila Součková: [C:03+1] trafficserver: directly route to citoid on testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1115056 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [15:58:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [15:58:25] (03PS2) 10Alexandros Kosiaris: Trace only on k8s. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115436 (https://phabricator.wikimedia.org/T321211) (owner: 10CDanis) [15:58:31] (03CR) 10Alexandros Kosiaris: [C:03+1] Trace only on k8s. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115436 (https://phabricator.wikimedia.org/T321211) (owner: 10CDanis) [15:58:58] jouncebot: nowandnext [15:58:59] No deployments scheduled for the next 0 hour(s) and 1 minute(s) [15:58:59] In 0 hour(s) and 1 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1600) [15:59:45] (03CR) 10Alexandros Kosiaris: [C:03+1] trafficserver: directly route to citoid on testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1115056 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [16:00:05] jeena and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1600) [16:00:34] cdanis: are you deploying? 
[16:00:57] hnowlan: you can go ahead if you need [16:01:00] I was going to fix the mwdebug* log spam [16:01:10] but haven't started yet [16:01:44] I'm going to (relatively safely) fiddle with the ATS config for citoid - I think we can probably coexist if you're okay with that [16:01:50] 👍 [16:01:53] Reedy: are you done? [16:03:52] (03PS1) 10Alexandros Kosiaris: Add .gitmessage in the repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115443 [16:04:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115436 (https://phabricator.wikimedia.org/T321211) (owner: 10CDanis) [16:05:33] (03Merged) 10jenkins-bot: Trace only on k8s. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115436 (https://phabricator.wikimedia.org/T321211) (owner: 10CDanis) [16:06:01] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1115436|Trace only on k8s. (T321211 T340552 T385037)]] [16:06:11] T321211: distributed tracing v1: tech debt blockers - https://phabricator.wikimedia.org/T321211 [16:06:11] T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552 [16:06:12] T385037: mwdebug dashboard on logstash is full of "Failed to connect to exporter" messages (tracing channel) since 7 January - https://phabricator.wikimedia.org/T385037 [16:08:26] (03PS1) 10AOkoth: os_updates: open up rsync to staging kube pods [puppet] - 10https://gerrit.wikimedia.org/r/1115444 (https://phabricator.wikimedia.org/T350794) [16:09:22] (03CR) 10Hnowlan: [C:03+2] trafficserver: directly route to citoid on testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1115056 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [16:09:35] cdanis: Yeah, sorry didn't see the ping [16:09:42] no worries :D [16:09:45] !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1115436|Trace only on k8s. 
(T321211 T340552 T385037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:10:03] 06SRE, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802#10509243 (10cmooney) 05Open→03Resolved a:03cmooney [16:10:54] !log cdanis@deploy2002 cdanis: Continuing with sync [16:13:45] (03CR) 10Alexandros Kosiaris: [C:03+2] Add .gitmessage in the repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115443 (owner: 10Alexandros Kosiaris) [16:14:32] (03Merged) 10jenkins-bot: Add .gitmessage in the repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115443 (owner: 10Alexandros Kosiaris) [16:17:57] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115436|Trace only on k8s. (T321211 T340552 T385037)]] (duration: 11m 55s) [16:18:04] T321211: distributed tracing v1: tech debt blockers - https://phabricator.wikimedia.org/T321211 [16:18:04] T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552 [16:18:05] T385037: mwdebug dashboard on logstash is full of "Failed to connect to exporter" messages (tracing channel) since 7 January - https://phabricator.wikimedia.org/T385037 [16:20:10] (03CR) 10Volans: [C:03+1] "LGTM, I trust your tests" [cookbooks] - 10https://gerrit.wikimedia.org/r/1115109 (https://phabricator.wikimedia.org/T379549) (owner: 10Cathal Mooney) [16:20:27] (03CR) 10Volans: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1115134 (https://phabricator.wikimedia.org/T379549) (owner: 10Cathal Mooney) [16:21:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2001.codfw.wmnet with OS bookworm [16:22:03] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10509277 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm completed: - maps-test2001 (**PASS**)... [16:22:13] !log repool ms-fe1014 T384317 [16:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:18] T384317: ms-fe1014 hardware fault (may need new disk controller?) - https://phabricator.wikimedia.org/T384317 [16:30:10] (03CR) 10Anzx: [C:03+1] "looks good, but mailmap changes are unrelated to draft changes it would nice if new patch created that change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) (owner: 10Revi) [16:33:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T384592)', diff saved to https://phabricator.wikimedia.org/P72885 and previous config saved to /var/cache/conftool/dbconfig/20250130-163321-marostegui.json [16:33:26] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [16:34:17] (03CR) 10Revi: "I do not consider mailmap change to be significant enough to warrant its own CL. 
If SWAT deployer disagrees I can do that, but otherwise I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115377 (https://phabricator.wikimedia.org/T385162) (owner: 10Revi) [16:35:22] (03PS1) 10Clément Goubert: mediawiki: Various fixes for mwcron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115453 [16:37:14] (03PS2) 10Jcrespo: dbbackups: Prepare for decommission of db2139 [puppet] - 10https://gerrit.wikimedia.org/r/1115370 (https://phabricator.wikimedia.org/T383971) [16:40:19] (03CR) 10Ottomata: benthos: send data to eventgate too (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1115113 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [16:40:42] (03PS1) 10CDanis: otelcol: traces: attach useful pod labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115454 (https://phabricator.wikimedia.org/T320549) [16:46:38] (03CR) 10Effie Mouzeli: [C:03+1] "thank you! one nit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115454 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [16:46:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10509331 (10Papaul) [16:47:08] (03CR) 10Jforrester: ":-(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115443 (owner: 10Alexandros Kosiaris) [16:47:16] (03CR) 10Jcrespo: [C:03+2] dbbackups: Prepare for decommission of db2139 [puppet] - 10https://gerrit.wikimedia.org/r/1115370 (https://phabricator.wikimedia.org/T383971) (owner: 10Jcrespo) [16:48:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P72887 and previous config saved to /var/cache/conftool/dbconfig/20250130-164828-marostegui.json [16:49:10] 06SRE, 06Infrastructure-Foundations, 10netops: Manage fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10509343 (10cmooney) [16:49:53] (03PS2) 10Krinkle: 
mediawiki.org/beacon/event - don't raise error on failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115111 (https://phabricator.wikimedia.org/T383939) (owner: 10Ottomata) [16:50:19] (03CR) 10Krinkle: "fixed missing line break in commit message which had previously but bug references in the commit msg body instead of footer metadata." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115111 (https://phabricator.wikimedia.org/T383939) (owner: 10Ottomata) [16:50:30] (03PS2) 10CDanis: otelcol: traces: attach useful pod labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115454 (https://phabricator.wikimedia.org/T320549) [16:50:31] (03CR) 10CDanis: otelcol: traces: attach useful pod labels (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115454 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [16:50:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10509374 (10Papaul) [16:50:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10509383 (10Papaul) [16:51:37] 06SRE, 06Infrastructure-Foundations, 10netops: Manage fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10509388 (10cmooney) 05Open→03Resolved This is now largely complete. We have decided to model the switch<->server links in Netbox (with dummy names 'PRIMARY_A' a... 
[16:51:48] (03PS1) 10Joal: Update webrequest_sampled_live turnilo config [puppet] - 10https://gerrit.wikimedia.org/r/1115457 (https://phabricator.wikimedia.org/T383900) [16:53:40] (03CR) 10Kamila Součková: [C:03+1] mediawiki: Various fixes for mwcron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115453 (owner: 10Clément Goubert) [16:56:38] (03CR) 10Clément Goubert: [C:03+1] "+1 since we've tested it and know it works, but I'd like @jmeybohm@wikimedia.org's opinion on if it's the right way to do this." [puppet] - 10https://gerrit.wikimedia.org/r/1115416 (https://phabricator.wikimedia.org/T384329) (owner: 10Brouberol) [16:56:42] (03PS2) 10Arturo Borrero Gonzalez: prometheus-node-kernel-messages: add logic to ignore messages [puppet] - 10https://gerrit.wikimedia.org/r/1115391 (https://phabricator.wikimedia.org/T380960) [16:57:23] (03CR) 10Ottomata: "TY" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115111 (https://phabricator.wikimedia.org/T383939) (owner: 10Ottomata) [16:57:41] (03CR) 10CDanis: [C:03+2] otelcol: traces: attach useful pod labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115454 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [16:57:43] (03CR) 10Alexandros Kosiaris: [C:03+1] "LGTM as well" [puppet] - 10https://gerrit.wikimedia.org/r/1084247 (owner: 10Scott French) [16:57:57] (03PS1) 10JMeybohm: wikikube-staging-codfw: Disable PodSecurityPolicies [puppet] - 10https://gerrit.wikimedia.org/r/1115459 (https://phabricator.wikimedia.org/T341984) [16:58:20] (03CR) 10CI reject: [V:04-1] wikikube-staging-codfw: Disable PodSecurityPolicies [puppet] - 10https://gerrit.wikimedia.org/r/1115459 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1700). 
[17:00:05] MatmaRex: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:20] hi [17:01:00] i tested this patch on the beta cluster, but the production config is a bit different, so review would be appreciated. i considered splitting the beta and prod parts into separate patches, let me know if that'd help [17:01:52] (03Merged) 10jenkins-bot: otelcol: traces: attach useful pod labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115454 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis) [17:02:20] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:03:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P72888 and previous config saved to /var/cache/conftool/dbconfig/20250130-170334-marostegui.json [17:03:43] MatmaRex: hi, sorry, apache changes like this are too complex to merge in the puppet window -- there are times when I've been able to do it anyway but I won't be able to do that today [17:03:57] (I'll also be rotating off the puppet window in a few months and I don't want to set unfair expectations for whoever takes it over) [17:04:22] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:04:35] you should work with an SRE from my team in Service Ops to get a careful review that lasts longer than 30 minutes, probably including httpbb tests [17:05:04] alright. how do i do that? 
[17:05:50] you might ask in #wikimedia-serviceops, but if you file a task with the #serviceops tag we'll definitely see it [17:05:50] (03CR) 10FNegri: prometheus-node-kernel-messages: add logic to ignore messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1115391 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [17:06:13] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [17:06:23] (the tests are documented at https://wikitech.wikimedia.org/wiki/Httpbb if you want to take an early look in the meantime) [17:06:31] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:06:57] normally I'd be happy to take a look but I'm swamped the next few days and not sure when I'll be able to -- you might get a quicker turnaround from someone else [17:07:01] (03PS1) 10Andrew Bogott: nova-compute: update live_migration_uri to use private cloud network [puppet] - 10https://gerrit.wikimedia.org/r/1115461 [17:07:19] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1115461 (owner: 10Andrew Bogott) [17:07:37] sure [17:07:55] (03PS2) 10JMeybohm: wikikube-staging-codfw: Disable PodSecurityPolicies [puppet] - 10https://gerrit.wikimedia.org/r/1115459 (https://phabricator.wikimedia.org/T341984) [17:08:18] (03CR) 10CI reject: [V:04-1] wikikube-staging-codfw: Disable PodSecurityPolicies [puppet] - 10https://gerrit.wikimedia.org/r/1115459 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:08:30] (03PS2) 10Andrew Bogott: nova-compute: update live_migration_uri to use private cloud network [puppet] - 10https://gerrit.wikimedia.org/r/1115461 (https://phabricator.wikimedia.org/T355145) [17:09:10] 10ops-magru, 06DC-Ops: Power supply failure (PSU) for cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10509503 (10RobH) 05Open→03Resolved New PSU swapped in, resolving task. ` The power supplies are redundant. 
Thu Jan 30 2025 16:56:13 ` [17:09:22] !log upgrade, restart and rebuild tables of db2202 T376905 [17:10:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:11:06] (03PS3) 10JMeybohm: wikikube-staging-codfw: Disable PodSecurityPolicies [puppet] - 10https://gerrit.wikimedia.org/r/1115459 (https://phabricator.wikimedia.org/T341984) [17:11:30] (03CR) 10CI reject: [V:04-1] wikikube-staging-codfw: Disable PodSecurityPolicies [puppet] - 10https://gerrit.wikimedia.org/r/1115459 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:12:14] (03CR) 10Andrew Bogott: [C:04-1] "This needs corresponding cert changes; right now we use puppet certs that don't use the .private name" [puppet] - 10https://gerrit.wikimedia.org/r/1115461 (https://phabricator.wikimedia.org/T355145) (owner: 10Andrew Bogott) [17:12:43] (03PS4) 10JMeybohm: wikikube-staging-codfw: Disable PodSecurityPolicies [puppet] - 10https://gerrit.wikimedia.org/r/1115459 (https://phabricator.wikimedia.org/T341984) [17:13:35] (03PS5) 10JMeybohm: wikikube-staging-codfw: Disable PodSecurityPolicies [puppet] - 10https://gerrit.wikimedia.org/r/1115459 (https://phabricator.wikimedia.org/T341984) [17:14:09] (03CR) 10JMeybohm: [C:03+2] wikikube-staging-codfw: Disable PodSecurityPolicies [puppet] - 10https://gerrit.wikimedia.org/r/1115459 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:14:13] MatmaRex: if you did want to split the patch into beta and prod, I'm happy to merge the beta half "at your own risk" for whatever testing you want to do -- the prod apache config is what I want to be more careful with [17:14:25] 06SRE, 06DC-Ops: Update phabricator templates/instructions for fundraising server provosion - https://phabricator.wikimedia.org/T385208 (10cmooney) 03NEW 
p:05Triage→03Low [17:15:01] 06SRE, 06DC-Ops: Update phabricator templates/instructions for fundraising server provosion - https://phabricator.wikimedia.org/T385208#10509533 (10cmooney) [17:15:28] rzl: i can cherry-pick it on beta, so that wouldn't make much of a difference for me, and it's easier to keep it as one commit [17:15:28] jynus: got your change in puppet-merge, ok to proceed? [17:15:38] dbbackups: Prepare for decommission of db2139 (25e0658ac7) [17:15:55] thanks for the advice [17:15:57] oh, yes [17:16:00] sorry, I forgot [17:16:00] ack [17:16:01] i will get in touch [17:16:03] got distracted [17:16:03] np [17:16:14] done [17:16:18] 06SRE, 06DC-Ops: Update phabricator templates/instructions for fundraising server provosion - https://phabricator.wikimedia.org/T385208#10509536 (10cmooney) [17:16:24] MatmaRex: 👍 [17:16:35] thanks, jayme [17:16:59] 06SRE, 06DC-Ops: Update phabricator templates/instructions for fundraising server provosion - https://phabricator.wikimedia.org/T385208#10509542 (10cmooney) [17:17:36] (03CR) 10FNegri: [C:04-1] "My performance concern was invalid (see comment inline)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1115391 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [17:18:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T384592)', diff saved to https://phabricator.wikimedia.org/P72889 and previous config saved to /var/cache/conftool/dbconfig/20250130-171841-marostegui.json [17:18:48] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [17:18:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2227.codfw.wmnet with reason: Maintenance [17:19:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T384592)', diff saved to https://phabricator.wikimedia.org/P72890 and previous config saved to /var/cache/conftool/dbconfig/20250130-171903-marostegui.json [17:19:58] (03CR) 10JHathaway: [C:03+1] puppetmaster: remove use of deprecated method in logstash.rb [puppet] - 10https://gerrit.wikimedia.org/r/1115124 (https://phabricator.wikimedia.org/T385058) (owner: 10Cwhite) [17:20:12] !log staging-codfw k8s cluster is currently being updated to k8s 1.31 and in an unusable state - T384450 [17:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:18] T384450: Update wikikube-staging-codfw to kubernetes 1.31 - https://phabricator.wikimedia.org/T384450 [17:20:42] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:21:01] 06SRE, 06DC-Ops: Update phabricator templates/instructions for fundraising server provision - https://phabricator.wikimedia.org/T385208#10509559 (10cmooney) [17:22:43] (03CR) 10Filippo Giunchedi: [C:03+1] Update webrequest_sampled_live turnilo config [puppet] - 
10https://gerrit.wikimedia.org/r/1115457 (https://phabricator.wikimedia.org/T383900) (owner: 10Joal) [17:22:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10509567 (10Papaul) [17:23:11] (03CR) 10FNegri: [C:04-1] prometheus-node-kernel-messages: add logic to ignore messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1115391 (https://phabricator.wikimedia.org/T380960) (owner: 10Arturo Borrero Gonzalez) [17:26:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10509583 (10Papaul) [17:28:25] (03CR) 10JHathaway: [C:03+1] elastic: 15 refresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1115122 (https://phabricator.wikimedia.org/T384966) (owner: 10Ryan Kemper) [17:31:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10509631 (10Papaul) [17:32:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10509647 (10Papaul) [17:37:58] (03PS1) 10Ladsgroup: file: Remove from filerevision when only one row exists [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115466 (https://phabricator.wikimedia.org/T384481) [17:38:14] jouncebot: nowandnext [17:38:14] For the next 0 hour(s) and 21 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1700) [17:38:14] In 0 hour(s) and 21 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1800) [17:38:15] In 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1800) [17:38:30] (03CR) 10Ladsgroup: [C:03+2] file: Remove from filerevision 
when only one row exists [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115466 (https://phabricator.wikimedia.org/T384481) (owner: 10Ladsgroup) [17:40:09] (03CR) 10BCornwall: [C:03+1] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1115329 (https://phabricator.wikimedia.org/T385148) (owner: 10Gerrit maintenance bot) [17:51:42] (03Merged) 10jenkins-bot: file: Remove from filerevision when only one row exists [core] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115466 (https://phabricator.wikimedia.org/T384481) (owner: 10Ladsgroup) [17:52:07] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10509771 (10cmooney) [17:52:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10509779 (10cmooney) [17:53:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10509783 (10cmooney) [17:54:26] (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-01-30-121819-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115469 [17:55:11] (03CR) 10Ladsgroup: "This isn't deployed yet and is causing mw-config to complain. I'm deploying it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115443 (owner: 10Alexandros Kosiaris) [17:55:53] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1115466|file: Remove from filerevision when only one row exists (T384481)]] [17:55:58] T384481: Set new file tables to write both in production - https://phabricator.wikimedia.org/T384481 [17:56:22] PROBLEM - nova-compute proc minimum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:56:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10509812 (10cmooney) [17:56:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10509815 (10cmooney) [17:57:22] RECOVERY - nova-compute proc minimum on cloudvirt1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:58:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10509830 (10cmooney) [17:58:58] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1115466|file: Remove from filerevision when only one row exists (T384481)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:59:32] 06SRE, 06DC-Ops: Update phabricator templates/instructions for fundraising server provision - https://phabricator.wikimedia.org/T385208#10509833 (10RobH) I think that creating a second template for frack racking is ideal, but wanted to think about it a day or so before taking action to ensure I don't think of... 
[18:00:05] bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1800) [18:00:10] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [18:03:48] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-01-30-121819-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115469 (owner: 10BryanDavis) [18:04:57] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-01-30-121819-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115469 (owner: 10BryanDavis) [18:06:42] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115466|file: Remove from filerevision when only one row exists (T384481)]] (duration: 10m 48s) [18:06:47] T384481: Set new file tables to write both in production - https://phabricator.wikimedia.org/T384481 [18:07:41] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:08:28] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:09:56] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:10:26] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:10:31] (03CR) 10Bartosz Dziewoński: "I scheduled this for the Puppet window today, but it wasn't deployed:" [puppet] - 10https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [18:10:34] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:10:51] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply 
[18:14:39] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T385096#10509882 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated ps2 cable. stable. [18:15:27] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T385078#10509886 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:21:56] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220 (10Isaac) 03NEW [18:29:31] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10509954 (10Isaac) @YLiou_WMF please add your public SSH key where it says `TODO Yu-Ming add here` in the task description. This documentatio... [18:29:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10509958 (10phaultfinder) [18:44:36] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:45:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10510034 (10Papaul) [18:46:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10510038 (10Papaul) [18:46:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10510039 (10Papaul) [18:48:04] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entries for frdb1007,fran1002 and frban1002 - pt1979@cumin2002" [18:49:09] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - 
https://phabricator.wikimedia.org/T385220#10510047 (10YLiou_WMF) [18:52:39] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10510056 (10YLiou_WMF) [18:58:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T384592)', diff saved to https://phabricator.wikimedia.org/P72891 and previous config saved to /var/cache/conftool/dbconfig/20250130-185833-marostegui.json [18:58:39] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [18:59:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10510080 (10Jhancock.wm) I have the two Dell Poweredge R 440 servers set aside when we are ready to rack them. they have 10G car... [19:00:04] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T1900) [19:00:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10510086 (10Jhancock.wm) also forgot to mention we have one spare SFP-100G-LR4 we can test with [19:02:33] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115490 (https://phabricator.wikimedia.org/T382365) [19:02:34] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115490 (https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot) [19:03:21] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115490 (https://phabricator.wikimedia.org/T382365) (owner: 10TrainBranchBot) [19:11:59] !log pt1979@cumin2002 START 
- Cookbook sre.dns.netbox [19:12:11] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.14 refs T382365 [19:12:16] T382365: 1.44.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T382365 [19:12:49] (03CR) 10BCornwall: varnish: Fix claim obj.hits isn't known in vcl_hit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113591 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:13:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P72892 and previous config saved to /var/cache/conftool/dbconfig/20250130-191340-marostegui.json [19:14:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:19:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10510145 (10VRiley-WMF) [19:28:39] (03PS1) 10Bartosz Dziewoński: Update and sync 404 error handler pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115492 [19:28:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P72893 and previous config saved to /var/cache/conftool/dbconfig/20250130-192847-marostegui.json [19:32:17] (03PS2) 10Bartosz Dziewoński: Update and sync 404 error handler pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115492 [19:33:25] (03PS6) 10Bartosz Dziewoński: Use new 'auth' docroot for the auth domain [puppet] - 10https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) [19:34:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115492 (owner: 10Bartosz Dziewoński) [19:36:13] (03PS8) 10BCornwall: varnish: Upgrade 
VCL for Varnish 7.0+/modules 0.20 [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) [19:36:57] (03CR) 10BCornwall: varnish: Fix claim obj.hits isn't known in vcl_hit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113591 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:37:03] (03Abandoned) 10BCornwall: varnish: Fix claim obj.hits isn't known in vcl_hit [puppet] - 10https://gerrit.wikimedia.org/r/1113591 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:39:58] (03PS9) 10BCornwall: varnish: Upgrade VCL for Varnish 7.0+/modules 0.20 [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) [19:40:00] (03CR) 10BCornwall: varnish: Upgrade VCL for Varnish 7.0+/modules 0.20 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:42:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10510197 (10VRiley-WMF) [19:43:06] (03PS3) 10Bartosz Dziewoński: Update and sync 404 error handler pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115492 [19:43:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T384592)', diff saved to https://phabricator.wikimedia.org/P72894 and previous config saved to /var/cache/conftool/dbconfig/20250130-194354-marostegui.json [19:44:04] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [19:44:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2239.codfw.wmnet with reason: Maintenance [19:44:43] (03PS1) 10Andrew Bogott: nova: temporary mod of live_migration rules [puppet] - 10https://gerrit.wikimedia.org/r/1115493 [19:45:39] (03CR) 10Andrew Bogott: [C:03+2] nova: temporary mod of live_migration 
rules [puppet] - 10https://gerrit.wikimedia.org/r/1115493 (owner: 10Andrew Bogott) [19:48:00] (03PS4) 10Bartosz Dziewoński: Update and sync 404 error handler pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115492 (https://phabricator.wikimedia.org/T383952) [19:48:27] 06SRE, 06Infrastructure-Foundations, 06Traffic: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10510204 (10BCornwall) a:05BCornwall→03None [19:48:42] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4908/co" [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:49:59] (03CR) 10Gergő Tisza: [C:03+1] Update and sync 404 error handler pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115492 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [19:52:07] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:55:58] (03PS2) 10BCornwall: Varnish: Upgrade test container to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1115123 [19:57:36] (03CR) 10Ssingh: [C:03+1] "Good idea!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1115123 (owner: 10BCornwall) [20:00:14] (03CR) 10Vgutierrez: "(dropping my -1 as I'll be OoO till Tuesday and I can't review it properly now)" [puppet] - 10https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:02:03] (03PS1) 10Dwisehaupt: Add aws validation key for lp.email ssl cert generation [dns] - 10https://gerrit.wikimedia.org/r/1115494 (https://phabricator.wikimedia.org/T384931) [20:02:21] rzl: FYI, i added httpbb tests and filed https://phabricator.wikimedia.org/T385228 . thanks for the advice earlier [20:05:57] 06SRE, 06Infrastructure-Foundations: Use FIDO2 ssh keys for prodcution access - https://phabricator.wikimedia.org/T385229 (10cmooney) 03NEW p:05Triage→03Low [20:06:12] (03CR) 10BCornwall: [C:03+2] Varnish: Upgrade test container to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1115123 (owner: 10BCornwall) [20:07:14] (03CR) 10Bartosz Dziewoński: "Added httpbb tests, filed T385228 for the review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1115104 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [20:12:25] 06SRE, 06Infrastructure-Foundations: Use FIDO2 ssh keys for prodcution access - https://phabricator.wikimedia.org/T385229#10510266 (10cmooney) [20:16:00] (03PS1) 10Cathal Mooney: Add FIDO2-based ssh keys for user cmooney [puppet] - 10https://gerrit.wikimedia.org/r/1115495 (https://phabricator.wikimedia.org/T385229) [20:23:57] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Use FIDO2 ssh keys for prodcution access - https://phabricator.wikimedia.org/T385229#10510305 (10cmooney) [20:24:00] !log Backing up Grafana DB on grafana1002 - T384840 [20:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:05] T384840: Unable to edit/delete Grafana alert - https://phabricator.wikimedia.org/T384840 [20:26:41] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:27:51] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [20:27:55] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [20:28:44] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Use FIDO2 ssh keys for prodcution access - https://phabricator.wikimedia.org/T385229#10510315 (10cmooney) [20:29:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:29:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10510320 (10phaultfinder) [20:32:07] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:33:48] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 
is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:42:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10510353 (10Papaul) [20:43:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10510355 (10Papaul) a:03Jgreen @Jgreen all your's [20:43:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10510357 (10Papaul) [20:44:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10510359 (10Papaul) a:03Jgreen @Jgreen all your's [20:44:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10510362 (10Papaul) [20:45:01] (03CR) 10CDanis: [C:03+2] chart-renderer: new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113846 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis) [20:45:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10510364 (10Papaul) a:03Jgreen @Jgreen all your's [20:46:47] (03Merged) 10jenkins-bot: chart-renderer: new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113846 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis) [20:47:06] !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [20:47:52] !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [20:48:37] (03CR) 10Sergio Gimeno: "Let's stick with beta for the Sprinthackular demo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115074 
(https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno) [20:53:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [20:56:27] (03CR) 10BCornwall: Add aws validation key for lp.email ssl cert generation (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1115494 (https://phabricator.wikimedia.org/T384931) (owner: 10Dwisehaupt) [20:56:56] FIRING: CalicoTyphaDown: Too few (0) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [20:56:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [20:59:27] (03PS1) 10CDanis: Revert "chart-renderer: new release (now w/ ECS)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115502 [20:59:45] (03PS2) 10CDanis: Revert "chart-renderer: new release (now w/ ECS)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115502 (https://phabricator.wikimedia.org/T383748) [20:59:51] (03CR) 10CDanis: [C:03+2] Revert "chart-renderer: new release (now w/ ECS)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115502 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T2100). [21:00:05] sergi0, tgr, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:18] hi [21:00:24] hello [21:01:07] (03Merged) 10jenkins-bot: Revert "chart-renderer: new release (now w/ ECS)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115502 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis) [21:01:42] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:02:38] o/ [21:03:53] !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [21:03:57] hi ! i can deploy [21:04:14] unless some of patches in the queue can be self-deployed? [21:04:19] !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [21:04:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10510391 (10Papaul) [21:05:01] I can self-deploy mine [21:05:17] sergi0: go for it - lmk when you're done [21:05:27] sure [21:05:52] I can also self-deploy [21:05:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115074 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno) [21:06:23] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entries for db1252,db1253,db1254 - pt1979@cumin2002" [21:06:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10510394 (10RLazarus) p:05Triage→03Medium a:03RLazarus [21:06:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entries for db1252,db1253,db1254 - pt1979@cumin2002" [21:06:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:06:42] (03Merged) 10jenkins-bot: beta wgEventStreams: opt out collecting user agent for 
HelpPanel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115074 (https://phabricator.wikimedia.org/T382173) (owner: 10Sergio Gimeno) [21:07:19] done, that was quick [21:07:43] config patches 😌 [21:08:13] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10510401 (10RLazarus) Hi Yu-Ming and Isaac, I'll take care of this from the SRE side. Just to document for posterity, I see that Yu-Ming's m... [21:08:51] cool - ya - labs only is quick [21:09:17] tgr: do you want to self-deploy or would you like me to do it? [21:09:25] will do [21:09:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115369 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [21:09:43] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host db1252.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:10:23] (03Merged) 10jenkins-bot: Do not disable extensions on SUL3 shared authentication domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115369 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [21:10:42] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1115369|Do not disable extensions on SUL3 shared authentication domain (T373737 T384919 T384236)]] [21:10:49] T373737: Disable irrelevant extensions on SUL3 login domain - https://phabricator.wikimedia.org/T373737 [21:10:50] T384919: SUL3 prevents GrowthExperiments from accessing CommunityConfiguration on testwiki - https://phabricator.wikimedia.org/T384919 [21:10:50] T384236: [regression] Accounts created at test wikis are not receiving Growth features - https://phabricator.wikimedia.org/T384236 [21:13:29] !log tgr@deploy2002 tgr: Backport for [[gerrit:1115369|Do not disable extensions on SUL3 shared authentication 
domain (T373737 T384919 T384236)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:13:43] (03CR) 10Cathal Mooney: [C:03+2] gNMIc: Add BGP stats collection for network devices [puppet] - 10https://gerrit.wikimedia.org/r/1115002 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [21:17:18] !log tgr@deploy2002 tgr: Continuing with sync [21:21:31] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10510440 (10RLazarus) Yu-Ming: It looks like you have two developer accounts, [[ https://ldap.toolforge.org/user/yliou | yliou ]] and [[ http... [21:23:24] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115369|Do not disable extensions on SUL3 shared authentication domain (T373737 T384919 T384236)]] (duration: 12m 42s) [21:23:32] T373737: Disable irrelevant extensions on SUL3 login domain - https://phabricator.wikimedia.org/T373737 [21:23:32] T384919: SUL3 prevents GrowthExperiments from accessing CommunityConfiguration on testwiki - https://phabricator.wikimedia.org/T384919 [21:23:32] T384236: [regression] Accounts created at test wikis are not receiving Growth features - https://phabricator.wikimedia.org/T384236 [21:24:32] cjming: done [21:24:53] thanks! [21:25:02] MatmaRex: do you need a deployer? 
[21:25:32] cjming: yes please [21:25:40] alrighty [21:25:48] (03PS5) 10Bartosz Dziewoński: Update and sync 404 error handler pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115492 (https://phabricator.wikimedia.org/T383952) [21:28:05] (03PS1) 10Andrew Bogott: Revert "nova: temporary mod of live_migration rules" [puppet] - 10https://gerrit.wikimedia.org/r/1115507 [21:29:31] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:32:51] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host db1253.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:33:48] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:35:06] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10510486 (10YLiou_WMF) yes that makes sense to me! [21:37:35] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10510488 (10Papaul) [21:38:04] (03PS1) 10RLazarus: admin: Add SSH and Kerberos access for yliou [puppet] - 10https://gerrit.wikimedia.org/r/1115510 (https://phabricator.wikimedia.org/T385220) [21:38:14] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10510491 (10Papaul) [21:38:17] cjming: are you deploying? 
[21:38:54] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10510495 (10Peachey88) Side-note: It would probably be ideal to sort out the two accounts with the same email now, as y... [21:39:03] yes ! sorry - got distracted [21:39:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115492 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [21:39:59] (03Merged) 10jenkins-bot: Update and sync 404 error handler pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115492 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [21:40:18] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1115492|Update and sync 404 error handler pages (T383952)]] [21:40:19] cjming: no problem, just wanted to know if i should just wait or do something :) [21:40:24] T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952 [21:40:40] (03CR) 10Scott French: [C:03+1] admin: Add SSH and Kerberos access for yliou [puppet] - 10https://gerrit.wikimedia.org/r/1115510 (https://phabricator.wikimedia.org/T385220) (owner: 10RLazarus) [21:41:06] (03CR) 10RLazarus: [C:03+2] admin: Add SSH and Kerberos access for yliou [puppet] - 10https://gerrit.wikimedia.org/r/1115510 (https://phabricator.wikimedia.org/T385220) (owner: 10RLazarus) [21:43:37] MatmaRex: on test servers if testable [21:43:39] !log cjming@deploy2002 cjming, matmarex: Backport for [[gerrit:1115492|Update and sync 404 error handler pages (T383952)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:44:14] cjming: looking [21:44:57] cjming: all good [21:45:11] !log cjming@deploy2002 cjming, matmarex: Continuing with sync [21:49:23] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: 
Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10510517 (10RLazarus) 05Open→03Resolved - Your SSH access is configured in Puppet. Give it 30 minutes for the c... [21:51:18] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115492|Update and sync 404 error handler pages (T383952)]] (duration: 10m 59s) [21:51:23] T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952 [21:51:31] MatmaRex: should be live! [21:51:44] thanks cjming [21:51:47] yw :) [21:53:30] FIRING: Emergency syslog message: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [21:56:06] !log end of UTC late backport window [21:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:36] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [21:57:41] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [21:58:30] FIRING: [2x] Emergency syslog message: Device asw1-b12-drmrs.mgmt.drmrs.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250130T2200) [22:00:19] (03Abandoned) 10CDanis: allow k8s service-runner apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 (owner: 10CDanis) [22:03:30] RESOLVED: Emergency syslog message: Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [22:19:04] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
db1252.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:19:35] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host db1252.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:20:30] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1253.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:20:48] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host db1253.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:24:18] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Use FIDO2 ssh keys for production access - https://phabricator.wikimedia.org/T385229#10510600 (10Nemoralis) [22:31:20] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1252.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:33:59] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow2003.codfw.wmnet with reason: disabling gnmic in systemd [22:35:00] (03PS1) 10Brouberol: Update the airflow deb version to downgrade confluent-kafka [puppet] - 10https://gerrit.wikimedia.org/r/1115515 [22:35:24] (03PS2) 10Brouberol: Update the airflow deb version to downgrade confluent-kafka [puppet] - 10https://gerrit.wikimedia.org/r/1115515 [22:36:57] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4912/co" [puppet] - 10https://gerrit.wikimedia.org/r/1115515 (owner: 10Brouberol) [22:40:42] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:43:54] (03PS1) 10Andrew Bogott: cinder policy.conf: change get_pools to admin only [puppet] - 10https://gerrit.wikimedia.org/r/1115516 [22:44:37] (03PS2) 10Andrew Bogott: cinder policy.conf: change get_pools to admin only [puppet] - 10https://gerrit.wikimedia.org/r/1115516 [22:44:40] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1115516 (owner: 10Andrew Bogott) [22:45:06] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10510616 (10cmooney) @fgiunchedi perhaps you might know a way to do this. We now have stats like this in Prometheu... [22:45:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:46:29] (03CR) 10Dwisehaupt: Add aws validation key for lp.email ssl cert generation (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1115494 (https://phabricator.wikimedia.org/T384931) (owner: 10Dwisehaupt) [22:48:02] (03CR) 10Andrew Bogott: [C:03+2] cinder policy.conf: change get_pools to admin only [puppet] - 10https://gerrit.wikimedia.org/r/1115516 (owner: 10Andrew Bogott) [23:06:07] (03PS3) 10Urbanecm: [testwiki] Babel: Enable CommunityConfiguration integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114002 (https://phabricator.wikimedia.org/T374348) [23:06:11] (03CR) 10Urbanecm: [C:03+2] [testwiki] Babel: Enable CommunityConfiguration integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114002 (https://phabricator.wikimedia.org/T374348) (owner: 10Urbanecm) [23:06:55] (03Merged) 10jenkins-bot: [testwiki] Babel: Enable CommunityConfiguration integration 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114002 (https://phabricator.wikimedia.org/T374348) (owner: 10Urbanecm) [23:07:39] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1114002|[testwiki] Babel: Enable CommunityConfiguration integration (T374348)]] [23:07:44] T374348: Switch BabelUseCommunityConfiguration to true on Wikimedia sites - https://phabricator.wikimedia.org/T374348 [23:10:23] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1114002|[testwiki] Babel: Enable CommunityConfiguration integration (T374348)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:17:07] !log urbanecm@deploy2002 urbanecm: Continuing with sync [23:23:11] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1114002|[testwiki] Babel: Enable CommunityConfiguration integration (T374348)]] (duration: 15m 32s) [23:23:16] T374348: Switch BabelUseCommunityConfiguration to true on Wikimedia sites - https://phabricator.wikimedia.org/T374348 [23:23:23] (03PS2) 10Urbanecm: Babel: Enable CommunityConfiguration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115403 (https://phabricator.wikimedia.org/T374348) [23:23:31] (03CR) 10Urbanecm: [C:03+2] Babel: Enable CommunityConfiguration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115403 (https://phabricator.wikimedia.org/T374348) (owner: 10Urbanecm) [23:24:14] (03Merged) 10jenkins-bot: Babel: Enable CommunityConfiguration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115403 (https://phabricator.wikimedia.org/T374348) (owner: 10Urbanecm) [23:25:07] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1253.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:25:32] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host db1253.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED [23:26:17] 06SRE, 10MW-on-K8s, 06serviceops: mwscript-k8s does not support short maintenance script names - https://phabricator.wikimedia.org/T385238 (10Urbanecm_WMF) 03NEW [23:27:57] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1115403|Babel: Enable CommunityConfiguration on all wikis (T374348)]] [23:28:31] (03PS2) 10Urbanecm: CommunityConfiguration: Enable on all wikis except locked down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115404 (https://phabricator.wikimedia.org/T383910) [23:28:35] !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host db1252.eqiad.wmnet with OS bookworm [23:28:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db125[0-4] - https://phabricator.wikimedia.org/T380083#10510700 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host db1252.eqiad.wmnet with OS bookworm [23:30:37] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1115403|Babel: Enable CommunityConfiguration on all wikis (T374348)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:30:42] T374348: Switch BabelUseCommunityConfiguration to true on Wikimedia sites - https://phabricator.wikimedia.org/T374348 [23:32:28] (03CR) 10BCornwall: [C:03+1] Add aws validation key for lp.email ssl cert generation (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1115494 (https://phabricator.wikimedia.org/T384931) (owner: 10Dwisehaupt) [23:34:58] !log urbanecm@deploy2002 urbanecm: Continuing with sync [23:41:01] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1253.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:41:02] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115403|Babel: Enable CommunityConfiguration on all wikis (T374348)]] (duration: 13m 04s) [23:41:10] T374348: Switch BabelUseCommunityConfiguration 
to true on Wikimedia sites - https://phabricator.wikimedia.org/T374348 [23:43:08] (03CR) 10Urbanecm: [C:03+2] CommunityConfiguration: Enable on all wikis except locked down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115404 (https://phabricator.wikimedia.org/T383910) (owner: 10Urbanecm) [23:43:25] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host db1253.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:43:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115404 (https://phabricator.wikimedia.org/T383910) (owner: 10Urbanecm) [23:43:53] !log pt1979@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1252.eqiad.wmnet with reason: host reimage [23:43:54] (03Merged) 10jenkins-bot: CommunityConfiguration: Enable on all wikis except locked down [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115404 (https://phabricator.wikimedia.org/T383910) (owner: 10Urbanecm) [23:44:12] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1115404|CommunityConfiguration: Enable on all wikis except locked down (T383910)]] [23:44:17] T383910: Deploy the CommunityConfiguration extension on all wikis - https://phabricator.wikimedia.org/T383910 [23:46:51] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1115404|CommunityConfiguration: Enable on all wikis except locked down (T383910)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:47:29] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1252.eqiad.wmnet with reason: host reimage [23:51:09] !log pt1979@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1253.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:51:46] !log urbanecm@deploy2002 urbanecm: Continuing with sync [23:52:07] 
FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1071-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:57:53] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115404|CommunityConfiguration: Enable on all wikis except locked down (T383910)]] (duration: 13m 41s) [23:57:59] T383910: Deploy the CommunityConfiguration extension on all wikis - https://phabricator.wikimedia.org/T383910