[00:02:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P93113 and previous config saved to /var/cache/conftool/dbconfig/20260527-000209-fceratto.json
[00:12:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P93114 and previous config saved to /var/cache/conftool/dbconfig/20260527-001220-fceratto.json
[00:17:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:22:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T426633)', diff saved to https://phabricator.wikimedia.org/P93115 and previous config saved to /var/cache/conftool/dbconfig/20260527-002228-fceratto.json
[00:23:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[00:23:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T426633)', diff saved to https://phabricator.wikimedia.org/P93116 and previous config saved to /var/cache/conftool/dbconfig/20260527-002309-fceratto.json
[00:31:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T426633)', diff saved to https://phabricator.wikimedia.org/P93117 and previous config saved to /var/cache/conftool/dbconfig/20260527-003141-fceratto.json
[00:41:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P93118 and previous config saved to /var/cache/conftool/dbconfig/20260527-004149-fceratto.json
[00:51:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P93119 and previous config saved to /var/cache/conftool/dbconfig/20260527-005157-fceratto.json
[00:52:33] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1293102 (owner: L10n-bot)
[01:02:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T426633)', diff saved to https://phabricator.wikimedia.org/P93120 and previous config saved to /var/cache/conftool/dbconfig/20260527-010205-fceratto.json
[01:02:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[01:02:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T426633)', diff saved to https://phabricator.wikimedia.org/P93121 and previous config saved to /var/cache/conftool/dbconfig/20260527-010234-fceratto.json
[01:04:49] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/1293107 (owner: L10n-bot)
[01:09:35] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1293825
[01:09:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1293825 (owner: TrainBranchBot)
[01:11:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T426633)', diff saved to https://phabricator.wikimedia.org/P93122 and previous config saved to /var/cache/conftool/dbconfig/20260527-011111-fceratto.json
[01:21:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P93123 and previous config saved to /var/cache/conftool/dbconfig/20260527-012119-fceratto.json
[01:23:38] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1293825 (owner: TrainBranchBot)
[01:31:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P93124 and previous config saved to /var/cache/conftool/dbconfig/20260527-013126-fceratto.json
[01:41:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T426633)', diff saved to https://phabricator.wikimedia.org/P93125 and previous config saved to /var/cache/conftool/dbconfig/20260527-014134-fceratto.json
[01:41:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[01:42:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2179 (T426633)', diff saved to https://phabricator.wikimedia.org/P93126 and previous config saved to /var/cache/conftool/dbconfig/20260527-014204-fceratto.json
[01:50:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T426633)', diff saved to https://phabricator.wikimedia.org/P93127 and previous config saved to /var/cache/conftool/dbconfig/20260527-015037-fceratto.json
[02:00:45] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:00:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P93128 and previous config saved to /var/cache/conftool/dbconfig/20260527-020045-fceratto.json
[02:07:15] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 29s)
[02:08:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:09:14] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P93129 and previous config saved to /var/cache/conftool/dbconfig/20260527-021053-fceratto.json
[02:21:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T426633)', diff saved to https://phabricator.wikimedia.org/P93130 and previous config saved to /var/cache/conftool/dbconfig/20260527-022100-fceratto.json
[02:21:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
[02:21:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2206 (T426633)', diff saved to https://phabricator.wikimedia.org/P93131 and previous config saved to /var/cache/conftool/dbconfig/20260527-022133-fceratto.json
[02:29:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T426633)', diff saved to https://phabricator.wikimedia.org/P93132 and previous config saved to /var/cache/conftool/dbconfig/20260527-022953-fceratto.json
[02:34:14] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:35:29] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:14] RESOLVED: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P93133 and previous config saved to /var/cache/conftool/dbconfig/20260527-024000-fceratto.json
[02:47:08] (PS1) RLazarus: Refactor the backend regex in ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1293839
[02:50:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P93134 and previous config saved to /var/cache/conftool/dbconfig/20260527-025008-fceratto.json
[02:54:41] (CR) RLazarus: "Not particularly urgent, just a tiny quality-of-life improvement. :)" [alerts] - https://gerrit.wikimedia.org/r/1293839 (owner: RLazarus)
[03:00:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T426633)', diff saved to https://phabricator.wikimedia.org/P93135 and previous config saved to /var/cache/conftool/dbconfig/20260527-030016-fceratto.json
[03:00:35] FIRING: DiskSpace: Disk space krb1002:9100:/ 2.867% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:00:38] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance
[03:00:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2210 (T426633)', diff saved to https://phabricator.wikimedia.org/P93136 and previous config saved to /var/cache/conftool/dbconfig/20260527-030045-fceratto.json
[03:05:35] RESOLVED: DiskSpace: Disk space krb1002:9100:/ 2.263% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=krb1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:05:53] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:06:07] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:06:39] FIRING: CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[03:07:10] FIRING: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:07:55] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:08:07] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:09:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T426633)', diff saved to https://phabricator.wikimedia.org/P93137 and previous config saved to /var/cache/conftool/dbconfig/20260527-030915-fceratto.json
[03:11:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[03:12:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:19:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P93138 and previous config saved to /var/cache/conftool/dbconfig/20260527-031923-fceratto.json
[03:29:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P93139 and previous config saved to /var/cache/conftool/dbconfig/20260527-032931-fceratto.json
[03:39:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T426633)', diff saved to https://phabricator.wikimedia.org/P93140 and previous config saved to /var/cache/conftool/dbconfig/20260527-033938-fceratto.json
[03:40:01] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
[03:40:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2219 (T426633)', diff saved to https://phabricator.wikimedia.org/P93141 and previous config saved to /var/cache/conftool/dbconfig/20260527-034008-fceratto.json
[03:48:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T426633)', diff saved to https://phabricator.wikimedia.org/P93142 and previous config saved to /var/cache/conftool/dbconfig/20260527-034828-fceratto.json
[03:58:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P93143 and previous config saved to /var/cache/conftool/dbconfig/20260527-035836-fceratto.json
[04:07:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:08:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P93144 and previous config saved to /var/cache/conftool/dbconfig/20260527-040844-fceratto.json
[04:18:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T426633)', diff saved to https://phabricator.wikimedia.org/P93145 and previous config saved to /var/cache/conftool/dbconfig/20260527-041852-fceratto.json
[04:19:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
[04:19:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2236 (T426633)', diff saved to https://phabricator.wikimedia.org/P93146 and previous config saved to /var/cache/conftool/dbconfig/20260527-041921-fceratto.json
[04:27:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T426633)', diff saved to https://phabricator.wikimedia.org/P93147 and previous config saved to /var/cache/conftool/dbconfig/20260527-042737-fceratto.json
[04:37:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P93148 and previous config saved to /var/cache/conftool/dbconfig/20260527-043744-fceratto.json
[04:47:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P93149 and previous config saved to /var/cache/conftool/dbconfig/20260527-044751-fceratto.json
[04:57:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T426633)', diff saved to https://phabricator.wikimedia.org/P93150 and previous config saved to /var/cache/conftool/dbconfig/20260527-045759-fceratto.json
[04:58:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
[04:58:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2237 (T426633)', diff saved to https://phabricator.wikimedia.org/P93151 and previous config saved to /var/cache/conftool/dbconfig/20260527-045827-fceratto.json
[05:06:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T426633)', diff saved to https://phabricator.wikimedia.org/P93152 and previous config saved to /var/cache/conftool/dbconfig/20260527-050645-fceratto.json
[05:08:17] (PS1) Marostegui: pc1024: Remove note [puppet] - https://gerrit.wikimedia.org/r/1293842
[05:09:14] (CR) Marostegui: [C:+2] pc1024: Remove note [puppet] - https://gerrit.wikimedia.org/r/1293842 (owner: Marostegui)
[05:16:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P93153 and previous config saved to /var/cache/conftool/dbconfig/20260527-051653-fceratto.json
[05:22:57] (PS1) Marostegui: instances.yaml: Remove pc1014 [puppet] - https://gerrit.wikimedia.org/r/1293843 (https://phabricator.wikimedia.org/T427270)
[05:24:21] (CR) Marostegui: [C:+2] instances.yaml: Remove pc1014 [puppet] - https://gerrit.wikimedia.org/r/1293843 (https://phabricator.wikimedia.org/T427270) (owner: Marostegui)
[05:26:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove pc1014 from dbctl T427270', diff saved to https://phabricator.wikimedia.org/P93154 and previous config saved to /var/cache/conftool/dbconfig/20260527-052624-marostegui.json
[05:26:29] T427270: decommission pc1014.eqiad.wmnet - https://phabricator.wikimedia.org/T427270
[05:27:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P93155 and previous config saved to /var/cache/conftool/dbconfig/20260527-052700-fceratto.json
[05:28:32] (PS2) Robertsky: Update wikimania wordmark for 2026 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331)
[05:33:00] (CR) Marostegui: "Yeah, you'd need to restart replication on sanitarium." [puppet] - https://gerrit.wikimedia.org/r/1292346 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[05:33:03] (CR) Marostegui: [C:+1] Add config for conductwiki [puppet] - https://gerrit.wikimedia.org/r/1292346 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[05:37:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T426633)', diff saved to https://phabricator.wikimedia.org/P93156 and previous config saved to /var/cache/conftool/dbconfig/20260527-053708-fceratto.json
[05:37:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2245.codfw.wmnet with reason: Maintenance
[05:37:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2245 (T426633)', diff saved to https://phabricator.wikimedia.org/P93157 and previous config saved to /var/cache/conftool/dbconfig/20260527-053727-fceratto.json
[05:38:07] !log remove ganeti1026 from eqiad Ganeti cluster T424680
[05:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:11] T424680: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680
[05:39:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[05:40:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2055: Upgrading es2055.codfw.wmnet
[05:40:29] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2055: Upgrading es2055.codfw.wmnet
[05:41:31] PROBLEM - ganeti-confd running on ganeti1026 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[05:41:31] PROBLEM - ganeti-noded running on ganeti1026 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[05:41:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2055.codfw.wmnet with OS trixie
[05:42:50] FIRING: ProbeDown: Service ganeti1026:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:45:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T426633)', diff saved to https://phabricator.wikimedia.org/P93159 and previous config saved to /var/cache/conftool/dbconfig/20260527-054550-fceratto.json
[05:49:14] PROBLEM - SSH on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:49:38] PROBLEM - librenms.wikimedia.org requires authentication on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[05:50:02] PROBLEM - librenms.wikimedia.org tls expiry on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[05:52:10] RECOVERY - SSH on netmon2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u10 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:53:35] (PS3) JavierMonton: image: Flink 2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1293664 (https://phabricator.wikimedia.org/T412978)
[05:54:14] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:55:14] PROBLEM - SSH on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:55:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P93160 and previous config saved to /var/cache/conftool/dbconfig/20260527-055558-fceratto.json
[05:56:43] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2055.codfw.wmnet with reason: host reimage
[05:56:57] (PS3) Robertsky: Update wikimania wordmark for 2026 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331)
[05:57:06] RECOVERY - SSH on netmon2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u10 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:57:28] RECOVERY - librenms.wikimedia.org requires authentication on netmon2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 701 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[05:57:52] RECOVERY - librenms.wikimedia.org tls expiry on netmon2002 is OK: OK - Certificate librenms.wikimedia.org will expire on Sun 12 Jul 2026 02:51:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[05:58:14] (CR) Robertsky: Update wikimania wordmark for 2026 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331) (owner: Robertsky)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T0600)
[06:02:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331) (owner: Robertsky)
[06:04:14] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:04:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2055.codfw.wmnet with reason: host reimage
[06:06:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P93161 and previous config saved to /var/cache/conftool/dbconfig/20260527-060606-fceratto.json
[06:08:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:16:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T426633)', diff saved to https://phabricator.wikimedia.org/P93162 and previous config saved to /var/cache/conftool/dbconfig/20260527-061613-fceratto.json
[06:16:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2246.codfw.wmnet with reason: Maintenance
[06:16:41] (CR) Chlod Alejandro: [C:+1] Update wikimania wordmark for 2026 [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331) (owner: Robertsky)
[06:16:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2246 (T426633)', diff saved to https://phabricator.wikimedia.org/P93163 and previous config saved to /var/cache/conftool/dbconfig/20260527-061643-fceratto.json
[06:21:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2055.codfw.wmnet with OS trixie
[06:21:56] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[06:22:32] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2055: repool after maintenance
[06:22:50] RESOLVED: ProbeDown: Service ganeti1026:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:25:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T426633)', diff saved to https://phabricator.wikimedia.org/P93165 and previous config saved to /var/cache/conftool/dbconfig/20260527-062503-fceratto.json
[06:30:42] (PS1) JavierMonton: html-enrichment: relax offset lag monitors [alerts] - https://gerrit.wikimedia.org/r/1294113 (https://phabricator.wikimedia.org/T423920)
[06:35:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P93166 and previous config saved to /var/cache/conftool/dbconfig/20260527-063511-fceratto.json
[06:36:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:36:35] (PS2) JavierMonton: html-enrichment: relax offset lag monitors [alerts] - https://gerrit.wikimedia.org/r/1294113 (https://phabricator.wikimedia.org/T423920)
[06:44:02] (CR) JMeybohm: [C:+1] Remove k8s version from all services [deployment-charts] - https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková)
[06:44:13] (CR) JMeybohm: [C:+1] CI: Fix race condition [deployment-charts] - https://gerrit.wikimedia.org/r/1293757 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková)
[06:45:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P93168 and previous config saved to /var/cache/conftool/dbconfig/20260527-064519-fceratto.json
[06:45:30] (CR) Ryan Kemper: [C:+1] relforge: remove logstash (gelf) profile [puppet] - https://gerrit.wikimedia.org/r/1293809 (https://phabricator.wikimedia.org/T324335) (owner: Bking)
[06:45:39] (CR) JMeybohm: [C:+1] "I think it's just taking it's time because changing the rake_modules triggers a full CI run. ~25min is not uncommon for that." [deployment-charts] - https://gerrit.wikimedia.org/r/1293757 (https://phabricator.wikimedia.org/T388969) (owner: Kamila Součková)
[06:50:52] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1293790 (https://phabricator.wikimedia.org/T427312) (owner: Scott French)
[06:51:23] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1293789 (https://phabricator.wikimedia.org/T427312) (owner: Scott French)
[06:54:19] (PS1) Muehlenhoff: Remove ganeti1025/1026 [puppet] - https://gerrit.wikimedia.org/r/1294115 (https://phabricator.wikimedia.org/T424680)
[06:54:42] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1025.eqiad.wmnet
[06:55:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T426633)', diff saved to https://phabricator.wikimedia.org/P93170 and previous config saved to /var/cache/conftool/dbconfig/20260527-065526-fceratto.json
[06:55:38] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2247.codfw.wmnet with reason: Maintenance
[06:55:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2247 (T426633)', diff saved to https://phabricator.wikimedia.org/P93171 and previous config saved to /var/cache/conftool/dbconfig/20260527-065545-fceratto.json
[06:59:55] (Abandoned) Elukey: admin_ng: disable tag->sha256 for all ml clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1163712 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[06:59:56] jmm@cumin2002 decommission (PID 1477266) is awaiting input
[07:00:05] Amir1, urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T0700). Please do the needful.
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:35] (03CR) 10Elukey: [C:03+2] team-sre: modify pki's alert to notify users earlier [alerts] - 10https://gerrit.wikimedia.org/r/1286923 (owner: 10Elukey) [07:02:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1003.wikimedia.org [07:02:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:04:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T426633)', diff saved to https://phabricator.wikimedia.org/P93172 and previous config saved to /var/cache/conftool/dbconfig/20260527-070410-fceratto.json [07:06:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1190.eqiad.wmnet with reason: Maintenance on db1190 [07:06:17] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:07:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1003.wikimedia.org [07:07:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2003.wikimedia.org [07:07:57] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2055: repool after maintenance [07:11:04] (03PS1) 10Marostegui: mariadb: Decommission pc1014 [puppet] - 10https://gerrit.wikimedia.org/r/1294123 (https://phabricator.wikimedia.org/T427270) [07:11:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2003.wikimedia.org [07:11:59] jmm@cumin2002 decommission (PID 1477266) is awaiting input [07:13:41] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1025.eqiad.wmnet decommissioned, removing all IPs except the asset 
tag one - jmm@cumin2002" [07:13:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.decommission [07:13:58] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts pc1014.eqiad.wmnet [07:14:12] (PS1) Muehlenhoff: Failover url downloaders for reboots [dns] - https://gerrit.wikimedia.org/r/1294124 [07:14:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P93174 and previous config saved to /var/cache/conftool/dbconfig/20260527-071418-fceratto.json [07:14:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1025.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [07:14:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:14:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1025.eqiad.wmnet [07:14:37] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11958012 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1025.eqiad.wmnet` - ganeti1025.eqiad.wmne...
[07:15:19] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1026.eqiad.wmnet [07:18:27] jmm@cumin2002 decommission (PID 1491224) is awaiting input [07:18:57] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [07:20:17] (CR) Marostegui: [C:+2] mariadb: Decommission pc1014 [puppet] - https://gerrit.wikimedia.org/r/1294123 (https://phabricator.wikimedia.org/T427270) (owner: Marostegui) [07:23:09] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc1014.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [07:23:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc1014.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [07:23:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:23:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1014.eqiad.wmnet [07:23:28] !log marostegui@cumin1003 Removing pc1014 from zarcillo T427190 [07:23:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.decommission (exit_code=0) [07:23:33] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190#11958022 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1003 for hosts: `pc1014.eqiad.wmnet` - pc1014.eqiad.wmnet (**PASS**) - D...
[07:23:36] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190#11958023 (ops-monitoring-bot) pc1014 has been deleted from zarcillo [07:23:36] T427190: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190 [07:23:38] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190#11958024 (ops-monitoring-bot) pc1014 has been decommissioned by Data Persistence [07:24:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P93175 and previous config saved to /var/cache/conftool/dbconfig/20260527-072426-fceratto.json [07:24:54] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission pc1014.eqiad.wmnet - https://phabricator.wikimedia.org/T427270#11958028 (Marostegui) a:Marostegui→None [07:24:57] ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission pc1014.eqiad.wmnet - https://phabricator.wikimedia.org/T427270#11958034 (Marostegui) This host is ready for DC-Ops to decommission [07:25:38] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 1 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [07:26:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:26:34] (PS1) Mszwarc: Add script to demote ineligible members of restricted global groups [extensions/CentralAuth] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1294125 (https://phabricator.wikimedia.org/T425395) [07:26:49] (PS1) Mszwarc: Add script to demote ineligible members of restricted global groups [extensions/CentralAuth] (wmf/1.47.0-wmf.4) -
10https://gerrit.wikimedia.org/r/1294126 (https://phabricator.wikimedia.org/T425395) [07:28:08] (03CR) 10Marostegui: "Worked nicely, check the comment below, mostly UI related." [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [07:28:55] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:30:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1294125 (https://phabricator.wikimedia.org/T425395) (owner: 10Mszwarc) [07:30:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294126 (https://phabricator.wikimedia.org/T425395) (owner: 10Mszwarc) [07:32:06] (03Merged) 10jenkins-bot: Add script to demote ineligible members of restricted global groups [extensions/CentralAuth] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1294125 (https://phabricator.wikimedia.org/T425395) (owner: 10Mszwarc) [07:32:11] (03Merged) 10jenkins-bot: Add script to demote ineligible members of restricted global groups [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294126 (https://phabricator.wikimedia.org/T425395) (owner: 10Mszwarc) [07:32:42] (03CR) 10Slyngshede: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1294124 (owner: 10Muehlenhoff) [07:33:36] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1294125|Add script to demote ineligible members of restricted global groups (T425395)]], [[gerrit:1294126|Add script to demote ineligible members of restricted global groups (T425395)]] [07:33:41] T425395: Add a script to demote ineligible users from restricted global groups - https://phabricator.wikimedia.org/T425395 [07:34:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 
(T426633)', diff saved to https://phabricator.wikimedia.org/P93176 and previous config saved to /var/cache/conftool/dbconfig/20260527-073434-fceratto.json [07:34:40] jmm@cumin2002 decommission (PID 1491224) is awaiting input [07:34:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2248.codfw.wmnet with reason: Maintenance [07:35:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2248 (T426633)', diff saved to https://phabricator.wikimedia.org/P93177 and previous config saved to /var/cache/conftool/dbconfig/20260527-073504-fceratto.json [07:35:35] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1294125|Add script to demote ineligible members of restricted global groups (T425395)]], [[gerrit:1294126|Add script to demote ineligible members of restricted global groups (T425395)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:36:01] !log mszwarc@deploy1003 mszwarc: Continuing with deployment [07:38:33] (03CR) 10Marostegui: "We should also update: https://wikitech.wikimedia.org/wiki/MariaDB/Decommissioning_a_DB_Host" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [07:40:19] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294125|Add script to demote ineligible members of restricted global groups (T425395)]], [[gerrit:1294126|Add script to demote ineligible members of restricted global groups (T425395)]] (duration: 06m 42s) [07:40:24] T425395: Add a script to demote ineligible users from restricted global groups - https://phabricator.wikimedia.org/T425395 [07:40:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T426633)', diff saved to https://phabricator.wikimedia.org/P93178 and previous config saved to /var/cache/conftool/dbconfig/20260527-074031-fceratto.json [07:41:37] !log 
marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:42:02] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2051: Upgrading es2051.codfw.wmnet [07:42:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2051: Upgrading es2051.codfw.wmnet [07:43:35] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1026.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [07:43:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1026.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [07:43:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:43:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1026.eqiad.wmnet [07:43:53] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11958085 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1026.eqiad.wmnet` - ganeti1026.eqiad.wmne...
[07:49:04] (03PS2) 10Muehlenhoff: Remove ganeti1025/1026 [puppet] - 10https://gerrit.wikimedia.org/r/1294115 (https://phabricator.wikimedia.org/T424680) [07:49:39] (03CR) 10Elukey: [C:03+1] Add urldownloader[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/1293743 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [07:50:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P93180 and previous config saved to /var/cache/conftool/dbconfig/20260527-075039-fceratto.json [07:52:50] (03PS1) 10Elukey: Set pki-root1001 to role insetup [puppet] - 10https://gerrit.wikimedia.org/r/1294179 (https://phabricator.wikimedia.org/T416664) [07:53:07] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti1025/1026 [puppet] - 10https://gerrit.wikimedia.org/r/1294115 (https://phabricator.wikimedia.org/T424680) (owner: 10Muehlenhoff) [07:56:25] 10ops-codfw, 06SRE, 06DC-Ops, 10procurement, and 2 others: decomission deploy2002.codfw.wmnet - https://phabricator.wikimedia.org/T426222#11958159 (10MLechvien-WMF) p:05Triage→03Medium [07:56:42] 10ops-codfw, 06SRE, 06DC-Ops, 10procurement, and 2 others: decommission deploy2002.codfw.wmnet - https://phabricator.wikimedia.org/T426222#11958160 (10MLechvien-WMF) [07:59:09] (03PS4) 10Mszwarc: Periodic jobs: add demote_ineligible_users (and _central_ counterpart) [puppet] - 10https://gerrit.wikimedia.org/r/1285315 (https://phabricator.wikimedia.org/T425396) [07:59:09] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2051.codfw.wmnet with OS trixie [07:59:22] (03PS5) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) [08:00:05] jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T0800) [08:00:20] morning, train will start soon [08:00:47] 
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P93181 and previous config saved to /var/cache/conftool/dbconfig/20260527-080046-fceratto.json [08:01:57] (03PS1) 10Jelto: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) [08:02:14] (03CR) 10Federico Ceratto: cookbooks/sre/mysql/decommission: add cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [08:02:26] (03CR) 10CI reject: [V:04-1] cookbooks/sre/mysql/decommission: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [08:02:59] (03CR) 10Mszwarc: "I433a6c82f42550b9c91d1ed5691dc5b12d4c34df has been merged and backported to wikis" [puppet] - 10https://gerrit.wikimedia.org/r/1285315 (https://phabricator.wikimedia.org/T425396) (owner: 10Mszwarc) [08:03:27] (03CR) 10Marostegui: [C:03+1] "All good from my side, pending the discussion with Ceri." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1291952 (https://phabricator.wikimedia.org/T426613) (owner: 10Federico Ceratto) [08:03:51] (03PS1) 10TrainBranchBot: group1 to 1.47.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294209 (https://phabricator.wikimedia.org/T423913) [08:03:53] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294209 (https://phabricator.wikimedia.org/T423913) (owner: 10TrainBranchBot) [08:04:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:05:29] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:05:38] (03CR) 10Muehlenhoff: [C:03+2] Failover url downloaders for reboots [dns] - 10https://gerrit.wikimedia.org/r/1294124 (owner: 10Muehlenhoff) [08:05:43] !log jmm@dns1004 START - running authdns-update [08:05:54] (03Merged) 10jenkins-bot: group1 to 1.47.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294209 (https://phabricator.wikimedia.org/T423913) (owner: 10TrainBranchBot) [08:07:23] !log jmm@dns1004 END - running authdns-update [08:07:26] (03CR) 10Muehlenhoff: Set pki-root1001 to role insetup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294179 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [08:08:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11958176 (10MoritzMuehlenhoff) [08:10:55] !log fceratto@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db2248 (T426633)', diff saved to https://phabricator.wikimedia.org/P93182 and previous config saved to /var/cache/conftool/dbconfig/20260527-081054-fceratto.json [08:11:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [08:11:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T426633)', diff saved to https://phabricator.wikimedia.org/P93183 and previous config saved to /var/cache/conftool/dbconfig/20260527-081112-fceratto.json [08:11:45] (03PS2) 10Elukey: Set pki-root1001 to role insetup [puppet] - 10https://gerrit.wikimedia.org/r/1294179 (https://phabricator.wikimedia.org/T416664) [08:11:54] (03CR) 10Elukey: Set pki-root1001 to role insetup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294179 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [08:11:56] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.47.0-wmf.4 refs T423913 [08:12:01] T423913: 1.47.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T423913 [08:15:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2051.codfw.wmnet with reason: host reimage [08:16:31] (03PS2) 10Arnaudb: gitlab: add envoy on Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1293722 (https://phabricator.wikimedia.org/T425441) [08:16:31] (03CR) 10Arnaudb: "thanks for the reviews, all things considered I think it's better to avoid adding Envoy on WMCS outside of the scope of a dedicated task" [puppet] - 10https://gerrit.wikimedia.org/r/1293722 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [08:18:06] (03PS5) 10Arnaudb: trafficserver: add a map for gitlab as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) [08:18:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
es2051.codfw.wmnet with reason: host reimage [08:19:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T426633)', diff saved to https://phabricator.wikimedia.org/P93184 and previous config saved to /var/cache/conftool/dbconfig/20260527-081942-fceratto.json [08:27:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2004.wikimedia.org [08:27:56] (03CR) 10Atsuko: [C:03+2] image: Flink 2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293664 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [08:28:41] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Repurpose ganeti102[3456] for Zuul migration - https://phabricator.wikimedia.org/T427353 (10MoritzMuehlenhoff) 03NEW [08:29:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P93185 and previous config saved to /var/cache/conftool/dbconfig/20260527-082950-fceratto.json [08:29:55] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Repurpose ganeti102[3456] for Zuul migration - https://phabricator.wikimedia.org/T427353#11958231 (10MoritzMuehlenhoff) [08:31:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2004.wikimedia.org [08:31:49] (03PS6) 10Arnaudb: trafficserver: add a map for gitlab as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) [08:32:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1004.wikimedia.org [08:33:45] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [08:33:57] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2166: Upgrading db2166.codfw.wmnet [08:34:17] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2166: Upgrading db2166.codfw.wmnet [08:35:07] !log marostegui@cumin1003 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2051.codfw.wmnet with OS trixie [08:35:38] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2166.codfw.wmnet with OS trixie [08:36:00] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [08:36:22] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1203: Upgrading db1203.eqiad.wmnet [08:36:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1004.wikimedia.org [08:36:51] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1203: Upgrading db1203.eqiad.wmnet [08:37:43] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [08:38:27] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2051: repool after maintenance [08:38:28] (03PS1) 10Phuedx: ext.wikimediaEvents: Add hoisting error detection test [extensions/WikimediaEvents] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294217 (https://phabricator.wikimedia.org/T427092) [08:38:40] (03PS1) 10Blake: mcrouter_wancache: swap mc1055 for mc1054 for trixie testing [puppet] - 10https://gerrit.wikimedia.org/r/1294216 (https://phabricator.wikimedia.org/T426044) [08:38:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294217 (https://phabricator.wikimedia.org/T427092) (owner: 10Phuedx) [08:39:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P93189 and previous config saved to /var/cache/conftool/dbconfig/20260527-083957-fceratto.json [08:41:15] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - 
https://phabricator.wikimedia.org/T426197#11958339 (10ayounsi) [08:41:29] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1203.eqiad.wmnet with OS trixie [08:42:08] (03PS4) 10Arnaudb: gitlab: add envoy on Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1293722 (https://phabricator.wikimedia.org/T425441) [08:42:14] (03CR) 10Arnaudb: [C:03+2] gitlab: add envoy on Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1293722 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [08:42:24] (03CR) 10Muehlenhoff: [C:03+1] "Finally!" [puppet] - 10https://gerrit.wikimedia.org/r/1294179 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [08:43:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11958357 (10ayounsi) [08:43:32] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11958374 (10ayounsi) [08:47:26] (03PS3) 10Filippo Giunchedi: alerts: add transformations option [puppet] - 10https://gerrit.wikimedia.org/r/1291947 (https://phabricator.wikimedia.org/T424814) [08:47:26] (03PS3) 10Filippo Giunchedi: toolforge: use alerts::deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1291948 (https://phabricator.wikimedia.org/T424814) [08:47:33] (03CR) 10Filippo Giunchedi: alerts: add transformations option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1291947 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [08:48:02] (03CR) 10Hnowlan: prometheus: add deployment label to appservers RED recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1293080 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [08:50:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T426633)', diff saved to https://phabricator.wikimedia.org/P93190 
and previous config saved to /var/cache/conftool/dbconfig/20260527-085005-fceratto.json [08:50:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [08:50:18] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, though please consider also absenting mcrouter for puppet to do the cleanup instead of manually" [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [08:50:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T426633)', diff saved to https://phabricator.wikimedia.org/P93191 and previous config saved to /var/cache/conftool/dbconfig/20260527-085024-fceratto.json [08:50:29] !log depooling and installing haproxy-awslc on cp3074 and cp3066 (T419825) [08:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:34] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [08:51:16] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:51:28] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp3074.* [08:51:40] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp3066.* [08:51:41] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:51:46] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [08:51:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290781 (https://phabricator.wikimedia.org/T426960) (owner: 10Krinkle) [08:52:03] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:52:08] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[08:52:23] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [08:52:28] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [08:52:46] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [08:52:52] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:53:07] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:53:11] !log jayme@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [08:53:29] !log jayme@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [08:53:33] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [08:53:58] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:54:00] (CR) Fabfur: [C:+2] hiera: using haproxy-awslc on cp3074,cp3066 [puppet] - https://gerrit.wikimedia.org/r/1289998 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur) [08:54:02] !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [08:54:04] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2166.codfw.wmnet with reason: host reimage [08:54:20] !log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [08:54:40] !log restart swift on ms-fe2011 T360913 [08:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:44] T360913: Swift proxy server misbehaviour (no longer calling `accept`?)
- https://phabricator.wikimedia.org/T360913 [08:55:25] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1203.eqiad.wmnet with reason: host reimage [08:57:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T426633)', diff saved to https://phabricator.wikimedia.org/P93193 and previous config saved to /var/cache/conftool/dbconfig/20260527-085751-fceratto.json [08:59:22] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2166.codfw.wmnet with reason: host reimage [09:00:41] (PS3) Effie Mouzeli: scap: remove testservers 4 [puppet] - https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) [09:01:44] (Abandoned) Effie Mouzeli: mw-mcrouter: use puppet defined image [deployment-charts] - https://gerrit.wikimedia.org/r/1054580 (owner: Effie Mouzeli) [09:02:31] !log slyngshede@cumin1003 conftool action : set/pooled=yes; selector: name=cp6015.* [09:02:38] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1203.eqiad.wmnet with reason: host reimage [09:02:47] !log slyngshede@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp6015.drmrs.wmnet [09:02:47] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp6015.drmrs.wmnet [09:02:58] !log repooling cp3074 and cp3066 (T419825) [09:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:03] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [09:03:09] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp3066.* [09:03:12] (CR) JMeybohm: [C:+1] Update to kubernetes v1.31.14.
[debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1293087 (https://phabricator.wikimedia.org/T427065) (owner: 10Blake) [09:03:16] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp3074.* [09:03:25] (03CR) 10JMeybohm: [C:03+2] Remove pinned chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293750 (https://phabricator.wikimedia.org/T423251) (owner: 10JMeybohm) [09:03:44] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:03:47] 10ops-drmrs, 06DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11958478 (10SLyngshede-WMF) I've done a few check and there isn't any reason to reimage the host. I've removed the downtime and repooled the host. [09:03:58] 10ops-drmrs, 06DC-Ops: cp6015 network error - https://phabricator.wikimedia.org/T426968#11958479 (10SLyngshede-WMF) 05Open→03Resolved [09:04:42] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:04:49] (03CR) 10Filippo Giunchedi: [C:03+2] alerts: add transformations option [puppet] - 10https://gerrit.wikimedia.org/r/1291947 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [09:04:55] (03CR) 10Filippo Giunchedi: [C:03+2] toolforge: use alerts::deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1291948 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [09:05:00] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl 
successfully ran 0h, 2 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:05:14] PROBLEM - Confd vcl based reload on cp6015 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:05:42] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:05:44] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:08:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P93194 and previous config saved to /var/cache/conftool/dbconfig/20260527-090759-fceratto.json [09:08:19] (03PS1) 10Elukey: role::ml_k8s::staging::master: enable IPIP encapsulation [puppet] - 10https://gerrit.wikimedia.org/r/1294223 (https://phabricator.wikimedia.org/T420438) [09:08:20] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:08:21] (03PS1) 10Elukey: Set ml-staging-ctrl to the Maglev scheduler and fix stale options [puppet] - 10https://gerrit.wikimedia.org/r/1294224 (https://phabricator.wikimedia.org/T420438) [09:08:23] (03PS1) 10Elukey: role::ml_k8s::staging::worker: enable IPIP encapsulation [puppet] - 10https://gerrit.wikimedia.org/r/1294225 (https://phabricator.wikimedia.org/T420438) [09:08:26] (03PS1) 10Elukey: Set Maglev's scheduling for inference-staging and ingress [puppet] - 10https://gerrit.wikimedia.org/r/1294226 (https://phabricator.wikimedia.org/T420438) [09:09:14] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:09:18] RECOVERY - mailman 
list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 04 Aug 2026 03:33:57 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:10:29] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:10:57] (03CR) 10Clément Goubert: [C:03+2] cache::text: pipe caching for lw streaming API [puppet] - 10https://gerrit.wikimedia.org/r/1293746 (https://phabricator.wikimedia.org/T425680) (owner: 10Clément Goubert) [09:11:29] (03Merged) 10jenkins-bot: Remove pinned chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293750 (https://phabricator.wikimedia.org/T423251) (owner: 10JMeybohm) [09:16:36] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2166.codfw.wmnet with OS trixie [09:18:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P93196 and previous config saved to /var/cache/conftool/dbconfig/20260527-091806-fceratto.json [09:19:59] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1203.eqiad.wmnet with OS trixie [09:23:40] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [09:23:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2051: repool after maintenance [09:24:30] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2166: Migration of db2166.codfw.wmnet completed [09:25:19] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1050: repool after maintenance [09:25:20] !log marostegui@cumin1003 END (PASS) - Cookbook 
sre.mysql.pool (exit_code=0) pool es1050: repool after maintenance [09:25:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:25:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1050: Upgrading es1050.eqiad.wmnet [09:26:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1050: Upgrading es1050.eqiad.wmnet [09:26:45] (03PS2) 10Effie Mouzeli: mcrouter_wancache: swap mc1055 for mc1054 for trixie testing [puppet] - 10https://gerrit.wikimedia.org/r/1294216 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [09:27:00] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1050.eqiad.wmnet with OS trixie [09:28:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T426633)', diff saved to https://phabricator.wikimedia.org/P93200 and previous config saved to /var/cache/conftool/dbconfig/20260527-092814-fceratto.json [09:28:32] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1203: Migration of db1203.eqiad.wmnet completed [09:28:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [09:28:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93202 and previous config saved to /var/cache/conftool/dbconfig/20260527-092842-fceratto.json [09:28:53] (03PS4) 10Arnaudb: gitlab: use service name for upstream addr [puppet] - 10https://gerrit.wikimedia.org/r/1294219 (https://phabricator.wikimedia.org/T425441) [09:28:53] (03CR) 10Arnaudb: "That change will require a gitlab-ctl reconfigure (run by puppet), so it will trigger a short unavailability period. 
I suggest to merge it" [puppet] - 10https://gerrit.wikimedia.org/r/1294219 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [09:30:44] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1014.eqiad.wmnet - https://phabricator.wikimedia.org/T427270#11958567 (10Jclark-ctr) a:03Jclark-ctr [09:32:06] (03CR) 10Blake: [C:03+2] Update to kubernetes v1.31.14. [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1293087 (https://phabricator.wikimedia.org/T427065) (owner: 10Blake) [09:32:57] (03CR) 10Arnaudb: [C:03+2] vrts: alerts for the new antispam pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1293667 (https://phabricator.wikimedia.org/T402260) (owner: 10Arnaudb) [09:34:48] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:34:59] (03Merged) 10jenkins-bot: vrts: alerts for the new antispam pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1293667 (https://phabricator.wikimedia.org/T402260) (owner: 10Arnaudb) [09:36:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93203 and previous config saved to /var/cache/conftool/dbconfig/20260527-093609-fceratto.json [09:36:18] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:36:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] and decom ganeti102[3456] - https://phabricator.wikimedia.org/T424680#11958604 (10MoritzMuehlenhoff) 05Open→03Resolved All done [09:37:01] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:37:16] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [09:38:04] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:38:19] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[09:41:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1050.eqiad.wmnet with reason: host reimage [09:43:50] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:45:39] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11958638 (10ABran-WMF) 05In progress→03Resolved Alerts have been merged, I'm marking this as `Resolved`, feel free to... [09:46:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1050.eqiad.wmnet with reason: host reimage [09:46:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P93206 and previous config saved to /var/cache/conftool/dbconfig/20260527-094616-fceratto.json [09:46:38] (03CR) 10Effie Mouzeli: [C:03+1] mcrouter_wancache: swap mc1055 for mc1054 for trixie testing [puppet] - 10https://gerrit.wikimedia.org/r/1294216 (https://phabricator.wikimedia.org/T426044) (owner: 10Blake) [09:47:07] (03CR) 10Arnaudb: "the previous deployment calendar link is broken: https://wikitech.wikimedia.org/wiki/Deployments#Friday,_May_29" [puppet] - 10https://gerrit.wikimedia.org/r/1294219 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [09:47:26] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. 
[09:50:47] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289349 (owner: 10PipelineBot) [09:53:13] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289349 (owner: 10PipelineBot) [09:56:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P93208 and previous config saved to /var/cache/conftool/dbconfig/20260527-095624-fceratto.json [09:58:30] (03PS2) 10Jelto: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) [09:59:00] (03PS5) 10Dzahn: tcpproxy: add support for gitlab-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) [09:59:17] (03PS8) 10Arnaudb: service: add gitlab-https and gitlab-ssh service to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) [09:59:17] (03PS8) 10Arnaudb: lvs7003: add gitlab-ssh and gitlab-https [puppet] - 10https://gerrit.wikimedia.org/r/1291898 (https://phabricator.wikimedia.org/T425441) [09:59:55] (03PS1) 10STran: Deploy IRS Direct Reporting feature to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294229 (https://phabricator.wikimedia.org/T427369) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T1000) [10:02:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1050.eqiad.wmnet with OS trixie [10:03:42] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: tighten rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289992 (https://phabricator.wikimedia.org/T424821) (owner: 10Daniel Kinzler) [10:04:40] !log marostegui@cumin1003 END (FAIL) - 
Cookbook sre.mysql.major-upgrade (exit_code=99) [10:05:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1050: repool after maintenance [10:06:14] (03Merged) 10jenkins-bot: rest-gateway: tighten rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289992 (https://phabricator.wikimedia.org/T424821) (owner: 10Daniel Kinzler) [10:06:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T426633)', diff saved to https://phabricator.wikimedia.org/P93211 and previous config saved to /var/cache/conftool/dbconfig/20260527-100632-fceratto.json [10:06:39] (03CR) 10Muehlenhoff: "Looks good, but see comment inline about moving to 7.3.7" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1285804 (owner: 10Slyngshede) [10:06:54] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [10:07:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T426633)', diff saved to https://phabricator.wikimedia.org/P93212 and previous config saved to /var/cache/conftool/dbconfig/20260527-100701-fceratto.json [10:08:39] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:08:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:47] FIRING: HelmReleaseBadStatus: Helm release kube-system/calico on k8s@codfw in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:10:00] !log cwilliams@cumin1003 END (PASS) - Cookbook 
sre.mysql.pool (exit_code=0) pool db2166: Migration of db2166.codfw.wmnet completed [10:10:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:10:08] (03PS3) 10Jelto: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) [10:10:30] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:11:18] (03CR) 10Federico Ceratto: sre.mysql.upgrade: fix looping logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [10:13:16] (03CR) 10LSobanski: "Just to confirm, there is no way of making port 22 work externally on the GitLab IP of the TCP proxies?" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [10:14:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1203: Migration of db1203.eqiad.wmnet completed [10:14:02] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:14:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T426633)', diff saved to https://phabricator.wikimedia.org/P93215 and previous config saved to /var/cache/conftool/dbconfig/20260527-101426-fceratto.json [10:17:50] (03CR) 10Muehlenhoff: "This should be handled by SRE Clinic duty with a dedicated task" [puppet] - 10https://gerrit.wikimedia.org/r/1293769 (https://phabricator.wikimedia.org/T423255) (owner: 10ArielGlenn) [10:18:54] (03CR) 10Muehlenhoff: [C:03+2] profile::rpkivalidator: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1293609 (owner: 10Muehlenhoff) [10:19:07] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:21:37] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:21:55] 10ops-codfw, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: re-rack mc2055 (before Jun 9th) - https://phabricator.wikimedia.org/T427373 (10jijiki) 03NEW p:05Triage→03High [10:22:03] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:24:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P93217 and previous config saved to /var/cache/conftool/dbconfig/20260527-102434-fceratto.json [10:27:57] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [10:28:57] (03PS1) 10Muehlenhoff: rpkivalidator: Fix up previous patch [puppet] - 10https://gerrit.wikimedia.org/r/1294238 [10:29:20] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:29:24] (03CR) 10Marostegui: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [10:29:41] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:32:11] (03CR) 10Muehlenhoff: [C:03+2] rpkivalidator: Fix up previous patch [puppet] - 10https://gerrit.wikimedia.org/r/1294238 (owner: 10Muehlenhoff) [10:34:33] (03CR) 10JMeybohm: miscweb: remove wmf-navigator public and private config from web container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [10:34:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', 
diff saved to https://phabricator.wikimedia.org/P93218 and previous config saved to /var/cache/conftool/dbconfig/20260527-103441-fceratto.json [10:34:50] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:35:01] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2165: Upgrading db2165.codfw.wmnet [10:35:20] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2165: Upgrading db2165.codfw.wmnet [10:35:29] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:35:50] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1193: Upgrading db1193.eqiad.wmnet [10:36:29] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1193: Upgrading db1193.eqiad.wmnet [10:36:49] (03PS1) 10Arnaudb: vrts: skip pint validation on active/passive alerts [alerts] - 10https://gerrit.wikimedia.org/r/1294240 (https://phabricator.wikimedia.org/T402260) [10:36:52] (03CR) 10Arnaudb: [C:03+2] vrts: skip pint validation on active/passive alerts [alerts] - 10https://gerrit.wikimedia.org/r/1294240 (https://phabricator.wikimedia.org/T402260) (owner: 10Arnaudb) [10:38:00] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1193.eqiad.wmnet with OS trixie [10:38:33] (03Merged) 10jenkins-bot: vrts: skip pint validation on active/passive alerts [alerts] - 10https://gerrit.wikimedia.org/r/1294240 (https://phabricator.wikimedia.org/T402260) (owner: 10Arnaudb) [10:39:08] (03PS1) 10Muehlenhoff: Switch rpki2003 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1294241 [10:39:32] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2165.codfw.wmnet with OS trixie [10:41:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294241 (owner: 10Muehlenhoff) [10:44:20] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Anycast services - depool strategy in terms of BGP routing - 
https://phabricator.wikimedia.org/T420821#11958890 (10ayounsi) [10:44:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T426633)', diff saved to https://phabricator.wikimedia.org/P93222 and previous config saved to /var/cache/conftool/dbconfig/20260527-104449-fceratto.json [10:45:11] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [10:45:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T426633)', diff saved to https://phabricator.wikimedia.org/P93223 and previous config saved to /var/cache/conftool/dbconfig/20260527-104518-fceratto.json [10:46:04] (03PS3) 10Hnowlan: prometheus: add deployment label to appservers RED recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1293080 (https://phabricator.wikimedia.org/T249663) [10:47:02] (03PS1) 10STran: Set minimum edit count for skipcaptcha right to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) [10:50:34] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1050: repool after maintenance [10:51:01] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: add deployment label to appservers RED recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1293080 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [10:52:09] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage [10:52:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T426633)', diff saved to https://phabricator.wikimedia.org/P93225 and previous config saved to /var/cache/conftool/dbconfig/20260527-105235-fceratto.json [10:53:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: reimage to move primary IP from private1-c-eqiad to private1-c7-eqiad vlan - 
https://phabricator.wikimedia.org/T405632#11958935 (10ayounsi) [10:53:04] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan - https://phabricator.wikimedia.org/T405630#11958936 (10ayounsi) [10:56:36] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:56:46] (03CR) 10Mszwarc: [C:03+1] Deploy IRS Direct Reporting feature to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294229 (https://phabricator.wikimedia.org/T427369) (owner: 10STran) [10:57:22] (03PS2) 10Slyngshede: Update to CAS version 7.3.7 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1285804 [10:57:41] PROBLEM - Host db2189 #page is DOWN: CRITICAL - Network Unreachable (10.192.16.180) [10:57:52] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2165.codfw.wmnet with reason: host reimage [10:57:53] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: Add Icinga check for SRX cluster status - https://phabricator.wikimedia.org/T271298#11958946 (10ayounsi) 05Open→03Declined We're not going to add more stuff to Icinga. 
[10:57:57] (03CR) 10Slyngshede: Update to CAS version 7.3.7 (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1285804 (owner: 10Slyngshede) [10:58:07] (03CR) 10Cathal Mooney: [C:03+2] Interface validators: allow for channelized port numbers on Juniper [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1293649 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [10:58:14] !ack [10:58:14] 8023 (ACKED) Host db2189 (paged) [10:58:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage [10:58:47] RESOLVED: HelmReleaseBadStatus: Helm release kube-system/calico on k8s@codfw in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:00:05] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T1100). 
[11:00:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2189', diff saved to https://phabricator.wikimedia.org/P93226 and previous config saved to /var/cache/conftool/dbconfig/20260527-110016-marostegui.json [11:00:33] jelto: db2189 went down, I will handle it [11:00:33] (03Merged) 10jenkins-bot: Interface validators: allow for channelized port numbers on Juniper [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1293649 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [11:01:20] marostegui: I get no login process on serial, unsure if rebooting or stuck, but no normal state [11:01:30] (03CR) 10Harroyo-wmf: [C:03+1] Set minimum edit count for skipcaptcha right to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [11:01:32] will log out so you can take it from there [11:01:41] Great thank you [11:01:45] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2165.codfw.wmnet with reason: host reimage [11:01:45] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [11:02:06] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [11:02:10] (03PS1) 10STran: Update Direct Reporting email [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294247 (https://phabricator.wikimedia.org/T427358) [11:02:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294247 (https://phabricator.wikimedia.org/T427358) (owner: 10STran) [11:02:40] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:02:43] !log fceratto@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P93227 and previous config saved to /var/cache/conftool/dbconfig/20260527-110242-fceratto.json [11:02:51] (03CR) 10Mszwarc: [C:03+1] Update Direct Reporting email [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294247 (https://phabricator.wikimedia.org/T427358) (owner: 10STran) [11:02:58] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:03:29] (03PS1) 10Muehlenhoff: dbproxy: Remove unused public type [puppet] - 10https://gerrit.wikimedia.org/r/1294248 (https://phabricator.wikimedia.org/T149804) [11:04:24] (03CR) 10Ayounsi: [C:03+1] Switch rpki2003 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1294241 (owner: 10Muehlenhoff) [11:05:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1285804 (owner: 10Slyngshede) [11:05:17] (03CR) 10Dreamy Jazz: Set minimum edit count for skipcaptcha right to 10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [11:05:43] (03CR) 10Slyngshede: [V:03+2 C:03+2] Update to CAS version 7.3.7 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1285804 (owner: 10Slyngshede) [11:05:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294248 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [11:06:06] (03PS1) 10Marostegui: db2189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1294249 [11:06:12] (03PS1) 10Mvolz: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294250 [11:06:22] (03CR) 10Mvolz: [C:03+2] Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294250 (owner: 10Mvolz) [11:06:29] 06SRE, 06Infrastructure-Foundations, 10netops: Change codfw dns 
hosts BGP peering to top-of-rack switch - https://phabricator.wikimedia.org/T376894#11958977 (10ayounsi) [11:07:57] (03CR) 10Marostegui: [C:03+2] db2189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1294249 (owner: 10Marostegui) [11:08:37] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:08:41] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:08:47] (03Merged) 10jenkins-bot: Revert "citoid: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294250 (owner: 10Mvolz) [11:10:46] (03CR) 10Dreamy Jazz: [C:04-1] "How will this interact with https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/8098b104f08cb1bc91c2ddde9f1f669f2c84ab47/wmf-conf" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [11:10:46] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:10:55] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:11:32] FIRING: [2x] HelmReleaseBadStatus: Helm release kube-system/calico on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:11:39] (03CR) 10Dreamy Jazz: [C:04-1] Set minimum edit count for skipcaptcha right to 10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [11:12:08] 10ops-codfw, 06DBA, 06DC-Ops: db2189 crashed - https://phabricator.wikimedia.org/T427376 (10Marostegui) 03NEW [11:12:24] 10ops-codfw, 06DBA, 06DC-Ops: db2189 crashed - https://phabricator.wikimedia.org/T427376#11959011 (10Marostegui) p:05Triage→03Medium [11:12:33] 10ops-codfw, 06DBA, 06DC-Ops: db2189 crashed - 
https://phabricator.wikimedia.org/T427376#11959017 (10Marostegui) [11:12:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P93229 and previous config saved to /var/cache/conftool/dbconfig/20260527-111250-fceratto.json [11:13:39] (03CR) 10Dreamy Jazz: [C:04-1] Set minimum edit count for skipcaptcha right to 10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [11:15:30] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1193.eqiad.wmnet with OS trixie [11:16:17] (03CR) 10Dreamy Jazz: [C:04-1] Set minimum edit count for skipcaptcha right to 10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [11:17:32] (03CR) 10FNegri: sre.mysql.upgrade: fix looping logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [11:19:41] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2165.codfw.wmnet with OS trixie [11:22:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T426633)', diff saved to https://phabricator.wikimedia.org/P93230 and previous config saved to /var/cache/conftool/dbconfig/20260527-112257-fceratto.json [11:23:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [11:23:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T426633)', diff saved to https://phabricator.wikimedia.org/P93231 and previous config saved to /var/cache/conftool/dbconfig/20260527-112327-fceratto.json [11:23:55] (03PS1) 10Cathal Mooney: Interface validator: support channlized interface names [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/1294256 (https://phabricator.wikimedia.org/T427056) [11:24:20] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1193: Migration of db1193.eqiad.wmnet completed [11:29:02] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1013.eqiad.wmnet - https://phabricator.wikimedia.org/T427190#11959086 (10Jclark-ctr) 05Open→03Resolved [11:29:17] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2165: Migration of db2165.codfw.wmnet completed [11:30:14] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1014.eqiad.wmnet - https://phabricator.wikimedia.org/T427270#11959092 (10Jclark-ctr) [11:30:28] (03PS1) 10Muehlenhoff: profile::mariadb::proxy: Use Puppet types [puppet] - 10https://gerrit.wikimedia.org/r/1294258 [11:30:39] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1014.eqiad.wmnet - https://phabricator.wikimedia.org/T427270#11959094 (10Jclark-ctr) D6 U36 [11:31:16] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1014.eqiad.wmnet - https://phabricator.wikimedia.org/T427270#11959117 (10Jclark-ctr) [11:31:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T426633)', diff saved to https://phabricator.wikimedia.org/P93235 and previous config saved to /var/cache/conftool/dbconfig/20260527-113142-fceratto.json [11:32:22] (03CR) 10Ayounsi: [C:03+1] Interface validator: support channlized interface names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1294256 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [11:33:41] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11959124 (10MoritzMuehlenhoff) [11:36:18] (03CR) 10Cathal Mooney: [C:03+2] Interface validator: support channlized interface names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1294256 
(https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [11:39:23] (03Merged) 10jenkins-bot: Interface validator: support channlized interface names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1294256 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [11:39:37] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc1014.eqiad.wmnet - https://phabricator.wikimedia.org/T427270#11959173 (10Jclark-ctr) 05Open→03Resolved [11:40:06] 06SRE, 06Infrastructure-Foundations, 10netops: GRE Interfaces statistics not being returned by Juniper MX via gnmi - https://phabricator.wikimedia.org/T403936#11959179 (10ayounsi) 05Open→03Resolved a:03cmooney It's now showing up thanks to {T424683} https://grafana.wikimedia.org/goto/dfnbnedrb28sg... [11:40:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Map video and other large files to 'low-priority' network Qos queue - https://phabricator.wikimedia.org/T410133#11959189 (10cmooney) 05Open→03Resolved a:03cmooney We actaully added a mechanism to do this late last year when we had some une... 
[11:41:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P93237 and previous config saved to /var/cache/conftool/dbconfig/20260527-114149-fceratto.json [11:49:04] (03PS1) 10Majavah: memcached: Improve absenting support [puppet] - 10https://gerrit.wikimedia.org/r/1294259 (https://phabricator.wikimedia.org/T427189) [11:49:06] (03PS1) 10Majavah: prometheus: memcached_exporter: Improve absentability [puppet] - 10https://gerrit.wikimedia.org/r/1294260 (https://phabricator.wikimedia.org/T427189) [11:49:08] (03PS1) 10Majavah: P:openstack: cloudweb: Absent memcached and mcrouter services [puppet] - 10https://gerrit.wikimedia.org/r/1294261 (https://phabricator.wikimedia.org/T427189) [11:51:08] (03PS2) 10Majavah: P:openstack: cloudweb: Absent memcached and mcrouter services [puppet] - 10https://gerrit.wikimedia.org/r/1294261 (https://phabricator.wikimedia.org/T427189) [11:51:19] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Map internet-bound upload traffic to low-priority QoS queue - https://phabricator.wikimedia.org/T415649#11959238 (10cmooney) 05Open→03Declined I'm going to close this one. I hadn't fully thought out the way we serve things currently. `uplo... 
[11:51:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P93239 and previous config saved to /var/cache/conftool/dbconfig/20260527-115157-fceratto.json [11:53:42] (03PS3) 10Majavah: P:openstack: cloudweb: Absent memcached and mcrouter services [puppet] - 10https://gerrit.wikimedia.org/r/1294261 (https://phabricator.wikimedia.org/T427189) [11:55:30] (03PS4) 10Majavah: P:openstack: cloudweb: Absent memcached and mcrouter services [puppet] - 10https://gerrit.wikimedia.org/r/1294261 (https://phabricator.wikimedia.org/T427189) [11:56:38] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [11:58:04] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8584/co" [puppet] - 10https://gerrit.wikimedia.org/r/1294261 (https://phabricator.wikimedia.org/T427189) (owner: 10Majavah) [11:58:24] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "is everything alright? /cc effie - ayounsi@cumin1003" [11:58:29] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "is everything alright? 
/cc effie - ayounsi@cumin1003" [11:58:44] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [11:58:45] looks like we're all good [12:00:22] (03PS1) 10Matthias Mullie: MMV Carousel: Restore click-to-open for carousel thumbnails [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294264 (https://phabricator.wikimedia.org/T426225) [12:01:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294264 (https://phabricator.wikimedia.org/T426225) (owner: 10Matthias Mullie) [12:01:11] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [12:01:26] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11959265 (10Ladsgroup) [12:02:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T426633)', diff saved to https://phabricator.wikimedia.org/P93242 and previous config saved to /var/cache/conftool/dbconfig/20260527-120205-fceratto.json [12:02:47] (03CR) 10Marostegui: sre.mysql.upgrade: fix looping logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291999 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [12:04:21] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8586/c" [puppet] - 10https://gerrit.wikimedia.org/r/1294259 (https://phabricator.wikimedia.org/T427189) (owner: 10Majavah) [12:04:45] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2212.codfw.wmnet with reason: Maintenance [12:04:53] !log fceratto@cumin1003 
dbctl commit (dc=all): 'Depooling db2212 (T426633)', diff saved to https://phabricator.wikimedia.org/P93243 and previous config saved to /var/cache/conftool/dbconfig/20260527-120452-fceratto.json [12:05:12] (03PS2) 10STran: Set minimum edit count for skipcaptcha right to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) [12:07:03] (03CR) 10STran: Set minimum edit count for skipcaptcha right to 10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [12:08:11] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 18): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8585/c" [puppet] - 10https://gerrit.wikimedia.org/r/1294260 (https://phabricator.wikimedia.org/T427189) (owner: 10Majavah) [12:09:38] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [12:09:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1193: Migration of db1193.eqiad.wmnet completed [12:09:49] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:09:57] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: rdb201[34] implementation tracking - https://phabricator.wikimedia.org/T418924#11959280 (10jijiki) 05Stalled→03In progress a:05Clement_Goubert→03jijiki [12:10:24] cmooney@cumin1003 update-extras (PID 1314811) is awaiting input [12:11:52] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db2189: Test [12:12:09] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db2189: Test [12:13:01] (03CR) 10Dreamy Jazz: Set minimum edit count for skipcaptcha right to 10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [12:14:19] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: rdb201[34] 
implementation tracking - https://phabricator.wikimedia.org/T418924#11959306 (10jijiki) [12:14:33] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1078 to eqiad - jclark@cumin1003" [12:14:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1078 to eqiad - jclark@cumin1003" [12:14:39] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:14:41] (03CR) 10Hnowlan: [C:03+2] prometheus: add deployment label to appservers RED recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1293080 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [12:14:46] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2165: Migration of db2165.codfw.wmnet completed [12:14:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:15:00] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [12:18:39] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:18:59] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1192: Upgrading db1192.eqiad.wmnet [12:19:33] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1078 to eqiad - jclark@cumin1003" [12:19:38] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1192: Upgrading db1192.eqiad.wmnet [12:19:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudvirt1078 to eqiad - jclark@cumin1003" [12:19:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:19:53] !log jclark@cumin1003 START - Cookbook 
sre.network.configure-switch-interfaces for host cloudvirt1078 [12:19:56] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1079 [12:19:59] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:20:10] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2164: Upgrading db2164.codfw.wmnet [12:20:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1079 [12:20:19] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1078 [12:20:24] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1080 [12:20:25] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 36692 [12:20:27] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2164: Upgrading db2164.codfw.wmnet [12:20:32] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1077 [12:20:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1080 [12:20:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1077 [12:21:07] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1192.eqiad.wmnet with OS trixie [12:21:33] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11959318 (10Jclark-ctr) >>! In T425088#11924352, @fgiunchedi wrote: > @Jclark-ctr once T426180 is resolved and hosts can be reimaged, please rack as follows > > 1077 -> `C8` > 1078 -> `D5` > 1... 
[12:21:34] (03PS3) 10STran: Set minimum edit count for skipcaptcha right to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) [12:21:45] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 36692 [12:21:53] (03CR) 10STran: Set minimum edit count for skipcaptcha right to 10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [12:22:05] (03PS1) 10Dpogorzelski: fix: ml changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1294268 (https://phabricator.wikimedia.org/T419722) [12:22:29] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2164.codfw.wmnet with OS trixie [12:23:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11959334 (10Jclark-ctr) I have updated server names, switchports and provisioned servers. pending puppet being updated @BTu... 
[12:24:01] (03CR) 10Dreamy Jazz: [C:03+1] "Beyond the open question about throttle exempted IPs, this LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294243 (https://phabricator.wikimedia.org/T426973) (owner: 10STran) [12:28:53] !log deleting binlogs older than a year [12:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:18] (03PS1) 10Effie Mouzeli: aliases: swap rdb2007 with rdb2011 [puppet] - 10https://gerrit.wikimedia.org/r/1294270 [12:32:18] (03PS1) 10Effie Mouzeli: site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) [12:32:26] (03PS1) 10Muehlenhoff: Blocklist more unused network protocols [puppet] - 10https://gerrit.wikimedia.org/r/1294272 [12:35:16] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage [12:35:29] (03PS2) 10Atsuko: httpd-cas: config option to disable httpd-cas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294257 (https://phabricator.wikimedia.org/T348763) [12:37:19] (03CR) 10Elukey: [C:03+2] Set pki-root1001 to role insetup [puppet] - 10https://gerrit.wikimedia.org/r/1294179 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [12:37:36] (03PS1) 10Marostegui: installserver: Add pc1024 to UEFI array. 
[puppet] - 10https://gerrit.wikimedia.org/r/1294273 [12:37:37] (03CR) 10Atsuko: httpd-cas: config option to disable httpd-cas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294257 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [12:38:05] (03PS2) 10Effie Mouzeli: site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) [12:40:23] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2164.codfw.wmnet with reason: host reimage [12:40:29] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:40:37] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: host reimage [12:41:08] (03CR) 10Marostegui: [C:03+2] installserver: Add pc1024 to UEFI array. 
[puppet] - 10https://gerrit.wikimedia.org/r/1294273 (owner: 10Marostegui) [12:43:07] (03PS1) 10Effie Mouzeli: ratelimit: replace rdb2009 with rdb2013 #1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294274 (https://phabricator.wikimedia.org/T418924) [12:43:09] (03PS1) 10Effie Mouzeli: radioscope: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294275 (https://phabricator.wikimedia.org/T418924) [12:44:13] (03PS1) 10Effie Mouzeli: rest-gateway: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294276 (https://phabricator.wikimedia.org/T418924) [12:44:14] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:44:20] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2164.codfw.wmnet with reason: host reimage [12:45:19] (03CR) 10Ayounsi: [C:03+1] Blocklist more unused network protocols [puppet] - 10https://gerrit.wikimedia.org/r/1294272 (owner: 10Muehlenhoff) [12:45:34] (03PS1) 10Effie Mouzeli: changeprop: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294277 (https://phabricator.wikimedia.org/T418924) [12:45:36] (03PS1) 10Effie Mouzeli: changeprop-jobqueue: replace rdb2009 with rdb2013 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294278 (https://phabricator.wikimedia.org/T418924) [12:48:32] (03PS1) 10Effie Mouzeli: docker_registry: replace rdb2009 with rdb2013 [puppet] - 10https://gerrit.wikimedia.org/r/1294279 (https://phabricator.wikimedia.org/T418924) [12:49:00] (03PS2) 10Effie Mouzeli: radioscope: replace rdb2009 with rdb2013 #2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294275 (https://phabricator.wikimedia.org/T418924) [12:49:13] (03PS2) 10Effie 
Mouzeli: rest-gateway: replace rdb2009 with rdb2013 #3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294276 (https://phabricator.wikimedia.org/T418924) [12:52:26] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.5 point update - https://phabricator.wikimedia.org/T427072#11959440 (10MoritzMuehlenhoff) [12:52:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1294270 (owner: 10Effie Mouzeli) [12:56:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/1/5 (Transport: cr2-codfw:et-0/1/4 (Lumen, 449169461) {#changeme_lumen_patch}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:57:21] 10ops-codfw, 06DBA, 06DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11959448 (10FCeratto-WMF) [12:57:59] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1192.eqiad.wmnet with OS trixie [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T1300). [13:00:05] aude, phuedx, mfossati, Tran, and matthiasmullie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:08] 10ops-codfw, 06DBA, 06DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11959454 (10FCeratto-WMF) There are no events in `getsel` after `06/13/2025 14:24:15` [13:00:10] o/ [13:00:13] o/ [13:00:18] o/ [13:00:21] I can’t deploy, sorry – in a meeting [13:00:40] o/ [13:01:31] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2164.codfw.wmnet with OS trixie [13:01:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:02:04] While there is no deployer - mind if I get started with mfossati & my patches first? [13:02:15] (03CR) 10Atsuko: [V:03+2 C:03+2] image: Flink 2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293664 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [13:02:46] (03CR) 10JavierMonton: [C:03+1] flink-app - default to setting metrics.internal.query-service.port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268071 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [13:03:00] might as well unless aude or phuedx are here? [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:51] matthiasmullie: sounds good to me [13:04:14] I have started [13:04:15] I'm here but I'm happy to go second [13:04:17] phuedx Tran doe you need help backporting your patches? 
[13:04:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290781 (https://phabricator.wikimedia.org/T426960) (owner: 10Krinkle) [13:04:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294264 (https://phabricator.wikimedia.org/T426225) (owner: 10Matthias Mullie) [13:04:27] I can backport my own patch after phuedx [13:04:34] Nope. I can self service w/ SpiderPig [13:04:49] Sweet. I'll ping you when I'm done! [13:05:37] mfossati: I'm pushing both our patches at the same time in the interest of time [13:05:49] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1192: Migration of db1192.eqiad.wmnet completed [13:05:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 99 days, 0:00:00 on db2212.codfw.wmnet with reason: failed to reboot T427388 T426633 [13:06:01] T427388: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388 [13:06:19] mfossati: thx for deploying that! [13:06:24] matthiasmullie: sure! [13:06:28] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. 
[13:07:38] (03Merged) 10jenkins-bot: mmv: Fix missing or stale arrow and counter controls [extensions/MultimediaViewer] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290781 (https://phabricator.wikimedia.org/T426960) (owner: 10Krinkle) [13:07:41] (03Merged) 10jenkins-bot: MMV Carousel: Restore click-to-open for carousel thumbnails [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294264 (https://phabricator.wikimedia.org/T426225) (owner: 10Matthias Mullie) [13:07:42] Krinkle: that made some waves :-) [13:08:08] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1290781|mmv: Fix missing or stale arrow and counter controls (T426960)]], [[gerrit:1294264|MMV Carousel: Restore click-to-open for carousel thumbnails (T426225)]] [13:08:14] T426960: Mediaviewer missing left/right arrows and X/Y counter is out of sync - https://phabricator.wikimedia.org/T426960 [13:08:14] T426225: Image Browsing: beta feature for rollout - https://phabricator.wikimedia.org/T426225 [13:10:07] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2164: Migration of db2164.codfw.wmnet completed [13:10:45] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:11:11] (03CR) 10JMeybohm: [C:03+1] "Thanks!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1294268 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [13:12:46] (03PS1) 10Muehlenhoff: Retire the Ubuntu mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294284 (https://phabricator.wikimedia.org/T416707) [13:13:13] !log mlitn@deploy1003 krinkle, mlitn: Backport for [[gerrit:1290781|mmv: Fix missing or stale arrow and counter controls (T426960)]], [[gerrit:1294264|MMV Carousel: Restore click-to-open for carousel thumbnails (T426225)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:13:19] T426960: Mediaviewer missing left/right arrows and X/Y counter is out of sync - https://phabricator.wikimedia.org/T426960 [13:13:20] T426225: Image Browsing: beta feature for rollout - https://phabricator.wikimedia.org/T426225 [13:13:38] mfossati: it's on test servers - please check and confirm we're good to move forward! [13:13:46] on it [13:14:04] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393 (10Papaul) 03NEW [13:14:24] (03CR) 10Dpogorzelski: [C:03+2] fix: ml changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1294268 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [13:14:28] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] fix: ml changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1294268 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [13:14:51] (03PS1) 10Muehlenhoff: autoinstall: Stop using mirrors.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1294285 (https://phabricator.wikimedia.org/T416707) [13:15:18] matthiasmullie: I couldn't quickly reproduce "zoom out a bit. this causes the left and right arrows to temporarily appear". 
All other bugs look fixed [13:15:27] (03CR) 10Blake: site.pp: add rdb2013 and rdb2014 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [13:15:36] ok, moving forward [13:15:38] !log mlitn@deploy1003 krinkle, mlitn: Continuing with deployment [13:15:40] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db2189: Test [13:15:56] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2189: Test [13:16:51] (03CR) 10Bking: [C:03+2] relforge: remove logstash (gelf) profile [puppet] - 10https://gerrit.wikimedia.org/r/1293809 (https://phabricator.wikimedia.org/T324335) (owner: 10Bking) [13:16:53] (03PS2) 10Federico Ceratto: sre.mysql.pool: Support depooling unreachable hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1294265 (https://phabricator.wikimedia.org/T427381) [13:18:20] (03PS3) 10Atsuko: httpd-cas: config option to disable httpd-cas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294257 (https://phabricator.wikimedia.org/T348763) [13:18:47] (03PS1) 10Atsuko: eventstreams: new vendor modules check-in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294283 (https://phabricator.wikimedia.org/T348763) [13:19:16] (03PS6) 10Atsuko: eventstreams: upgrade chart to ingress and idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289357 (https://phabricator.wikimedia.org/T348763) [13:19:28] (03PS5) 10Atsuko: eventstreams: copy eventstreams-internal to dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289979 (https://phabricator.wikimedia.org/T348763) [13:19:39] (03CR) 10Atsuko: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289979 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [13:19:50] (03CR) 10Slyngshede: [V:03+1 C:03+2] R:cache::upload enable TCP Fast Open [puppet] - 10https://gerrit.wikimedia.org/r/1290678 (https://phabricator.wikimedia.org/T415454) (owner: 10Slyngshede) 
[13:20:40] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-worker23[57-74] implementation tracking - https://phabricator.wikimedia.org/T418927#11959538 (10Blake) if i'm ever looking at this task for history, the docs are [[ https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes |... [13:20:47] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-worker23[57-74] implementation tracking - https://phabricator.wikimedia.org/T418927#11959539 (10Blake) 05In progress→03Resolved [13:21:31] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290781|mmv: Fix missing or stale arrow and counter controls (T426960)]], [[gerrit:1294264|MMV Carousel: Restore click-to-open for carousel thumbnails (T426225)]] (duration: 13m 23s) [13:21:35] phuedx - I'm done, over to you! (and thanks for letting me cut in front!) [13:21:39] T426960: Mediaviewer missing left/right arrows and X/Y counter is out of sync - https://phabricator.wikimedia.org/T426960 [13:21:39] T426225: Image Browsing: beta feature for rollout - https://phabricator.wikimedia.org/T426225 [13:21:51] mfossati: done! 
[13:21:52] 👍 [13:21:59] (03CR) 10Brouberol: [C:03+2] Revert "idp/idp_test: temporarily rollback growthbook(-next) access to nda/wmf" [puppet] - 10https://gerrit.wikimedia.org/r/1293585 (owner: 10Brouberol) [13:22:04] thanks matthiasmullie [13:22:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294217 (https://phabricator.wikimedia.org/T427092) (owner: 10Phuedx) [13:25:56] (03CR) 10Atsuko: [V:03+2 C:03+2] "Acknowledged" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1293664 (https://phabricator.wikimedia.org/T412978) (owner: 10JavierMonton) [13:27:52] (03PS1) 10Brouberol: idp_test: remove deprecated growthbook client_secret [labs/private] - 10https://gerrit.wikimedia.org/r/1294289 [13:28:02] (03CR) 10Brouberol: [C:03+2] idp_test: remove deprecated growthbook client_secret [labs/private] - 10https://gerrit.wikimedia.org/r/1294289 (owner: 10Brouberol) [13:28:07] (03CR) 10Brouberol: [V:03+2 C:03+2] idp_test: remove deprecated growthbook client_secret [labs/private] - 10https://gerrit.wikimedia.org/r/1294289 (owner: 10Brouberol) [13:28:22] (03Merged) 10jenkins-bot: ext.wikimediaEvents: Add hoisting error detection test [extensions/WikimediaEvents] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294217 (https://phabricator.wikimedia.org/T427092) (owner: 10Phuedx) [13:28:50] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1294217|ext.wikimediaEvents: Add hoisting error detection test (T427092)]] [13:28:55] T427092: Run and synthetic A/A test that captures UA to investigate hoisting errors - https://phabricator.wikimedia.org/T427092 [13:30:45] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1294217|ext.wikimediaEvents: Add hoisting error detection test (T427092)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:31:20] (03CR) 10Mforns: "Code looks good to me! :-)" [alerts] - 10https://gerrit.wikimedia.org/r/1294113 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton) [13:31:26] sorry i am late for the backport [13:31:30] window [13:31:59] i can do mine whenever everything else is done [13:33:25] FIRING: [5x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:36:08] No errors in the console on a couple of different sites 👍 [13:36:13] !log phuedx@deploy1003 phuedx: Continuing with deployment [13:38:31] FIRING: [13x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:15] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#11959614 (10MoritzMuehlenhoff) [13:40:25] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294217|ext.wikimediaEvents: Add hoisting error detection test (T427092)]] (duration: 11m 35s) [13:40:31] T427092: Run and synthetic A/A test that captures UA to investigate hoisting errors - https://phabricator.wikimedia.org/T427092 [13:40:35] Tran: Over to you [13:40:49] 👍 thanks, starting mine [13:41:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294247 (https://phabricator.wikimedia.org/T427358) (owner: 10STran) [13:41:27] (03CR) 10Mforns: [C:03+2] html-enrichment: relax offset lag monitors [alerts] - 10https://gerrit.wikimedia.org/r/1294113 
(https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton) [13:43:03] (03Merged) 10jenkins-bot: html-enrichment: relax offset lag monitors [alerts] - 10https://gerrit.wikimedia.org/r/1294113 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton) [13:43:25] FIRING: [22x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:03] (03Merged) 10jenkins-bot: Update Direct Reporting email [extensions/ReportIncident] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294247 (https://phabricator.wikimedia.org/T427358) (owner: 10STran) [13:45:30] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1294247|Update Direct Reporting email (T427358)]] [13:45:35] T427358: Make direct reporting email subject lines unique enough to avoid VRT ticket threading - https://phabricator.wikimedia.org/T427358 [13:46:26] (03PS1) 10Elukey: profile::kafka::broker: add ACLs in a file [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) [13:46:55] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [13:48:37] (03CR) 10Filippo Giunchedi: [C:03+1] memcached: Improve absenting support [puppet] - 10https://gerrit.wikimedia.org/r/1294259 (https://phabricator.wikimedia.org/T427189) (owner: 10Majavah) [13:51:18] (03CR) 10Filippo Giunchedi: "modules/profile/manifests/prometheus/ops.pp will need adjusting to pick up the class/define switch" [puppet] - 10https://gerrit.wikimedia.org/r/1294260 (https://phabricator.wikimedia.org/T427189) (owner: 10Majavah) [13:51:18] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1192: Migration of 
db1192.eqiad.wmnet completed [13:51:19] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [13:51:47] (03CR) 10Filippo Giunchedi: "actually not true, nevermind" [puppet] - 10https://gerrit.wikimedia.org/r/1294260 (https://phabricator.wikimedia.org/T427189) (owner: 10Majavah) [13:52:17] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:52:24] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:52:51] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: memcached_exporter: Improve absentability [puppet] - 10https://gerrit.wikimedia.org/r/1294260 (https://phabricator.wikimedia.org/T427189) (owner: 10Majavah) [13:53:19] (03CR) 10Filippo Giunchedi: [C:03+1] P:openstack: cloudweb: Absent memcached and mcrouter services [puppet] - 10https://gerrit.wikimedia.org/r/1294261 (https://phabricator.wikimedia.org/T427189) (owner: 10Majavah) [13:53:25] FIRING: [24x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:35] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2164: Migration of db2164.codfw.wmnet completed [13:55:36] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [13:56:17] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum5003.eqsin.wmnet with OS trixie [13:57:21] (03CR) 10Kamila Součková: [C:03+2] Remove k8s version from all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [13:58:07] (03PS2) 10Elukey: profile::kafka::broker: add ACLs in a file 
[puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) [13:58:16] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [13:58:25] FIRING: [23x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:59:43] (03CR) 10Kamila Součková: [C:03+2] ""Yes, that too" - there were two problems, the other one being a chart that isn't rendering (but to my surprise CI doesn't seem to mind). " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293757 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T1400) [14:00:40] (03PS1) 10Bking: relforge: Fix cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1294298 (https://phabricator.wikimedia.org/T427306) [14:00:57] (03Merged) 10jenkins-bot: Remove k8s version from all services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1273967 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:02:57] !log stran@deploy1003 stran: Backport for [[gerrit:1294247|Update Direct Reporting email (T427358)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:02:59] (03PS3) 10Elukey: profile::kafka::broker: add ACLs in a file [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) [14:03:03] T427358: Make direct reporting email subject lines unique enough to avoid VRT ticket threading - https://phabricator.wikimedia.org/T427358 [14:03:25] FIRING: [25x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:00] testing now [14:05:54] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [14:05:58] looks good, continuing [14:06:03] !log stran@deploy1003 stran: Continuing with deployment [14:06:06] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:06:15] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:06:27] I am ready to deploy mine when you are done (no hurry) [14:06:57] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:07:08] sorry mine had to do a scap rebuild but I'll ping you when mine's done. 
[14:07:18] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1178: Upgrading db1178.eqiad.wmnet [14:07:57] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1178: Upgrading db1178.eqiad.wmnet [14:08:06] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [14:08:17] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2163: Upgrading db2163.codfw.wmnet [14:08:36] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2163: Upgrading db2163.codfw.wmnet [14:09:14] sounds good [14:09:54] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1178.eqiad.wmnet with OS trixie [14:10:54] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2163.codfw.wmnet with OS trixie [14:13:58] (03PS4) 10Jelto: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) [14:14:39] (03CR) 10Effie Mouzeli: site.pp: add rdb2013 and rdb2014 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [14:14:53] (03CR) 10Jelto: miscweb: remove wmf-navigator public and private config from web container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294208 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:15:38] (03PS3) 10Effie Mouzeli: site.pp: add rdb2013 and rdb2014 [puppet] - 10https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) [14:17:47] (03CR) 10Bking: [C:03+2] relforge: Fix cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1294298 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [14:18:31] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294247|Update Direct Reporting email (T427358)]] (duration: 33m 01s) [14:18:37] T427358: Make direct 
reporting email subject lines unique enough to avoid VRT ticket threading - https://phabricator.wikimedia.org/T427358 [14:19:13] aude: I'm done [14:19:26] thank you! [14:19:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [extensions/QuickSurveys] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290924 (https://phabricator.wikimedia.org/T426457) (owner: 10Aude) [14:20:06] mine is the QuickSurveys change (wmf3 only) and then a config change [14:21:27] (03CR) 10Brouberol: [C:03+1] httpd-cas: config option to disable httpd-cas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294257 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:22:19] (03CR) 10Brouberol: [C:03+1] eventstreams: new vendor modules check-in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294283 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:22:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1033.eqiad.wmnet with reason: Maintenance [14:22:42] (03CR) 10Brouberol: [C:03+1] eventstreams: upgrade chart to ingress and idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289357 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:22:53] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1178.eqiad.wmnet with reason: host reimage [14:23:24] (03Merged) 10jenkins-bot: Make logging of title and page ID optional [extensions/QuickSurveys] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1290924 (https://phabricator.wikimedia.org/T426457) (owner: 10Aude) [14:23:34] (03CR) 10Brouberol: eventstreams: copy eventstreams-internal to dse (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289979 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:23:55] !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1290924|Make logging of title and page ID optional (T426457)]] 
[14:23:59] T426457: QuickSurveys: Make it possible to run surveys without capturing page title - https://phabricator.wikimedia.org/T426457 [14:25:53] (03Merged) 10jenkins-bot: CI: Fix race condition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1293757 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:26:17] (03PS1) 10Bking: relforge: update list of mandatory plugins [puppet] - 10https://gerrit.wikimedia.org/r/1294302 (https://phabricator.wikimedia.org/T427306) [14:26:25] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:26:31] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:26:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [14:26:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1179 (T426633)', diff saved to https://phabricator.wikimedia.org/P93260 and previous config saved to /var/cache/conftool/dbconfig/20260527-142659-fceratto.json [14:27:08] (03PS2) 10Bking: relforge: update list of mandatory plugins [puppet] - 10https://gerrit.wikimedia.org/r/1294302 (https://phabricator.wikimedia.org/T427306) [14:27:39] !log aude@deploy1003 aude: Backport for [[gerrit:1290924|Make logging of title and page ID optional (T426457)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:28:10] (03CR) 10Bking: [C:03+2] relforge: update list of mandatory plugins [puppet] - 10https://gerrit.wikimedia.org/r/1294302 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [14:28:15] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: host reimage [14:29:24] !log aude@deploy1003 aude: Continuing with deployment [14:29:27] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2163.codfw.wmnet with reason: host reimage [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T1430) [14:30:30] (03PS4) 10CWilliams: sre.mysql.pool: Add support for downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) [14:30:45] (03PS1) 10Muehlenhoff: mirrors: Disable tails mirror [puppet] - 10https://gerrit.wikimedia.org/r/1294306 (https://phabricator.wikimedia.org/T416707) [14:33:33] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: PuppetFailure (instance an-test-client1002:9100) - https://phabricator.wikimedia.org/T427399 (10LSobanski) 03NEW [14:33:45] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2163.codfw.wmnet with reason: host reimage [14:34:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T426633)', diff saved to https://phabricator.wikimedia.org/P93262 and previous config saved to /var/cache/conftool/dbconfig/20260527-143416-fceratto.json [14:34:20] RECOVERY - Host db2189 #page is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [14:34:20] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2189 crashed - https://phabricator.wikimedia.org/T427376#11959812 (10Jhancock.wm) working on it. 
might reboot a few times. [14:34:21] (03CR) 10Elukey: profile::kafka::broker: add ACLs in a file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [14:34:32] PROBLEM - MariaDB Replica SQL: s2 #page on db2189 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [14:34:32] PROBLEM - MariaDB Events s2 on db2189 is CRITICAL: CRITICAL - Failed to query events: ERROR 2002 (HY000): Cant connect to local server through socket /run/mysqld/mysqld.sock (2) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [14:34:32] PROBLEM - mysqld processes on db2189 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:34:33] PROBLEM - MariaDB Replica IO: s2 #page on db2189 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [14:34:34] PROBLEM - MariaDB Replica Lag: s2 #page on db2189 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [14:34:34] PROBLEM - MariaDB Event Scheduler s2 on db2189 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [14:34:45] !ack [14:34:45] 8024 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [14:34:46] 8025 (ACKED) db2189 (paged)/MariaDB Replica IO: s2 (paged) [14:34:46] 8026 (ACKED) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [14:34:58] looking [14:35:03] PROBLEM - MariaDB read only s2 on db2189 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:35:11] !incidents [14:35:12] 8024 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [14:35:12] 8025 (ACKED) 
db2189 (paged)/MariaDB Replica IO: s2 (paged) [14:35:12] 8026 (ACKED) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [14:35:12] 8023 (RESOLVED) Host db2189 (paged) [14:35:21] thanks federico3 [14:35:25] !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290924|Make logging of title and page ID optional (T426457)]] (duration: 11m 30s) [14:35:27] indeed it came up just now [14:35:29] server rebooted [14:35:30] T426457: QuickSurveys: Make it possible to run surveys without capturing page title - https://phabricator.wikimedia.org/T426457 [14:35:50] https://phabricator.wikimedia.org/T427376#11959812 (Jhancock.wm) working on it. might reboot a few times. [14:36:02] noticed this in the backlog [14:36:15] nothing in system event log [14:36:17] ah [14:36:31] RECOVERY - mysqld processes on db2189 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:36:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290926 (https://phabricator.wikimedia.org/T426781) (owner: 10Aude) [14:37:14] right, Manuel actually also mentioned it in the earlier handoff [14:37:46] (03Merged) 10jenkins-bot: Re-enable ReadingLists QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290926 (https://phabricator.wikimedia.org/T426781) (owner: 10Aude) [14:38:11] !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1290926|Re-enable ReadingLists QuickSurvey (T426781)]] [14:38:16] T426781: Re-enable ReadingLists QuickSurvey - https://phabricator.wikimedia.org/T426781 [14:38:25] FIRING: [24x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:57] !log 
fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 99 days, 0:00:00 on db2189.codfw.wmnet with reason: crashed T427376 [14:39:01] T427376: db2189 crashed - https://phabricator.wikimedia.org/T427376 [14:39:05] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2189 crashed - https://phabricator.wikimedia.org/T427376#11959868 (10FCeratto-WMF) (added a long downtime just in case) [14:40:03] !log aude@deploy1003 aude: Backport for [[gerrit:1290926|Re-enable ReadingLists QuickSurvey (T426781)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:40:14] (03PS1) 10Bking: relforge: disable the disabling of security plugin [puppet] - 10https://gerrit.wikimedia.org/r/1294308 (https://phabricator.wikimedia.org/T427306) [14:40:47] (03CR) 10Scott French: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1293789 (https://phabricator.wikimedia.org/T427312) (owner: 10Scott French) [14:40:50] (03CR) 10Scott French: [C:03+2] aptrepo: add component/php83 to bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1293789 (https://phabricator.wikimedia.org/T427312) (owner: 10Scott French) [14:42:29] (03CR) 10Scott French: [C:03+2] package_builder: Use @distribution in the D04php hook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1293790 (https://phabricator.wikimedia.org/T427312) (owner: 10Scott French) [14:42:34] !log aude@deploy1003 aude: Continuing with deployment [14:43:25] FIRING: [24x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:55] (03CR) 10Atsuko: eventstreams: copy eventstreams-internal to dse (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289979 
(https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:44:10] (03CR) 10Atsuko: [C:03+2] httpd-cas: config option to disable httpd-cas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294257 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:44:17] (03CR) 10Atsuko: [C:03+2] eventstreams: new vendor modules check-in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294283 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:44:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P93263 and previous config saved to /var/cache/conftool/dbconfig/20260527-144423-fceratto.json [14:44:24] (03CR) 10Atsuko: [C:03+2] eventstreams: upgrade chart to ingress and idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289357 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:44:31] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1178.eqiad.wmnet with OS trixie [14:44:42] (03CR) 10Bking: [C:03+2] relforge: disable the disabling of security plugin [puppet] - 10https://gerrit.wikimedia.org/r/1294308 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [14:44:45] (03CR) 10Brouberol: [C:03+1] eventstreams: copy eventstreams-internal to dse (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289979 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:45:14] (03PS4) 10Elukey: profile::kafka::broker: add ACLs in a file [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) [14:45:16] (03CR) 10BBlack: "Maybe this is better discussed either back in the phab task or in some other forum, because things get complicated. 
But to address the th" [puppet] - 10https://gerrit.wikimedia.org/r/1282428 (https://phabricator.wikimedia.org/T425441) (owner: 10Dzahn) [14:46:03] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [14:46:13] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1294219 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [14:46:24] (03Merged) 10jenkins-bot: httpd-cas: config option to disable httpd-cas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294257 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:46:43] !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290926|Re-enable ReadingLists QuickSurvey (T426781)]] (duration: 08m 32s) [14:46:48] T426781: Re-enable ReadingLists QuickSurvey - https://phabricator.wikimedia.org/T426781 [14:46:57] (03PS5) 10CWilliams: sre.mysql.pool: Add support for downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) [14:47:00] (03CR) 10Elukey: profile::kafka::broker: add ACLs in a file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1294294 (https://phabricator.wikimedia.org/T425528) (owner: 10Elukey) [14:47:03] (03Merged) 10jenkins-bot: eventstreams: new vendor modules check-in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294283 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:47:05] (03Merged) 10jenkins-bot: eventstreams: upgrade chart to ingress and idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289357 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:50:53] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/eventstreams-internal: apply [14:51:09] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/eventstreams-internal: apply [14:51:36] !log cwilliams@cumin1003 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2163.codfw.wmnet with OS trixie [14:52:06] (03CR) 10Majavah: [C:03+2] Replace role::mariadb::ferm with profile::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1292033 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [14:53:25] FIRING: [24x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:01] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [14:54:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P93264 and previous config saved to /var/cache/conftool/dbconfig/20260527-145430-fceratto.json [14:55:42] (03PS6) 10Atsuko: eventstreams: copy eventstreams-internal to dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289979 (https://phabricator.wikimedia.org/T348763) [14:55:48] FIRING: PuppetFailure: Puppet has failed on relforge1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:56:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9400.service crashloop on relforge1008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:58:02] (03CR) 10JavierMonton: stream: webrequest.page_view (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) (owner: 10JavierMonton) [14:58:02] (03CR) 10Atsuko: [C:03+2] eventstreams: copy eventstreams-internal to dse [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1289979 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [14:58:07] (03PS1) 10Effie Mouzeli: ratelimite: update homepage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294314 (https://phabricator.wikimedia.org/T426951) [14:58:10] (03PS1) 10Trueg: Add wdqs namespace for the new deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294315 (https://phabricator.wikimedia.org/T425007) [14:58:25] FIRING: [15x] SystemdUnitFailed: opensearch_2@relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:46] (03PS20) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [14:59:47] (03PS20) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [14:59:49] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2163: Migration of db2163.codfw.wmnet completed [15:00:04] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [15:00:09] (03CR) 10Clément Goubert: [C:03+1] ratelimite: update homepage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294314 (https://phabricator.wikimedia.org/T426951) (owner: 10Effie Mouzeli) [15:00:10] (03PS1) 10Brouberol: data-platform: add alert on growthbook seat usage [alerts] - 10https://gerrit.wikimedia.org/r/1294316 (https://phabricator.wikimedia.org/T420694) [15:00:20] (03Merged) 10jenkins-bot: eventstreams: copy eventstreams-internal to dse [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289979 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [15:00:30] 
(03PS1) 10JavierMonton: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294318 (https://phabricator.wikimedia.org/T425624) [15:02:00] (03CR) 10CI reject: [V:04-1] data-platform: add alert on growthbook seat usage [alerts] - 10https://gerrit.wikimedia.org/r/1294316 (https://phabricator.wikimedia.org/T420694) (owner: 10Brouberol) [15:03:19] (03PS2) 10Brouberol: data-platform: add alert on growthbook seat usage [alerts] - 10https://gerrit.wikimedia.org/r/1294316 (https://phabricator.wikimedia.org/T420694) [15:03:25] FIRING: [15x] SystemdUnitFailed: opensearch_2@relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T426633)', diff saved to https://phabricator.wikimedia.org/P93267 and previous config saved to /var/cache/conftool/dbconfig/20260527-150438-fceratto.json [15:05:01] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1220.eqiad.wmnet with reason: Maintenance [15:05:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1220 (T426633)', diff saved to https://phabricator.wikimedia.org/P93268 and previous config saved to /var/cache/conftool/dbconfig/20260527-150508-fceratto.json [15:05:48] RESOLVED: PuppetFailure: Puppet has failed on relforge1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:06:26] (03CR) 10TChin: [C:03+1] stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294318 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [15:07:04] (03CR) 10JavierMonton: [C:03+2] stream: webrequest-page-view 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1294318 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [15:08:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:08:09] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:08:29] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:08:33] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:09:05] !log 💔cdanis@apt1002.wikimedia.org ~ 🕚☕ sudo -i reprepro --component main --restrict cidergrinder update trixie-wikimedia [15:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:17] (03Merged) 10jenkins-bot: stream: webrequest-page-view [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294318 (https://phabricator.wikimedia.org/T425624) (owner: 10JavierMonton) [15:09:38] (03PS1) 10Jdlrobson: Thumbnails are not being optimized in large mode [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294322 (https://phabricator.wikimedia.org/T427237) [15:09:55] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:10:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:10:27] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:10:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:10:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:10:59] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:11:05] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Icinga wait failed during run [15:11:14] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:11:34] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:11:58] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:12:18] FIRING: [2x] PuppetFailure: Puppet has failed on relforge1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:12:23] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:12:41] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:13:20] !log 💙cdanis@cp5026.eqsin.wmnet ~ 🕚☕ sudo apt install lua5.4-ciderbloom lua5.4-ciderbloom-dbgsym [15:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9400.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:14:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL 
- wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:15:11] (CR) Majavah: [V:+1 C:+2] memcached: Improve absenting support [puppet] - https://gerrit.wikimedia.org/r/1294259 (https://phabricator.wikimedia.org/T427189) (owner: Majavah) [15:16:19] (PS3) JavierMonton: stream: webrequest.page_view [mediawiki-config] - https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) [15:16:44] (CR) Majavah: [V:+1 C:+2] prometheus: memcached_exporter: Improve absentability [puppet] - https://gerrit.wikimedia.org/r/1294260 (https://phabricator.wikimedia.org/T427189) (owner: Majavah) [15:16:58] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:16:58] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:17:52] (CR) Majavah: [V:+1 C:+2] P:openstack: cloudweb: Absent memcached and mcrouter services [puppet] - https://gerrit.wikimedia.org/r/1294261 (https://phabricator.wikimedia.org/T427189) (owner: Majavah) [15:18:03] (CR) Hashar: jenkins: add firewall rule for new jenkins to gearman on legacy host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1275537 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn) [15:19:23] !log 💙cdanis@cp4047.ulsfo.wmnet ~ 🕦☕ sudo apt install lua5.4-ciderbloom lua5.4-ciderbloom-dbgsym [15:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:53] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum5003.eqsin.wmnet with OS trixie [15:19:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers
wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:19:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:20:58] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:21:58] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:22:14] (CR) Federico Ceratto: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: CWilliams) [15:22:18] RESOLVED: [2x] PuppetFailure: Puppet has failed on relforge1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:22:34] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1178.eqiad.wmnet [15:22:35] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1178.eqiad.wmnet [15:23:01] ops-codfw, SRE, DBA, DC-Ops: db2189 crashed - https://phabricator.wikimedia.org/T427376#11960059 (Jhancock.wm) @FCeratto-WMF okay the error code we got was inconclusive. it could mean a lot of things including just out of date firmware. I've updated the bios and the idrac. I do see a cpu machine...
[15:23:11] ops-codfw, SRE, DBA, DC-Ops: db2189 crashed - https://phabricator.wikimedia.org/T427376#11960062 (Jhancock.wm) a:Jhancock.wm [15:23:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9400.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:24:07] (CR) Bking: [C:+1] data-platform: add alert on growthbook seat usage [alerts] - https://gerrit.wikimedia.org/r/1294316 (https://phabricator.wikimedia.org/T420694) (owner: Brouberol) [15:24:41] (PS2) Herron: alertmanager: add ml-task webhook [puppet] - https://gerrit.wikimedia.org/r/1294323 [15:24:57] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1178: Recovering from failure in cookbook [15:24:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:24:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2008.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:25:12] PROBLEM - Memcached on cloudweb2002-dev is CRITICAL: connect to address 208.80.153.41 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [15:25:58] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:25:58] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:26:00] (CR) TChin: [C:+1]
stream: webrequest.page_view [mediawiki-config] - https://gerrit.wikimedia.org/r/1290687 (https://phabricator.wikimedia.org/T426092) (owner: JavierMonton) [15:26:47] (CR) CI reject: [V:-1] Thumbnails are not being optimized in large mode [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294322 (https://phabricator.wikimedia.org/T427237) (owner: Jdlrobson) [15:28:19] (CR) Brouberol: [C:+2] data-platform: add alert on growthbook seat usage [alerts] - https://gerrit.wikimedia.org/r/1294316 (https://phabricator.wikimedia.org/T420694) (owner: Brouberol) [15:28:49] (PS1) Atsuko: Provision stream-internal.w.o [dns] - https://gerrit.wikimedia.org/r/1294326 (https://phabricator.wikimedia.org/T348763) [15:29:14] (CR) CI reject: [V:-1] Provision stream-internal.w.o [dns] - https://gerrit.wikimedia.org/r/1294326 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko) [15:30:11] (PS1) Atsuko: trafficserver: enable stream-internal.w.o [puppet] - https://gerrit.wikimedia.org/r/1294327 (https://phabricator.wikimedia.org/T348763) [15:30:43] (CR) CI reject: [V:-1] trafficserver: enable stream-internal.w.o [puppet] - https://gerrit.wikimedia.org/r/1294327 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko) [15:30:49] (CR) Brouberol: "recheck" [dns] - https://gerrit.wikimedia.org/r/1294326 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko) [15:32:22] !log cwilliams@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2163: Migration of db2163.codfw.wmnet completed [15:32:49] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2163: Migration of db2163.codfw.wmnet completed [15:33:00] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2163: Migration of db2163.codfw.wmnet completed [15:33:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [15:33:10] PROBLEM - Memcached on cloudweb1003 is
CRITICAL: connect to address 208.80.154.150 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [15:33:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9400.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:23] (PS2) Atsuko: trafficserver: enable stream-internal.w.o [puppet] - https://gerrit.wikimedia.org/r/1294327 (https://phabricator.wikimedia.org/T348763) [15:36:04] (CR) Atsuko: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1294327 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko) [15:37:21] (CR) Brouberol: [C:+1] Provision stream-internal.w.o [dns] - https://gerrit.wikimedia.org/r/1294326 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko) [15:37:39] (CR) Brouberol: [C:+1] trafficserver: enable stream-internal.w.o [puppet] - https://gerrit.wikimedia.org/r/1294327 (https://phabricator.wikimedia.org/T348763) (owner: Atsuko) [15:38:07] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers cloudweb1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:38:07] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers cloudweb1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:38:11] PROBLEM - Memcached on cloudweb1004 is CRITICAL: connect to address 208.80.155.117 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [15:38:25] RESOLVED: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9400.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:39:08] (PS1) Majavah: Revert "P:openstack: cloudweb: Absent memcached and mcrouter services" [puppet] - https://gerrit.wikimedia.org/r/1294333 [15:39:46] (PS2) Majavah: Revert "P:openstack: cloudweb: Absent memcached and mcrouter services" [puppet] - https://gerrit.wikimedia.org/r/1294333 [15:40:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1220 (T426633)', diff saved to https://phabricator.wikimedia.org/P93274 and previous config saved to /var/cache/conftool/dbconfig/20260527-154011-fceratto.json [15:40:22] (CR) Elukey: [C:+1] alertmanager: add ml-task webhook [puppet] - https://gerrit.wikimedia.org/r/1294323 (owner: Herron) [15:40:38] (CR) Majavah: [V:+2 C:+2] Revert "P:openstack: cloudweb: Absent memcached and mcrouter services" [puppet] - https://gerrit.wikimedia.org/r/1294333 (owner: Majavah) [15:41:06] (CR) Clément Goubert: site.pp: add rdb2013 and rdb2014 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1294271 (https://phabricator.wikimedia.org/T418924) (owner: Effie Mouzeli) [15:41:19] (CR) Herron: [C:+2] alertmanager: add ml-task webhook [puppet] - https://gerrit.wikimedia.org/r/1294323 (owner: Herron) [15:43:10] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:43:10] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:43:54] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:04] ops-codfw, DC-Ops: Power Supply - Status - issue on
aqs2011:9290 - https://phabricator.wikimedia.org/T427409 (phaultfinder) NEW [15:44:05] ops-codfw, DC-Ops: Power Supply - Status - issue on mc-gp2005:9290 - https://phabricator.wikimedia.org/T427410 (phaultfinder) NEW [15:44:06] ops-eqiad, DC-Ops: Power Supply - PS Redundancy - issue on dbproxy1024:9290 - https://phabricator.wikimedia.org/T427408 (phaultfinder) NEW [15:45:09] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:45:17] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [15:45:29] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:52] (CR) CWilliams: [C:+2] sre.mysql.pool: Add support for downtime [cookbooks] - https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: CWilliams) [15:46:41] (CR) BCornwall: [C:+2] Revert "site: Set lvs1017 to insetup_noferm" [puppet] - https://gerrit.wikimedia.org/r/1286517 (https://phabricator.wikimedia.org/T421421) (owner: BCornwall) [15:46:44] (CR) BCornwall: [C:+2] Add lvs1017 to high-traffic1 [puppet] - https://gerrit.wikimedia.org/r/1286522 (https://phabricator.wikimedia.org/T421421) (owner: BCornwall) [15:49:36] (PS5) Andrew Bogott: designate: remove leftover mcrouter code [puppet] - https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T422646) [15:50:16] ops-eqiad, SRE, collaboration-services, DC-Ops: Repurpose ganeti102[3456] for Zuul migration - https://phabricator.wikimedia.org/T427353#11960218 (Dzahn) We have already established a zuul naming pattern for existing VMs and "1-3" are in use.
Please use **zuul1004/zuul2004** and counting up from... [15:50:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1220', diff saved to https://phabricator.wikimedia.org/P93276 and previous config saved to /var/cache/conftool/dbconfig/20260527-155019-fceratto.json [15:50:51] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2163: Testing cookbook [15:51:11] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2163: Testing cookbook [15:51:18] (Merged) jenkins-bot: sre.mysql.pool: Add support for downtime [cookbooks] - https://gerrit.wikimedia.org/r/1289965 (https://phabricator.wikimedia.org/T426318) (owner: CWilliams) [15:52:19] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp6016.drmrs.wmnet,cp[1112,1114].eqiad.wmnet,cp[5024,5031-5032].eqsin.wmnet} and A:cp [15:52:23] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2163: Repooling after testing patch [15:52:49] (CR) Clément Goubert: "This and I011084cdc1fc4e850b74e28de0b5e52d5ee32175 should be done fairly close together (redioscope uses data from the rest-gateway rate l" [deployment-charts] - https://gerrit.wikimedia.org/r/1294276 (https://phabricator.wikimedia.org/T418924) (owner: Effie Mouzeli) [15:52:58] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1017.eqiad.wmnet [15:53:03] !log cwilliams@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2163.codfw.wmnet [15:53:03] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2163.codfw.wmnet [15:53:19] (CR) Clément Goubert: [C:+1] "Needs to be done fairly close to Iadd2b5525978ce8726c0ecb3aec5b484efb1b639 as redioscope uses the redis data from the rest-gateway ratelim" [deployment-charts] - https://gerrit.wikimedia.org/r/1294275 (https://phabricator.wikimedia.org/T418924) (owner: Effie Mouzeli) [15:53:38] SRE, Traffic,
Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11960227 (BCornwall) Open→In progress [15:53:40] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: re-rack mc2055 (before Jun 9th) - https://phabricator.wikimedia.org/T427373#11960232 (Jhancock.wm) @jijiki we can rerack this in A3 no problem. I will be around most days from 1700 UTC to 2100 UTC. The days that work best f... [15:55:14] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [15:59:17] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1017.eqiad.wmnet [16:00:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1220', diff saved to https://phabricator.wikimedia.org/P93280 and previous config saved to /var/cache/conftool/dbconfig/20260527-160027-fceratto.json [16:01:39] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11960266 (cmooney) That looks good to me @papaul good stuff. If we use vlan IDs 512/522 I guess the plan would be to change the vlan i...
[16:01:59] (PS1) Ottomata: EventStreams - Expose mediawiki.page_outlink_topic_prediction_change.v1 stream [deployment-charts] - https://gerrit.wikimedia.org/r/1294341 (https://phabricator.wikimedia.org/T427416) [16:02:35] SRE, Traffic, Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11960280 (BCornwall) [16:03:16] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp6016.drmrs.wmnet [16:03:30] SRE, Traffic, Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11960281 (BCornwall) [16:03:42] (PS1) Sbisson: Allow disabling experiment for experienced editors (>=100 edits) [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1294342 (https://phabricator.wikimedia.org/T426871) [16:04:17] (PS1) Sbisson: Allow disabling experiment for experienced editors (>=100 edits) [extensions/ArticleGuidance] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294343 (https://phabricator.wikimedia.org/T426871) [16:04:27] RECOVERY - Confd vcl based reload on cp6015 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [16:04:30] (CR) Hnowlan: [C:+2] tests/integration: readability improvements [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1293147 (https://phabricator.wikimedia.org/T385798) (owner: Hnowlan) [16:04:35] (CR) Ilias Sarantopoulos: [C:+1] "LGTM! Thank you for the patch!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1294341 (https://phabricator.wikimedia.org/T427416) (owner: Ottomata) [16:04:56] (CR) Ottomata: [C:+2] EventStreams - Expose mediawiki.page_outlink_topic_prediction_change.v1 stream [deployment-charts] - https://gerrit.wikimedia.org/r/1294341 (https://phabricator.wikimedia.org/T427416) (owner: Ottomata) [16:05:06] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host durum5003.eqsin.wmnet with OS trixie [16:06:55] (PS1) Sbisson: frwiki: restrict Article Guidance experiment to junior editors [mediawiki-config] - https://gerrit.wikimedia.org/r/1294344 (https://phabricator.wikimedia.org/T426871) [16:07:01] (Merged) jenkins-bot: EventStreams - Expose mediawiki.page_outlink_topic_prediction_change.v1 stream [deployment-charts] - https://gerrit.wikimedia.org/r/1294341 (https://phabricator.wikimedia.org/T427416) (owner: Ottomata) [16:07:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1294342 (https://phabricator.wikimedia.org/T426871) (owner: Sbisson) [16:08:20] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294343 (https://phabricator.wikimedia.org/T426871) (owner: Sbisson) [16:08:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1294344 (https://phabricator.wikimedia.org/T426871) (owner: Sbisson) [16:08:54] FIRING: [2x] JobUnavailable: Reduced availability for
job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:25] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9400.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:10:16] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [16:10:25] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1178: Recovering from failure in cookbook [16:10:26] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [16:10:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1220 (T426633)', diff saved to https://phabricator.wikimedia.org/P93283 and previous config saved to /var/cache/conftool/dbconfig/20260527-161034-fceratto.json [16:10:55] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [16:11:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1224 (T426633)', diff saved to https://phabricator.wikimedia.org/P93284 and previous config saved to /var/cache/conftool/dbconfig/20260527-161101-fceratto.json [16:11:43] (CR) Ladsgroup: "Do you want me to leave it for Ceri to try or you're okay with me moving forward?
😄" [puppet] - https://gerrit.wikimedia.org/r/1292346 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup) [16:12:02] (Merged) jenkins-bot: tests/integration: readability improvements [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1293147 (https://phabricator.wikimedia.org/T385798) (owner: Hnowlan) [16:12:44] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [16:13:15] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [16:13:21] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [16:14:05] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [16:15:34] ops-codfw, DC-Ops: Power Supply - Status - issue on aqs2011:9290 - https://phabricator.wikimedia.org/T427409#11960354 (Jhancock.wm) Open→Resolved a:Jhancock.wm bad cpu. replaced [16:16:42] (PS3) BCornwall: Remove lvs1016, promote lvs1017 [puppet] - https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) [16:16:42] (PS4) BCornwall: Remove lvs1016 hieradata, demote to insetup_noferm [puppet] - https://gerrit.wikimedia.org/r/1286524 (https://phabricator.wikimedia.org/T421421) [16:16:42] (PS1) BCornwall: lvs: Set lvs1017 interface name [puppet] - https://gerrit.wikimedia.org/r/1294346 (https://phabricator.wikimedia.org/T421421) [16:17:02] ops-eqiad, SRE, collaboration-services, DC-Ops: Repurpose ganeti102[3456] for Zuul migration - https://phabricator.wikimedia.org/T427353#11960361 (Dzahn) Are these machines supposed to replace the main zuul VMs zuul1001/2001? I am missing the context a bit how we got to physical hardware being a...
[16:17:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T426633)', diff saved to https://phabricator.wikimedia.org/P93285 and previous config saved to /var/cache/conftool/dbconfig/20260527-161753-fceratto.json [16:20:02] (PS1) Dzahn: site: add zuul[12]00[4-9] with insetup role [puppet] - https://gerrit.wikimedia.org/r/1294347 (https://phabricator.wikimedia.org/T427353) [16:20:36] SRE-Access-Requests: Requesting access to releasers-mediawiki for matmarex, ariel, jgiannelos - https://phabricator.wikimedia.org/T427421 (ArielGlenn) NEW [16:20:42] (CR) CI reject: [V:-1] site: add zuul[12]00[4-9] with insetup role [puppet] - https://gerrit.wikimedia.org/r/1294347 (https://phabricator.wikimedia.org/T427353) (owner: Dzahn) [16:21:11] jouncebot: nowandnext [16:21:11] No deployments scheduled for the next 0 hour(s) and 8 minute(s) [16:21:11] In 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC late) (extended edition) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T1630) [16:21:14] SRE-Access-Requests: Requesting access to releasers-mediawiki for matmarex, ariel, jgiannelos - https://phabricator.wikimedia.org/T427421#11960374 (ArielGlenn) [16:21:33] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:21:50] (CR) ArielGlenn: "Will https://phabricator.wikimedia.org/T427421 suffice?"
[puppet] - https://gerrit.wikimedia.org/r/1293769 (https://phabricator.wikimedia.org/T423255) (owner: ArielGlenn) [16:22:05] ops-eqiad, DC-Ops: Power Supply - PS Redundancy - issue on dbproxy1024:9290 - https://phabricator.wikimedia.org/T427408#11960378 (Jclark-ctr) a:Jclark-ctr [16:24:15] (PS1) Dzahn: installserver: update partman for mixed VM/physical zuul machines [puppet] - https://gerrit.wikimedia.org/r/1294348 (https://phabricator.wikimedia.org/T427353) [16:24:25] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9400.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:33] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [16:27:36] ops-eqiad, SRE, collaboration-services, DC-Ops, Patch-For-Review: Repurpose ganeti102[3456] for Zuul migration - https://phabricator.wikimedia.org/T427353#11960386 (Dzahn) @thcipriani @dduvall Was this requested by you because the existing VMs are too limited? Is the idea to replace (just) t...
[16:27:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270986 (https://phabricator.wikimedia.org/T413331) (owner: Robertsky) [16:28:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P93287 and previous config saved to /var/cache/conftool/dbconfig/20260527-162800-fceratto.json [16:28:42] ops-codfw, DC-Ops: Power Supply - Status - issue on mc-gp2005:9290 - https://phabricator.wikimedia.org/T427410#11960388 (Jhancock.wm) PSU is bad. don't have an easy replacement. opened a ticket with dell. [16:29:26] (CR) Dzahn: [C:+1] add new members of mw release working group to releasers-mediawiki [puppet] - https://gerrit.wikimedia.org/r/1293769 (https://phabricator.wikimedia.org/T423255) (owner: ArielGlenn) [16:30:05] swfrench-wmf: OwO what's this, a deployment window?? MediaWiki infrastructure (UTC late) (extended edition). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T1630).
nyaa~ [16:30:10] o/ [16:30:17] (PS2) Scott French: profile::services_proxy::envoy: Add non-discovery shellbox listeners [puppet] - https://gerrit.wikimedia.org/r/1293771 [16:30:19] (PS2) Scott French: profile::services_proxy::envoy: Enable non-discovery shellbox listeners [puppet] - https://gerrit.wikimedia.org/r/1293772 [16:30:20] (PS2) Scott French: ProductionServices: Temporarily use shellbox in codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/1293774 [16:30:21] (PS2) Scott French: ProductionServices: Temporarily use shellbox in eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/1293775 [16:30:21] (PS2) Scott French: ProductionServices: Revert to discovery shellbox listeners [mediawiki-config] - https://gerrit.wikimedia.org/r/1293776 [16:31:45] we'll be starting some maintenance shortly. please do not start any new MediaWiki deployments. [16:33:05] (PS3) Dzahn: jenkins: add firewall rule for new jenkins to gearman on legacy host [puppet] - https://gerrit.wikimedia.org/r/1275537 (https://phabricator.wikimedia.org/T418521) [16:33:06] (CR) Dzahn: jenkins: add firewall rule for new jenkins to gearman on legacy host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1275537 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn) [16:33:34] (CR) CDanis: [C:+1] profile::services_proxy::envoy: Add non-discovery shellbox listeners [puppet] - https://gerrit.wikimedia.org/r/1293771 (owner: Scott French) [16:33:46] (CR) CDanis: [C:+1] profile::services_proxy::envoy: Enable non-discovery shellbox listeners [puppet] - https://gerrit.wikimedia.org/r/1293772 (owner: Scott French) [16:33:54] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:41] (CR) Scott
French: [C:+2] profile::services_proxy::envoy: Add non-discovery shellbox listeners [puppet] - https://gerrit.wikimedia.org/r/1293771 (owner: Scott French) [16:35:45] (CR) Dzahn: [C:+2] jenkins: add firewall rule for new jenkins to gearman on legacy host [puppet] - https://gerrit.wikimedia.org/r/1275537 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn) [16:35:47] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11960461 (Papaul) @cmooney yes we will change the VLAN-id and rename the VLAN for rack 0603 during the switch migration. so it will be... [16:35:54] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:56] ops-eqiad, SRE, DC-Ops: Power Supply - PS Redundancy - issue on dbproxy1024:9290 - https://phabricator.wikimedia.org/T427408#11960466 (Jclark-ctr) Open→Resolved [16:36:14] (CR) Scott French: [C:+2] profile::services_proxy::envoy: Enable non-discovery shellbox listeners [puppet] - https://gerrit.wikimedia.org/r/1293772 (owner: Scott French) [16:36:37] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11960468 (Papaul) [16:37:51] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2163: Repooling after testing patch [16:38:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P93290 and previous config saved to /var/cache/conftool/dbconfig/20260527-163808-fceratto.json [16:40:25] (PS3) Dzahn: add new members of mw release working
group to releasers-mediawiki [puppet] - https://gerrit.wikimedia.org/r/1293769 (https://phabricator.wikimedia.org/T427421) (owner: ArielGlenn) [16:40:33] PROBLEM - Check unit status of ipip-multiqueue-optimizer on lvs1017 is CRITICAL: CRITICAL: Status of the systemd unit ipip-multiqueue-optimizer https://wikitech.wikimedia.org/wiki/LVS%23IPIP_encapsulation_experiments [16:41:48] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-mediawiki for matmarex, ariel, jgiannelos - https://phabricator.wikimedia.org/T427421#11960511 (Dzahn) Since the group approver has already +1ed and all users are existing shell users there is nothing to be done besides mergin... [16:41:59] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1017.eqiad.wmnet with reason: Setting up [16:43:16] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp1112.eqiad.wmnet [16:43:58] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-mediawiki for matmarex, ariel, jgiannelos - https://phabricator.wikimedia.org/T427421#11960525 (Dzahn) Open→In progress [16:44:11] (CR) Dzahn: [C:+2] add new members of mw release working group to releasers-mediawiki [puppet] - https://gerrit.wikimedia.org/r/1293769 (https://phabricator.wikimedia.org/T427421) (owner: ArielGlenn) [16:45:25] ops-codfw, SRE, DBA, DC-Ops: db2212 failed to reboot - https://phabricator.wikimedia.org/T427388#11960531 (Jhancock.wm) it halted in the boot and i had to pull the power entirely to get it to reboot and make it past post. There still isn't anything new in the event logs. Can I update the firmware...
[16:46:09] (03CR) 10Dzahn: [C:04-2] "syntax error" [puppet] - 10https://gerrit.wikimedia.org/r/1294347 (https://phabricator.wikimedia.org/T427353) (owner: 10Dzahn) [16:46:12] (03CR) 10CWilliams: "@Ladsgroup@gmail.com I am out until next week, so if it can wait until then I can give it a go, else proceed if it is blocking you" [puppet] - 10https://gerrit.wikimedia.org/r/1292346 (https://phabricator.wikimedia.org/T426984) (owner: 10Ladsgroup) [16:46:30] (03PS2) 10Dzahn: installserver: update partman for mixed VM/physical zuul machines [puppet] - 10https://gerrit.wikimedia.org/r/1294348 (https://phabricator.wikimedia.org/T427353) [16:47:43] (03CR) 10Dzahn: [C:03+2] "[releases1003:~] $ id ariel" [puppet] - 10https://gerrit.wikimedia.org/r/1293769 (https://phabricator.wikimedia.org/T427421) (owner: 10ArielGlenn) [16:47:49] (03PS3) 10Federico Ceratto: sre.mysql.pool: Support depooling unreachable hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1294265 (https://phabricator.wikimedia.org/T427381) [16:48:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T426633)', diff saved to https://phabricator.wikimedia.org/P93291 and previous config saved to /var/cache/conftool/dbconfig/20260527-164815-fceratto.json [16:48:29] (03CR) 10Clément Goubert: [C:03+1] ratelimit: replace rdb2009 with rdb2013 #1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294274 (https://phabricator.wikimedia.org/T418924) (owner: 10Effie Mouzeli) [16:48:38] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1264.eqiad.wmnet with reason: Maintenance [16:48:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1264 (T426633)', diff saved to https://phabricator.wikimedia.org/P93292 and previous config saved to /var/cache/conftool/dbconfig/20260527-164846-fceratto.json [16:49:06] (03PS16) 10FNegri: sre.mysql.multiinstance_reboot: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 
(https://phabricator.wikimedia.org/T420203) [16:49:53] (03CR) 10FNegri: "I refactored this to be a separate cookbook, as we discussed in I123f2c5c8a9aa3f52c5a29ed4d600b80781e46dc." [cookbooks] - 10https://gerrit.wikimedia.org/r/1290806 (https://phabricator.wikimedia.org/T420203) (owner: 10FNegri) [16:50:06] (03CR) 10CWilliams: "Fine with me" [cookbooks] - 10https://gerrit.wikimedia.org/r/1291993 (https://phabricator.wikimedia.org/T420203) (owner: 10Federico Ceratto) [16:50:26] (03CR) 10Atsuko: [C:03+2] Provision stream-internal.w.o [dns] - 10https://gerrit.wikimedia.org/r/1294326 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [16:50:49] (03PS3) 10Scott French: ProductionServices: Temporarily use shellbox in eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293775 [16:50:49] (03PS3) 10Scott French: ProductionServices: Temporarily use shellbox in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293774 [16:50:49] (03PS3) 10Scott French: ProductionServices: Revert to discovery shellbox listeners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293776 [16:51:25] !log atsuko@dns1004 START - running authdns-update [16:51:26] (03CR) 10CDanis: [C:03+1] ProductionServices: Temporarily use shellbox in eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293775 (owner: 10Scott French) [16:51:31] (03PS2) 10Arlolra: Deploy PRV to 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293805 (https://phabricator.wikimedia.org/T427331) [16:52:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293775 (owner: 10Scott French) [16:53:24] !log atsuko@dns1004 END - running authdns-update [16:53:49] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for matmarex, ariel, jgiannelos - https://phabricator.wikimedia.org/T427421#11960573 (10Dzahn) Users have been created / added to the group on 
`releases1003/releases2003`. The currently active server is `releases2003`... [16:54:07] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for matmarex, ariel, jgiannelos - https://phabricator.wikimedia.org/T427421#11960575 (10Dzahn) 05In progress→03Resolved a:03Dzahn [16:56:00] (03Merged) 10jenkins-bot: ProductionServices: Temporarily use shellbox in eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293775 (owner: 10Scott French) [16:56:25] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1293775|ProductionServices: Temporarily use shellbox in eqiad]] [16:58:04] (03CR) 10Atsuko: [C:03+2] trafficserver: enable stream-internal.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1294327 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [16:58:30] !log swfrench@deploy1003 swfrench: Backport for [[gerrit:1293775|ProductionServices: Temporarily use shellbox in eqiad]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:00:50] !log swfrench@deploy1003 swfrench: Continuing with deployment [17:02:02] (03CR) 10Dzahn: [C:03+2] installserver: update partman for mixed VM/physical zuul machines [puppet] - 10https://gerrit.wikimedia.org/r/1294348 (https://phabricator.wikimedia.org/T427353) (owner: 10Dzahn) [17:04:25] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9400.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:09] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293775|ProductionServices: Temporarily use shellbox in eqiad]] (duration: 08m 44s) [17:06:23] (03PS2) 10BCornwall: lvs: Set lvs1017 interface overrides [puppet] - 10https://gerrit.wikimedia.org/r/1294346 (https://phabricator.wikimedia.org/T421421) [17:06:23] (03PS4) 10BCornwall: Remove lvs1016, promote lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) [17:06:23] (03PS5) 10BCornwall: Remove lvs1016 hieradata, demote to insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/1286524 (https://phabricator.wikimedia.org/T421421) [17:09:25] FIRING: [9x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:58] (03CR) 10Ssingh: lvs: Set lvs1017 interface overrides (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1294346 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [17:10:20] (03PS3) 10BCornwall: lvs: Set lvs1017 interface overrides [puppet] - 10https://gerrit.wikimedia.org/r/1294346 (https://phabricator.wikimedia.org/T421421) [17:10:20] (03PS5) 10BCornwall: Remove lvs1016, 
promote lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) [17:10:20] (03PS6) 10BCornwall: Remove lvs1016 hieradata, demote to insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/1286524 (https://phabricator.wikimedia.org/T421421) [17:10:34] (03CR) 10BCornwall: lvs: Set lvs1017 interface overrides (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1294346 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [17:10:36] (03CR) 10Ssingh: [C:03+1] lvs: Set lvs1017 interface overrides [puppet] - 10https://gerrit.wikimedia.org/r/1294346 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [17:11:03] (03PS4) 10BCornwall: lvs: Set lvs1017 interface overrides [puppet] - 10https://gerrit.wikimedia.org/r/1294346 (https://phabricator.wikimedia.org/T421421) [17:11:39] (03CR) 10CWilliams: sre.mysql.pool: Support depooling unreachable hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1294265 (https://phabricator.wikimedia.org/T427381) (owner: 10Federico Ceratto) [17:11:46] (03CR) 10BCornwall: [C:03+2] lvs: Set lvs1017 interface overrides [puppet] - 10https://gerrit.wikimedia.org/r/1294346 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [17:12:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://eventstreams-internal.svc.codfw.wmnet:4992 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:13:55] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:14:03] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1016 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [17:14:44] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:14:44] (03CR) 10CWilliams: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [17:14:45] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:15:39] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:15:40] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:16:09] RECOVERY - Check unit status of ipip-multiqueue-optimizer on lvs1017 is OK: OK: Status of the systemd unit ipip-multiqueue-optimizer https://wikitech.wikimedia.org/wiki/LVS%23IPIP_encapsulation_experiments [17:16:12] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:16:13] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:16:19] (03PS2) 10Dzahn: site: add zuul[12]00[4-9] with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1294347 (https://phabricator.wikimedia.org/T427353) [17:16:20] (03PS2) 10Jdlrobson: Thumbnails are not being optimized in large mode [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294322 (https://phabricator.wikimedia.org/T427237) [17:16:31] (03PS1) 10Jdlrobson: Thumbnails are not being optimized in large mode [core] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1294360 (https://phabricator.wikimedia.org/T427237) [17:16:48] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:16:50] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:17:21] !log swfrench@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/shellbox-timeline: apply [17:17:22] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:18:08] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:18:55] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:19:25] FIRING: [9x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:10] (03PS21) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [17:21:10] (03PS21) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [17:21:10] (03PS1) 10Majavah: P:openstack: encapi: Fix type of firewall source port [puppet] - 10https://gerrit.wikimedia.org/r/1294361 [17:21:11] (03PS1) 10Majavah: firewall: client: Add missing src_ips parameter [puppet] - 10https://gerrit.wikimedia.org/r/1294362 [17:22:59] (03CR) 10CI reject: [V:04-1] firewall: client: Add missing src_ips parameter [puppet] - 10https://gerrit.wikimedia.org/r/1294362 (owner: 10Majavah) [17:23:29] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [17:23:45] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[17:23:54] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:23:55] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:24:17] (03PS4) 10Scott French: ProductionServices: Temporarily use shellbox in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293774 [17:24:25] FIRING: [9x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:24:49] (03CR) 10Majavah: [C:03+2] P:openstack: encapi: Fix type of firewall source port [puppet] - 10https://gerrit.wikimedia.org/r/1294361 (owner: 10Majavah) [17:25:04] (03CR) 10Majavah: [C:03+2] firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [17:25:14] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp1114.eqiad.wmnet [17:25:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293774 (owner: 10Scott French) [17:27:33] (03Merged) 10jenkins-bot: ProductionServices: Temporarily use shellbox in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293774 (owner: 10Scott French) [17:28:00] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1293774|ProductionServices: Temporarily use shellbox in codfw]] [17:28:02] (03CR) 10Dzahn: [C:03+2] site: add zuul[12]00[4-9] with insetup role 
[puppet] - 10https://gerrit.wikimedia.org/r/1294347 (https://phabricator.wikimedia.org/T427353) (owner: 10Dzahn) [17:30:20] (03CR) 10Majavah: [C:03+1] "that one has now been merged, so this is obsolete" [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [17:30:25] (03Abandoned) 10Majavah: Rename role::mariadb::ferm to role::mariadb::firewall [puppet] - 10https://gerrit.wikimedia.org/r/1289378 (https://phabricator.wikimedia.org/T411089) (owner: 10JHathaway) [17:31:07] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [17:31:42] !log swfrench@deploy1003 swfrench: Backport for [[gerrit:1293774|ProductionServices: Temporarily use shellbox in codfw]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:33:29] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [17:34:25] FIRING: [9x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:37:15] (03PS22) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [17:37:15] (03PS2) 10Majavah: firewall: client: Remove reference to nonexistent param [puppet] - 10https://gerrit.wikimedia.org/r/1294362 [17:38:48] !log swfrench@deploy1003 swfrench: Continuing with deployment [17:38:55] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8587/co" [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [17:39:25] FIRING: [2x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:40:13] (03CR) 10Majavah: [V:03+1] "This one is finally ready, I think." 
[puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [17:40:19] (03CR) 10Dzahn: trafficserver: add a map for gitlab as a backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1290731 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [17:40:42] jouncebot: nowandnext [17:40:42] For the next 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC late) (extended edition) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T1630) [17:40:42] In 2 hour(s) and 19 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T2000) [17:41:43] (03CR) 10Dzahn: "looks like this needs to wait for the port discussion to conclude" [puppet] - 10https://gerrit.wikimedia.org/r/1290684 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [17:42:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:43:01] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293774|ProductionServices: Temporarily use shellbox in codfw]] (duration: 15m 01s) [17:43:55] (03PS4) 10Scott French: ProductionServices: Revert to discovery shellbox listeners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293776 [17:47:54] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [17:49:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1264 (T426633)', diff saved to https://phabricator.wikimedia.org/P93293 and previous config saved to /var/cache/conftool/dbconfig/20260527-174900-fceratto.json [17:51:28] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:52:38] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: 
apply [17:52:39] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:53:19] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:53:21] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [17:53:45] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [17:53:46] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:54:09] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:54:10] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [17:54:25] FIRING: [3x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:39] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [17:54:40] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:55:31] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:56:09] (03CR) 10CDanis: [C:03+1] ProductionServices: Revert to discovery shellbox listeners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293776 (owner: 10Scott French) [17:57:20] (03CR) 10CDanis: [C:03+2] ProductionServices: Revert to discovery shellbox listeners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293776 (owner: 10Scott French) [17:57:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293776 (owner: 10Scott French) [17:58:12] (03Merged) 10jenkins-bot: ProductionServices: Revert to discovery shellbox listeners 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293776 (owner: 10Scott French) [17:58:39] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1293776|ProductionServices: Revert to discovery shellbox listeners]] [17:59:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1264', diff saved to https://phabricator.wikimedia.org/P93294 and previous config saved to /var/cache/conftool/dbconfig/20260527-175908-fceratto.json [18:00:36] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [18:00:40] !log swfrench@deploy1003 swfrench: Backport for [[gerrit:1293776|ProductionServices: Revert to discovery shellbox listeners]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:00:58] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [18:01:00] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [18:01:12] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [18:01:14] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [18:01:26] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [18:01:27] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:01:40] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:01:42] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [18:02:00] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [18:02:02] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [18:02:10] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [18:03:31] !log 
swfrench@deploy1003 swfrench: Continuing with deployment [18:05:58] (03PS1) 10Eric Gardner: Carousel only on articles [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294370 (https://phabricator.wikimedia.org/T427336) [18:07:16] (03PS1) 10Scott French: profile::services_proxy::envoy: Disable non-discovery shellbox listeners [puppet] - 10https://gerrit.wikimedia.org/r/1294371 [18:07:26] (03CR) 10JHathaway: [C:03+1] firewall: client: Remove reference to nonexistent param [puppet] - 10https://gerrit.wikimedia.org/r/1294362 (owner: 10Majavah) [18:07:44] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp5024.eqsin.wmnet [18:08:26] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs1017.eqiad.wmnet [18:08:27] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1017.eqiad.wmnet [18:09:03] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1293776|ProductionServices: Revert to discovery shellbox listeners]] (duration: 10m 24s) [18:09:07] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [18:09:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1264', diff saved to https://phabricator.wikimedia.org/P93295 and previous config saved to /var/cache/conftool/dbconfig/20260527-180915-fceratto.json [18:10:05] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [18:10:22] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [18:11:16] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [18:11:50] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [18:12:28] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [18:12:39] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: 
apply [18:13:14] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [18:13:23] (03CR) 10Scott French: [C:03+2] profile::services_proxy::envoy: Disable non-discovery shellbox listeners [puppet] - 10https://gerrit.wikimedia.org/r/1294371 (owner: 10Scott French) [18:16:46] (03CR) 10BCornwall: [C:03+2] Remove lvs1016, promote lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [18:16:56] (03PS6) 10BCornwall: Remove lvs1016, promote lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) [18:17:08] (03PS7) 10BCornwall: Remove lvs1016 hieradata, demote to insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/1286524 (https://phabricator.wikimedia.org/T421421) [18:18:04] (03PS3) 10Majavah: firewall: client: Remove reference to nonexistent param [puppet] - 10https://gerrit.wikimedia.org/r/1294362 [18:18:52] (03PS1) 10Ebernhardson: identity: Prune private ips from x-forwarded-for [extensions/CirrusSearch] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294373 (https://phabricator.wikimedia.org/T407432) [18:19:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/CirrusSearch] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294373 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [18:19:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1264 (T426633)', diff saved to https://phabricator.wikimedia.org/P93296 and previous config saved to /var/cache/conftool/dbconfig/20260527-181923-fceratto.json [18:19:33] (03CR) 10Majavah: [C:03+2] firewall: client: Remove reference to nonexistent param [puppet] - 10https://gerrit.wikimedia.org/r/1294362 (owner: 10Majavah) [18:19:41] (03CR) 10BCornwall: [C:03+2] Remove lvs1016, promote 
lvs1017 [puppet] - 10https://gerrit.wikimedia.org/r/1286523 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [18:20:39] jouncebot: nowandnext [18:20:39] For the next 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC late) (extended edition) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T1630) [18:20:39] In 1 hour(s) and 39 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T2000) [18:20:41] (03PS1) 10Ebernhardson: Revert^2 "cirrus: AB test query suggester variants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294374 [18:21:09] (03PS2) 10Ebernhardson: Revert^2 "cirrus: AB test query suggester variants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294374 (https://phabricator.wikimedia.org/T407432) [18:21:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294374 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [18:23:10] !log swfrench@deploy1003 Started scap sync-world: Helmfile-only deployment to clean up unused mesh listeners [18:24:14] !log swfrench@deploy1003 swfrench: Helmfile-only deployment to clean up unused mesh listeners synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:24:56] (03PS1) 10Catrope: auth: Mark the hidden token field used for reauth as skippable [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294375 (https://phabricator.wikimedia.org/T427398) [18:25:07] (03PS1) 10Catrope: Fix lastAuthTimestamp hack [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294376 (https://phabricator.wikimedia.org/T427398) [18:25:20] !log swfrench@deploy1003 swfrench: Continuing with deployment [18:26:08] swfrench-wmf: Once you're done I would like to deploy fixes for the current train blocker, could you ping me when I'm good to go? [18:26:36] RoanKattouw: will do! should be done in ~ 3-4m [18:27:09] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:27:11] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [18:28:08] ^Expected [18:29:05] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [18:29:22] !log swfrench@deploy1003 Finished scap sync-world: Helmfile-only deployment to clean up unused mesh listeners (duration: 06m 12s) [18:30:01] RoanKattouw: alright, I think the dust has settled. all yours! 
[18:30:09] RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:30:11] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:30:18] (03PS2) 10Dr0ptp4kt: Reactivate wikimedia.de email addresses for GrowthBook SSO [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294372 (https://phabricator.wikimedia.org/T418665) [18:31:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294376 (https://phabricator.wikimedia.org/T427398) (owner: 10Catrope) [18:31:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294375 (https://phabricator.wikimedia.org/T427398) (owner: 10Catrope) [18:31:29] (03CR) 10Dr0ptp4kt: "I believe we may need this in order for the wikimedia.de account holders to SSO login, coupled with their other accoutrements in IDM and G" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294372 (https://phabricator.wikimedia.org/T418665) (owner: 10Dr0ptp4kt) [18:32:49] !incidents [18:32:49] 8024 (ACKED) db2189 (paged)/MariaDB Replica SQL: s2 (paged) [18:32:49] 8025 (ACKED) db2189 (paged)/MariaDB Replica IO: s2 (paged) [18:32:49] 8026 (ACKED) db2189 (paged)/MariaDB Replica Lag: s2 (paged) [18:32:50] 8023 (RESOLVED) Host db2189 (paged) [18:34:03] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1016 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:34:05] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [18:35:08] !log joal@deploy1003 Started deploy [analytics/refinery@96cf761] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@96cf761f] [18:37:13] !log joal@deploy1003 Finished deploy [analytics/refinery@96cf761] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@96cf761f] (duration: 02m 04s) [18:39:00] (03Merged) 10jenkins-bot: Fix lastAuthTimestamp hack [extensions/CentralAuth] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294376 (https://phabricator.wikimedia.org/T427398) (owner: 10Catrope) [18:39:08] !log joal@deploy1003 Started deploy [analytics/refinery@96cf761]: Regular analytics weekly train [analytics/refinery@96cf761f] [18:40:14] !log joal@deploy1003 Finished deploy [analytics/refinery@96cf761]: Regular analytics weekly train [analytics/refinery@96cf761f] (duration: 01m 05s) [18:42:55] FIRING: [2x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs1017 with peer 208.80.154.196 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=eqiad%20prometheus/ops&var-server=lvs1017 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [18:43:07] ah wow, the page for this [18:43:08] !ack [18:43:09] 8028 (ACKED) [2x] PyBalBGPUnstable lvs sre (lvs1017:9090 pybal 64600 eqiad) [18:43:15] it's been ages [18:43:27] expected, brett ^ [18:43:38] thanks [18:44:03] PROBLEM - SSH on stat1008 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:45:26] (03Merged) 10jenkins-bot: auth: Mark the hidden token field used for reauth as skippable [core] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294375 
(https://phabricator.wikimedia.org/T427398) (owner: 10Catrope) [18:45:54] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1294376|Fix lastAuthTimestamp hack (T427398)]], [[gerrit:1294375|auth: Mark the hidden token field used for reauth as skippable (T427398)]] [18:45:59] T427398: Unable to edit pages on Mediawiki namespace on 1.47.0-wmf.4, redirects to Verify your Identity page - https://phabricator.wikimedia.org/T427398 [18:47:03] RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:47:45] !log catrope@deploy1003 catrope: Backport for [[gerrit:1294376|Fix lastAuthTimestamp hack (T427398)]], [[gerrit:1294375|auth: Mark the hidden token field used for reauth as skippable (T427398)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:49:30] !log catrope@deploy1003 catrope: Continuing with deployment [18:49:54] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp5031.eqsin.wmnet [18:51:56] lvs1017 is for some planned maintenance and will recover and do I need to look into anything? [18:52:19] moritzm: thanks but brett is on it and there should be no user-impact [18:52:27] ok! 
[18:53:36] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294376|Fix lastAuthTimestamp hack (T427398)]], [[gerrit:1294375|auth: Mark the hidden token field used for reauth as skippable (T427398)]] (duration: 07m 41s) [18:53:40] !log joal@deploy1003 Started deploy [analytics/refinery@96cf761]: Regular analytics weekly train [analytics/refinery@96cf761f] [18:53:41] T427398: Unable to edit pages on Mediawiki namespace on 1.47.0-wmf.4, redirects to Verify your Identity page - https://phabricator.wikimedia.org/T427398 [18:56:46] (03CR) 10Dzahn: [C:03+1] gitlab: use service name for upstream addr [puppet] - 10https://gerrit.wikimedia.org/r/1294219 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [18:58:41] !log joal@deploy1003 Finished deploy [analytics/refinery@96cf761]: Regular analytics weekly train [analytics/refinery@96cf761f] (duration: 05m 01s) [18:59:19] !log joal@deploy1003 Started deploy [analytics/refinery@96cf761] (thin): Regular analytics weekly train THIN [analytics/refinery@96cf761f] [19:01:28] !log joal@deploy1003 Finished deploy [analytics/refinery@96cf761] (thin): Regular analytics weekly train THIN [analytics/refinery@96cf761f] (duration: 02m 08s) [19:04:25] FIRING: [4x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:06:38] (03CR) 10C. 
Scott Ananian: [C:03+1] Deploy PRV to 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293805 (https://phabricator.wikimedia.org/T427331) (owner: 10Arlolra) [19:08:02] (03PS1) 10Cathal Mooney: lvs1017: change configured set of BGP peers to top-of-rack siwtch [puppet] - 10https://gerrit.wikimedia.org/r/1294385 (https://phabricator.wikimedia.org/T421421) [19:08:10] (03CR) 10Ladsgroup: "It's blocking community and a high ranking committee in the movement. So I push it forward." [puppet] - 10https://gerrit.wikimedia.org/r/1292346 (https://phabricator.wikimedia.org/T426984) (owner: 10Ladsgroup) [19:08:39] (03CR) 10CI reject: [V:04-1] lvs1017: change configured set of BGP peers to top-of-rack siwtch [puppet] - 10https://gerrit.wikimedia.org/r/1294385 (https://phabricator.wikimedia.org/T421421) (owner: 10Cathal Mooney) [19:10:25] (03PS2) 10Cathal Mooney: lvs1017: change configured set of BGP peers to top-of-rack siwtch [puppet] - 10https://gerrit.wikimedia.org/r/1294385 (https://phabricator.wikimedia.org/T421421) [19:11:41] (03PS1) 10Eevans: linked-artifacts: deploy hoarde v1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294387 (https://phabricator.wikimedia.org/T414112) [19:12:41] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [19:13:02] (03CR) 10BCornwall: [C:03+1] lvs1017: change configured set of BGP peers to top-of-rack siwtch [puppet] - 10https://gerrit.wikimedia.org/r/1294385 (https://phabricator.wikimedia.org/T421421) (owner: 10Cathal Mooney) [19:13:09] (03PS1) 10Dzahn: ci::firewall: srange and drange need to be arrays [puppet] - 10https://gerrit.wikimedia.org/r/1294388 (https://phabricator.wikimedia.org/T418521) [19:15:35] (03CR) 10Eevans: [C:03+2] linked-artifacts: deploy hoarde v1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294387 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [19:17:48] (03Merged) 10jenkins-bot: linked-artifacts: deploy hoarde v1.3.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294387 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [19:18:24] (03CR) 10Dzahn: [C:03+2] ci::firewall: srange and drange need to be arrays [puppet] - 10https://gerrit.wikimedia.org/r/1294388 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [19:19:25] FIRING: [7x] SystemdUnitFailed: opensearch_2@relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:19:31] (03CR) 10BCornwall: [V:03+1 C:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1294385 (https://phabricator.wikimedia.org/T421421) (owner: 10Cathal Mooney) [19:20:16] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply [19:20:17] (03PS1) 10DDesouza: miscweb: bump (design|research)-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294389 (https://phabricator.wikimedia.org/T344471) [19:20:33] !log 
eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply [19:20:58] (03CR) 10Thcipriani: [C:03+1] "One note here for @kharlan@wikimedia.org, you may want to clear out the main branch of any extraneous files since this will be checked out" [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [19:21:10] (03CR) 10BCornwall: [V:03+1 C:03+2] lvs1017: change configured set of BGP peers to top-of-rack siwtch [puppet] - 10https://gerrit.wikimedia.org/r/1294385 (https://phabricator.wikimedia.org/T421421) (owner: 10Cathal Mooney) [19:24:02] (03CR) 10Ssingh: "Let's plan to merge tomorrow, 14:00 UTC." [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [19:24:25] FIRING: [7x] SystemdUnitFailed: opensearch_2@relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:55] RESOLVED: [2x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs1017 with peer 208.80.154.196 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=eqiad%20prometheus/ops&var-server=lvs1017 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [19:27:58] nice [19:28:01] hooray [19:28:10] (03CR) 10Thcipriani: [C:04-1] scap.cfg.erb: Add hcaptcha checkout in production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [19:29:34] (03PS1) 10Dzahn: CI: better naming; avoid using terms "new" and "legacy" [puppet] - 10https://gerrit.wikimedia.org/r/1294392 [19:30:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288370 (https://phabricator.wikimedia.org/T423766) (owner: 10Pppery) [19:31:07] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [19:31:26] (03CR) 10DDesouza: [C:03+2] miscweb: bump (design|research)-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294389 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [19:31:39] (03PS1) 10Catrope: Permissions: Create wmf-officeit group on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294393 [19:32:08] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp5032.eqsin.wmnet [19:32:08] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp6016.drmrs.wmnet,cp[1112,1114].eqiad.wmnet,cp[5024,5031-5032].eqsin.wmnet} and A:cp [19:32:37] (03PS3) 10Cathal Mooney: ulsfo LVS: peer with the ToR switch [puppet] - 10https://gerrit.wikimedia.org/r/1282731 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [19:32:38] (03PS3) 10Cathal Mooney: LVS BGP: peer with the gateway if no exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [19:33:57] (03PS2) 10Dzahn: CI: better naming; avoid using terms "new" and "legacy" [puppet] - 10https://gerrit.wikimedia.org/r/1294392 [19:34:08] (03Merged) 10jenkins-bot: miscweb: bump (design|research)-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1294389 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [19:34:25] FIRING: [7x] SystemdUnitFailed: opensearch_2@relforge-eqiad-small-alpha.service on relforge1008:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:52] (03PS4) 10Cathal Mooney: LVS BGP: peer with the gateway if no exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [19:36:25] (03CR) 10Thcipriani: [C:03+1] scap.cfg.erb: Add hcaptcha checkout in production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [19:36:27] (03PS2) 10Ahmon Dancy: scap.cfg.erb: Add hcaptcha checkout in production [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) [19:37:00] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [19:37:05] (03CR) 10Ahmon Dancy: scap.cfg.erb: Add hcaptcha checkout in production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [19:37:09] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:37:11] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:37:19] (03CR) 10Thcipriani: [C:03+1] scap.cfg.erb: Add hcaptcha checkout in production [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [19:39:25] RESOLVED: [5x] SystemdUnitFailed: opensearch_2@relforge-eqiad-small-alpha.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[19:41:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293819 (https://phabricator.wikimedia.org/T426614) (owner: 10Bartosz Dziewoński) [19:42:05] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [19:42:22] (03CR) 10BCornwall: [C:03+2] Remove lvs1016 hieradata, demote to insetup_noferm [puppet] - 10https://gerrit.wikimedia.org/r/1286524 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [19:43:06] (03PS5) 10Cathal Mooney: LVS BGP: peer with the gateway if no exception is set [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [19:43:15] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282764 (owner: 10Ayounsi) [19:45:46] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:45:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 86062048 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:45:59] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:46:00] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:46:03] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1016 is CRITICAL: CRITICAL: Service pybal.service is not active. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [19:46:12] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:46:13] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:46:28] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:46:48] (03CR) 10BCornwall: [V:03+1 C:03+2] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8598/co" [puppet] - 10https://gerrit.wikimedia.org/r/1286524 (https://phabricator.wikimedia.org/T421421) (owner: 10BCornwall) [19:46:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2736424 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:48:06] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:48:19] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:48:20] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:48:36] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:48:37] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:48:56] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:51:24] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS bullseye [19:57:39] (03PS8) 10Jdlrobson: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T426689) (owner: 10HakanIST) [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T2000). 
nyaa~ [20:00:05] stephanebisson, ebernhardson, Pppery, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] here [20:00:09] \o [20:00:12] o/ [20:00:17] hi [20:00:48] I'm starting... [20:00:57] i'm not a deployer, i'd appreciate if someone could ship my change. it's not risky, it can go out together with whatever else. [20:02:02] I can do it MatmaRex [20:02:21] hmm looks like my deploy disappeared? [20:02:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1294342 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [20:02:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294343 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [20:02:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294344 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [20:04:28] k i'll do mine later since the deploy window looks very busy [20:04:34] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 12355 [20:05:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 12355 [20:06:38] i wonder sometimes if we need a second deploy window that works for west coast? I dunno if later (4pm?)
would be reasonable [20:08:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294393 (owner: 10Catrope) [20:14:09] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet [20:16:01] (03Merged) 10jenkins-bot: Allow disabling experiment for experienced editors (>=100 edits) [extensions/ArticleGuidance] (wmf/1.47.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1294342 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [20:16:08] (03Merged) 10jenkins-bot: frwiki: restrict Article Guidance experiment to junior editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294344 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [20:16:55] (03Merged) 10jenkins-bot: Allow disabling experiment for experienced editors (>=100 edits) [extensions/ArticleGuidance] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294343 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [20:17:22] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1294342|Allow disabling experiment for experienced editors (>=100 edits) (T426871)]], [[gerrit:1294343|Allow disabling experiment for experienced editors (>=100 edits) (T426871)]], [[gerrit:1294344|frwiki: restrict Article Guidance experiment to junior editors (T426871)]] [20:17:27] T426871: Enable AG experiment on phase 2 wikis - https://phabricator.wikimedia.org/T426871 [20:19:14] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1294342|Allow disabling experiment for experienced editors (>=100 edits) (T426871)]], [[gerrit:1294343|Allow disabling experiment for experienced editors (>=100 edits) (T426871)]], [[gerrit:1294344|frwiki: restrict Article Guidance experiment to junior editors (T426871)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug).
Changes can now be verified there. [20:20:39] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [20:20:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1020.eqiad.wmnet [20:21:19] !log sbisson@deploy1003 sbisson: Continuing with deployment [20:21:43] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1016.eqiad.wmnet with OS bullseye [20:22:36] 06SRE, 06Traffic, 13Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11961220 (10BCornwall) [20:25:33] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294342|Allow disabling experiment for experienced editors (>=100 edits) (T426871)]], [[gerrit:1294343|Allow disabling experiment for experienced editors (>=100 edits) (T426871)]], [[gerrit:1294344|frwiki: restrict Article Guidance experiment to junior editors (T426871)]] (duration: 08m 11s) [20:25:39] T426871: Enable AG experiment on phase 2 wikis - https://phabricator.wikimedia.org/T426871 [20:25:39] !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs1016.eqiad.wmnet [20:25:56] (03PS1) 10Bking: OpenSearch: Add required config for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) [20:26:16] ok, I'm done.
Over to you ebernhardson [20:26:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: 10Bking) [20:27:13] !log reprepro include php8.3_8.3.31-1+wmf12u2 into component/php83 for bookworm-wikimedia - T427312 [20:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:18] T427312: Build PHP 8.3 packages for bookworm - https://phabricator.wikimedia.org/T427312 [20:29:04] stashbot: thanks! [20:29:04] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [20:29:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [extensions/CirrusSearch] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294373 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [20:29:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294374 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [20:30:38] (03Merged) 10jenkins-bot: Revert^2 "cirrus: AB test query suggester variants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1294374 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [20:31:09] !log brett@cumin2002 START - Cookbook sre.dns.netbox [20:37:11] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [20:38:20] !log reprepro include php-defaults_94+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [20:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:26] T427312: Build PHP 8.3 packages for bookworm - https://phabricator.wikimedia.org/T427312 [20:39:06] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [20:39:07] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:39:08] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts lvs1016.eqiad.wmnet [20:40:17] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission lvs1016.eqiad.wmnet - https://phabricator.wikimedia.org/T427451#11961277 (10BCornwall) [20:40:34] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission lvs1016.eqiad.wmnet - https://phabricator.wikimedia.org/T427451#11961281 (10BCornwall) [20:40:37] 06SRE, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11961282 (10BCornwall) [20:41:37] 06SRE, 06Traffic: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11961284 (10BCornwall) 05In progress→03Resolved [20:43:44] (03Merged) 10jenkins-bot: identity: Prune private ips from x-forwarded-for [extensions/CirrusSearch] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1294373 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [20:43:54] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:43:56] !log reprepro include dh-php_5.5+wmf12u1 into component/php83 for bookworm-wikimedia - T427312 [20:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:01] T427312: Build PHP 8.3 packages for bookworm - https://phabricator.wikimedia.org/T427312 [20:44:14] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1294373|identity: Prune private ips from x-forwarded-for (T407432)]], [[gerrit:1294374|Revert^2 
"cirrus: AB test query suggester variants" (T407432)]] [20:44:19] T407432: Follow-up AB test of dym language model variants - https://phabricator.wikimedia.org/T407432 [20:46:07] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1294373|identity: Prune private ips from x-forwarded-for (T407432)]], [[gerrit:1294374|Revert^2 "cirrus: AB test query suggester variants" (T407432)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:47:33] !log ebernhardson@deploy1003 ebernhardson: Continuing with deployment [20:48:27] (03CR) 10Kosta Harlan: scap.cfg.erb: Add hcaptcha checkout in production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [20:48:54] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:51:45] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294373|identity: Prune private ips from x-forwarded-for (T407432)]], [[gerrit:1294374|Revert^2 "cirrus: AB test query suggester variants" (T407432)]] (duration: 07m 30s) [20:51:50] T407432: Follow-up AB test of dym language model variants - https://phabricator.wikimedia.org/T407432 [20:52:21] Pppery: you're up, config's should probbly fit in 10min [20:52:29] Not a deployer [20:52:53] hmm, ok i can ship. Yours and MatmaRex's? [20:53:18] sure. 
thanks [20:53:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288370 (https://phabricator.wikimedia.org/T423766) (owner: 10Pppery) [20:53:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293819 (https://phabricator.wikimedia.org/T426614) (owner: 10Bartosz Dziewoński) [20:55:55] (03Merged) 10jenkins-bot: Allow Vector 2022 font size changes in namespace 100 for enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288370 (https://phabricator.wikimedia.org/T423766) (owner: 10Pppery) [20:55:59] (03Merged) 10jenkins-bot: Fix case of 'commonsfinder' in $wgUrlProtocols [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293819 (https://phabricator.wikimedia.org/T426614) (owner: 10Bartosz Dziewoński) [20:56:25] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1288370|Allow Vector 2022 font size changes in namespace 100 for enwiktionary (T423766)]], [[gerrit:1293819|Fix case of 'commonsfinder' in $wgUrlProtocols (T426614)]] [20:56:31] T423766: Allow Vector-2022 font size changes in namespace 100 on the English Wiktionary - https://phabricator.wikimedia.org/T423766 [20:56:32] T426614: add "CommonsFinder://" custom scheme to $wgUrlProtocols for native app OAuth2 support - https://phabricator.wikimedia.org/T426614 [20:58:26] !log ebernhardson@deploy1003 matmarex, ebernhardson, pppery: Backport for [[gerrit:1288370|Allow Vector 2022 font size changes in namespace 100 for enwiktionary (T423766)]], [[gerrit:1293819|Fix case of 'commonsfinder' in $wgUrlProtocols (T426614)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:58:40] Looking [20:58:46] thanks [20:59:22] Looks good [20:59:29] MatmaRex: yours look ok? 
[20:59:34] ebernhardson: looks good, thanks [20:59:50] !log ebernhardson@deploy1003 matmarex, ebernhardson, pppery: Continuing with deployment [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T2100) [21:04:03] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288370|Allow Vector 2022 font size changes in namespace 100 for enwiktionary (T423766)]], [[gerrit:1293819|Fix case of 'commonsfinder' in $wgUrlProtocols (T426614)]] (duration: 07m 38s) [21:04:09] T423766: Allow Vector-2022 font size changes in namespace 100 on the English Wiktionary - https://phabricator.wikimedia.org/T423766 [21:04:10] T426614: add "CommonsFinder://" custom scheme to $wgUrlProtocols for native app OAuth2 support - https://phabricator.wikimedia.org/T426614 [21:04:48] (03PS3) 10Ahmon Dancy: scap.cfg.erb: Add hcaptcha checkout in production [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) [21:04:58] (03CR) 10Ahmon Dancy: scap.cfg.erb: Add hcaptcha checkout in production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [21:05:00] all set! deploy window complete [21:06:05] thank you! 
[21:09:25] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:25] FIRING: [7x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:56] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [21:20:21] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [21:20:22] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [21:20:49] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [21:23:54] RESOLVED: [3x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on relforge1008:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:24:25] FIRING: [7x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:37:41] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15 days, 0:00:00 on relforge[1008-1010].eqiad.wmnet with reason: non-production environment [21:43:06] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:45:40] (03PS1) 10Eric Gardner: Exclude more content from selection 
[extensions/ReaderExperiments] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294432 (https://phabricator.wikimedia.org/T426308)
[21:52:00] Heads up that I will be deploying two small patches in the readers window in about 10 minutes
[22:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260527T2200)
[22:00:36] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294370 (https://phabricator.wikimedia.org/T427336) (owner: Eric Gardner)
[22:02:00] EricGardner: me 2.
[22:02:09] EricGardner: are yours config only or backports?
[22:02:31] I'm doing backports
[22:03:07] I have 2, just started the first one
[22:03:12] (Merged) jenkins-bot: Carousel only on articles [extensions/MultimediaViewer] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294370 (https://phabricator.wikimedia.org/T427336) (owner: Eric Gardner)
[22:03:41] !log egardner@deploy1003 Started scap sync-world: Backport for [[gerrit:1294370|Carousel only on articles (T427336)]]
[22:03:46] T427336: Carousel: Limit the feature to article pages only - https://phabricator.wikimedia.org/T427336
[22:04:13] Since they are both backports and scap takes a long time, mind if we bulk the next ones together? There is an issue with thumbnail rendering on all pages impacting readers so pretty important this goes out
[22:04:24] (i was unable to find space in the earlier backport window)
[22:05:16] I'd prefer to backport my patches separately. I can wait to do my second one until you are done
[22:05:36] !log egardner@deploy1003 egardner: Backport for [[gerrit:1294370|Carousel only on articles (T427336)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:05:46] Okay that works.
I can bundle https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1294322 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1294360 together.
[22:09:32] !log egardner@deploy1003 egardner: Continuing with deployment
[22:10:07] (CR) Cwhite: [C:-1] OpenSearch: Add required config for bootstrapping a cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1294402 (https://phabricator.wikimedia.org/T427306) (owner: Bking)
[22:10:26] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11961402 (Papaul)
[22:10:45] PROBLEM - Host ml-serve1014 is DOWN: PING CRITICAL - Packet loss = 100%
[22:12:21] RECOVERY - Host ml-serve1014 is UP: PING OK - Packet loss = 0%, RTA = 3.61 ms
[22:13:41] !log egardner@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294370|Carousel only on articles (T427336)]] (duration: 10m 00s)
[22:13:46] T427336: Carousel: Limit the feature to article pages only - https://phabricator.wikimedia.org/T427336
[22:14:20] Jdlrobson: feel free to deploy your patches now
[22:14:31] thanks EricGardner on it
[22:15:20] (PS9) Jdlrobson: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T426689) (owner: HakanIST)
[22:16:08] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1294360 (https://phabricator.wikimedia.org/T427237) (owner: Jdlrobson)
[22:16:08] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294322 (https://phabricator.wikimedia.org/T427237) (owner: Jdlrobson)
[22:19:57] (PS1) Catrope: passwordlessLogin: Limit conditional mediation to the main login form [extensions/OATHAuth] (wmf/1.47.0-wmf.4) -
https://gerrit.wikimedia.org/r/1294435 (https://phabricator.wikimedia.org/T427419)
[22:22:09] (CR) Ladsgroup: Add config for conductwiki (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1292346 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[22:22:14] (PS2) Ladsgroup: Add config for conductwiki [puppet] - https://gerrit.wikimedia.org/r/1292346 (https://phabricator.wikimedia.org/T426984)
[22:22:19] Jdlrobson: Do you mind if I tag along with another patch after you're done?
[22:22:20] (CR) Ladsgroup: [V:+2 C:+2] Add config for conductwiki [puppet] - https://gerrit.wikimedia.org/r/1292346 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[22:23:06] RoanKattouw: Eric is after me but you can go after. I had a small cleanup patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1285523?usp=search) but that's not urgent. I can attempt that tomorrow.
[22:23:45] OK, do what you need to do and then please ping me when you're both done
[22:24:12] RoanKattouw: mine should be quick
[22:28:07] (Merged) jenkins-bot: Thumbnails are not being optimized in large mode [core] (wmf/1.47.0-wmf.3) - https://gerrit.wikimedia.org/r/1294360 (https://phabricator.wikimedia.org/T427237) (owner: Jdlrobson)
[22:30:37] (CR) CI reject: [V:-1] Thumbnails are not being optimized in large mode [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294322 (https://phabricator.wikimedia.org/T427237) (owner: Jdlrobson)
[22:31:23] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294322 (https://phabricator.wikimedia.org/T427237) (owner: Jdlrobson)
[22:32:00] :( flaky tests
[22:33:19] !log ladsgroup@deploy1003 Started scap sync-world: Add conduct.wikimedia.org (T426984)
[22:33:24] T426984: Create Conductwiki wiki - https://phabricator.wikimedia.org/T426984
[22:34:14] ops-eqsin, SRE, DC-Ops,
Infrastructure-Foundations, netops: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11961461 (Papaul)
[22:34:16] !log ladsgroup@deploy1003 ladsgroup: Add conduct.wikimedia.org (T426984) synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:35:25] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment
[22:36:08] (Merged) jenkins-bot: Thumbnails are not being optimized in large mode [core] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294322 (https://phabricator.wikimedia.org/T427237) (owner: Jdlrobson)
[22:36:54] Did Amir start a deploy...?
[22:37:00] Yes without being in this channel
[22:37:04] I'm pinging him on Slack about this now
[22:37:14] i think that just merged my change without testing o_o
[22:37:46] Parallel deploys? How is that even possible with the deployment lock?
[22:37:53] I guess scap backport only acquires the lock after the change merges?
[22:37:59] hmm i dont know what happened to my deploys
[22:38:12] Ok. I have to go right at 4pm so I will relinquish my spot in the queue. RoanKattouw: you are welcome to proceed once Jdlrobson is done
[22:38:19] My thing is less urgent and we can do it tomorrow
[22:38:26] EricGardner: want me to do yours if I have time?
[22:38:27] Jdlrobson: Your deploy is waiting for Amir's to be done
[22:38:34] 22:36:16 concurrent prep is locked by ladsgroup (pid 3468335) on Wed May 27 22:32:23 2026; reason is "Add conduct.wikimedia.org (T426984)".
[22:38:35] T426984: Create Conductwiki wiki - https://phabricator.wikimedia.org/T426984
[22:38:37] RoanKattouw Spiderpig says "All changes have been merged" but did not give me the chance to test
[22:39:05] Yeah Spiderpig is paused, it will try again in 10 minutes to see if Amir's is done by then
[22:39:12] urggh ok
[22:39:12] If not, idk if it waits again or just fails
[22:39:15] !log ladsgroup@deploy1003 Finished scap sync-world: Add conduct.wikimedia.org (T426984) (duration: 07m 16s)
[22:39:29] Aha, it finished and yours immediately resumed
[22:39:35] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1294360|Thumbnails are not being optimized in large mode (T427237)]], [[gerrit:1294322|Thumbnails are not being optimized in large mode (T427237)]]
[22:39:40] T427237: Regression: Thumbnails on content pages are not scaled for large preference without losing quality - https://phabricator.wikimedia.org/T427237
[22:39:45] ok cool
[22:40:06] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.sanitarium_restart
[22:40:06] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.mysql.sanitarium_restart (exit_code=99)
[22:40:18] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.sanitarium_restart
[22:40:20] Jdlrobson the second patch I was going to deploy was https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1294432
[22:40:42] But I can do it tomorrow if you run out of time
[22:40:49] EricGardner: np
[22:40:56] we'll see what happens :)
[22:40:59] thanks!
[22:41:01] but yeh will deploy if i can
[22:41:29] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1294360|Thumbnails are not being optimized in large mode (T427237)]], [[gerrit:1294322|Thumbnails are not being optimized in large mode (T427237)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:42:24] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment
[22:43:04] I'm so sorry.
I should have checked the window. It was late, I assumed everything is over
[22:43:13] If there is anything I can do to help, let me know
[22:43:26] ladsgroup@cumin1003 sanitarium_restart (PID 1976244) is awaiting input
[22:43:33] Amir1: All good, your deploy was really fast and Spiderpig's locking mechanism worked perfectly
[22:43:45] It paused Jon's deploy for 3 minutes and then automatically resumed when yours was done
[22:44:03] I was pushing a simple apache change
[22:44:11] glad it worked and sorry again
[22:45:02] RoanKattouw: you can go now and then i'll try and fit Eric's in
[22:45:09] (mine is just syncing now)
[22:46:30] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294360|Thumbnails are not being optimized in large mode (T427237)]], [[gerrit:1294322|Thumbnails are not being optimized in large mode (T427237)]] (duration: 06m 54s)
[22:46:34] T427237: Regression: Thumbnails on content pages are not scaled for large preference without losing quality - https://phabricator.wikimedia.org/T427237
[22:47:03] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1003 using scap backport" [extensions/OATHAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294435 (https://phabricator.wikimedia.org/T427419) (owner: Catrope)
[22:49:47] PROBLEM - VRRP status on cr3-eqsin is CRITICAL: VRRP CRITICAL - 1 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[22:50:20] (Merged) jenkins-bot: passwordlessLogin: Limit conditional mediation to the main login form [extensions/OATHAuth] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294435 (https://phabricator.wikimedia.org/T427419) (owner: Catrope)
[22:50:47] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1294435|passwordlessLogin: Limit conditional mediation to the main login form (T427419)]]
[22:50:52] T427419: Unable to finish 2FA - https://phabricator.wikimedia.org/T427419
[22:52:38] !log catrope@deploy1003 catrope: Backport for [[gerrit:1294435|passwordlessLogin: Limit conditional mediation to the main login form (T427419)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:54:27] !log catrope@deploy1003 catrope: Continuing with deployment
[22:55:01] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0)
[22:58:36] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294435|passwordlessLogin: Limit conditional mediation to the main login form (T427419)]] (duration: 07m 49s)
[22:58:41] T427419: Unable to finish 2FA - https://phabricator.wikimedia.org/T427419
[22:58:47] Jdlrobson: Mine is done, go ahead
[23:00:38] thanks RoanKattouw
[23:01:15] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294432 (https://phabricator.wikimedia.org/T426308) (owner: Eric Gardner)
[23:01:16] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T426689) (owner: HakanIST)
[23:02:11] (Merged) jenkins-bot: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T426689) (owner: HakanIST)
[23:02:27] (PS1) Ladsgroup: Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984)
[23:03:35] (CR) CI reject: [V:-1] Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[23:04:15] (Merged) jenkins-bot: Exclude more content from selection [extensions/ReaderExperiments] (wmf/1.47.0-wmf.4) - https://gerrit.wikimedia.org/r/1294432
(https://phabricator.wikimedia.org/T426308) (owner: Eric Gardner)
[23:04:42] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1294432|Exclude more content from selection (T426308)]], [[gerrit:1285523|Remove MinervaNightMode config after skin cleanup (T426689)]]
[23:04:49] T426308: [Share Highlights] Share card display edge cases - https://phabricator.wikimedia.org/T426308
[23:04:50] T426689: Remove night mode flags in Minerva and Vector - https://phabricator.wikimedia.org/T426689
[23:05:34] (PS2) Ladsgroup: Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984)
[23:06:26] (CR) CI reject: [V:-1] Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[23:06:37] !log jdlrobson@deploy1003 jdlrobson, h2o, egardner: Backport for [[gerrit:1294432|Exclude more content from selection (T426308)]], [[gerrit:1285523|Remove MinervaNightMode config after skin cleanup (T426689)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:09:12] !log jdlrobson@deploy1003 jdlrobson, h2o, egardner: Continuing with deployment
[23:09:28] ok lgtm
[23:10:37] (PS3) Ladsgroup: Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984)
[23:11:27] (CR) CI reject: [V:-1] Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[23:13:17] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:13:24] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1294432|Exclude more content from selection (T426308)]], [[gerrit:1285523|Remove MinervaNightMode config after skin cleanup (T426689)]] (duration: 08m 42s)
[23:13:28] all done
[23:13:30] T426308: [Share Highlights] Share card display edge cases - https://phabricator.wikimedia.org/T426308
[23:13:30] T426689: Remove night mode flags in Minerva and Vector - https://phabricator.wikimedia.org/T426689
[23:16:17] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:21:50] (PS4) Ladsgroup: Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984)
[23:22:40] (CR) CI reject: [V:-1] Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[23:23:54] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:25:54] RESOLVED: JobUnavailable: Reduced availability
for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:26:46] (PS5) Ladsgroup: Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984)
[23:27:35] (CR) CI reject: [V:-1] Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[23:30:04] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1293821 (owner: TrainBranchBot)
[23:39:37] (PS6) Ladsgroup: Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984)
[23:39:38] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1294440
[23:39:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1294440 (owner: TrainBranchBot)
[23:40:41] (CR) CI reject: [V:-1] Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984) (owner: Ladsgroup)
[23:53:03] (PS7) Ladsgroup: Init conductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1294438 (https://phabricator.wikimedia.org/T426984)
[23:54:47] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1294440 (owner: TrainBranchBot)
[23:59:57] jouncebot: nowandnext
[23:59:57] No deployments scheduled for the next 6 hour(s) and 0 minute(s)
[23:59:57] In 6 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC early)
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T0600)
[23:59:57] In 6 hour(s) and 0 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260528T0600)