[00:00:37] jouncebot: nowandnext
[00:00:37] No deployments scheduled for the next 6 hour(s) and 59 minute(s)
[00:00:37] In 6 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T0700)
[00:01:14] (CR) Zabe: [C:+2] Update documenation to reference config-schema.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1241167 (owner: Zabe)
[00:02:11] (Merged) jenkins-bot: Update documenation to reference config-schema.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1241167 (owner: Zabe)
[00:02:50] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1241167|Update documenation to reference config-schema.php]]
[00:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:05:08] !log zabe@deploy2002 zabe: Backport for [[gerrit:1241167|Update documenation to reference config-schema.php]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:06:14] !log zabe@deploy2002 zabe: Continuing with sync
[00:10:10] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1241167|Update documenation to reference config-schema.php]] (duration: 07m 20s)
[00:10:45] (PS3) Scott French: mesh: Support injection of extra env vars into envoy container [deployment-charts] - https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245)
[00:10:46] (PS3) Scott French: mediawiki: Bump mesh.configuration and mesh.deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245)
[00:10:46] (PS4) Scott French: mw-debug: Pilot new drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245)
[00:11:06] (CR) Scott French: "Thank you both for the reviews!" [deployment-charts] - https://gerrit.wikimedia.org/r/1242518 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[00:11:33] (CR) Scott French: "Thanks for digging into it! And yeah, the documentation situation is non-ideal in terms of what exactly an exec action does ... My fuzzy m" [deployment-charts] - https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[00:11:46] !log zabe@deploy2002:~$ foreachwiki extensions/TimedMediaHandler/maintenance/migrateTranscodeStates.php # T415064
[00:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:51] T415064: Backfill new status and touched columns - https://phabricator.wikimedia.org/T415064
[00:13:13] (CR) Scott French: "Ah, good catch! I completely forgot to update package.lock, which sextant clearly would have done for me (as you no doubt guessed, this wa" [deployment-charts] - https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[00:14:54] (CR) Scott French: "Thank you both!" [deployment-charts] - https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[00:15:27] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:15:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:17:46] (CR) Zabe: [C:+2] Start reading from new file tables on all small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1243274 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[00:18:41] (Merged) jenkins-bot: Start reading from new file tables on all small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1243274 (https://phabricator.wikimedia.org/T416548) (owner: Zabe)
[00:19:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:19:16] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1243274|Start reading from new file tables on all small wikis (T416548)]]
[00:19:21] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[00:19:42] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:21:33] !log zabe@deploy2002 zabe: Backport for [[gerrit:1243274|Start reading from new file tables on all small wikis (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:22:00] !log zabe@deploy2002 zabe: Continuing with sync
[00:23:22] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:25:56] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243274|Start reading from new file tables on all small wikis (T416548)]] (duration: 06m 40s)
[00:26:10] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
[00:27:23] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:28:23] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1019.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:40:11] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1243288
[00:40:11] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1243288 (owner: TrainBranchBot)
[00:45:23] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:45:23] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:49:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:52:43] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1243288 (owner: TrainBranchBot)
[01:01:03] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[01:08:58] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1243294
[01:08:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1243294 (owner: TrainBranchBot)
[01:10:39] (CR) ArielGlenn: [C:+1] "I think this is ok to go live." [deployment-charts] - https://gerrit.wikimedia.org/r/1239972 (owner: Daniel Kinzler)
[01:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:34:29] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:34:34] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1243294 (owner: TrainBranchBot)
[01:35:29] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[01:39:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T415786)', diff saved to https://phabricator.wikimedia.org/P89018 and previous config saved to /var/cache/conftool/dbconfig/20260225-013921-marostegui.json
[01:39:27] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[01:54:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P89019 and previous config saved to /var/cache/conftool/dbconfig/20260225-015430-marostegui.json
[02:00:55] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P89020 and previous config saved to /var/cache/conftool/dbconfig/20260225-020938-marostegui.json
[02:11:17] FIRING: [4x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:12:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:12:29] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:13:44] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 49s)
[02:16:17] FIRING: [14x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:21:17] FIRING: [22x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:23:32] !log [WDQS] Restart codfw wdqs-main
[02:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T415786)', diff saved to https://phabricator.wikimedia.org/P89021 and previous config saved to /var/cache/conftool/dbconfig/20260225-022446-marostegui.json
[02:24:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1261.eqiad.wmnet with reason: Maintenance
[02:24:55] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[02:25:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1261 (T415786)', diff saved to https://phabricator.wikimedia.org/P89022 and previous config saved to /var/cache/conftool/dbconfig/20260225-022502-marostegui.json
[02:26:17] FIRING: [32x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:31:17] FIRING: [34x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:33:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:17] FIRING: [32x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:56:17] FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:49:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:55:21] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[05:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[05:03:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:08:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:27:14] (CR) Marostegui: [C:+2] db2230: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1243070 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[05:31:59] (CR) Marostegui: "This host isn't accessible, is it still being installed? If so, wouldn't it make sense to put it with role insetup, install the OS and the" [puppet] - https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: Federico Ceratto)
[05:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:36:01] (PS1) Marostegui: dbproxy1023: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1243590 (https://phabricator.wikimedia.org/T414656)
[05:36:53] (CR) Marostegui: [C:+2] dbproxy1023: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1243590 (https://phabricator.wikimedia.org/T414656) (owner: Marostegui)
[05:38:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS trixie
[05:44:53] PROBLEM - pt-heartbeat-wikimedia process on db2230 is CRITICAL: PROCS CRITICAL: 0 processes with args pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23pt-heartbeat
[05:45:30] ^ me testing
[05:45:53] RECOVERY - pt-heartbeat-wikimedia process on db2230 is OK: PROCS OK: 1 process with args pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23pt-heartbeat
[05:46:57] (PS1) Marostegui: Revert "db2230: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/1243593
[05:47:34] (CR) Marostegui: [C:+2] Revert "db2230: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/1243593 (owner: Marostegui)
[05:54:48] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1023.eqiad.wmnet with reason: host reimage
[05:58:04] (PS1) Marostegui: mariadb: Add monitor_heartbeat to core hosts. [puppet] - https://gerrit.wikimedia.org/r/1243594 (https://phabricator.wikimedia.org/T285079)
[05:59:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1023.eqiad.wmnet with reason: host reimage
[06:00:23] (CR) Marostegui: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243594 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[06:00:26] (CR) CI reject: [V:-1] mariadb: Add monitor_heartbeat to core hosts. [puppet] - https://gerrit.wikimedia.org/r/1243594 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[06:01:59] (PS2) Marostegui: mariadb: Add monitor_heartbeat to core hosts. [puppet] - https://gerrit.wikimedia.org/r/1243594 (https://phabricator.wikimedia.org/T285079)
[06:04:27] (CR) Marostegui: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243594 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[06:13:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:13:49] (CR) Marostegui: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243594 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[06:16:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1023.eqiad.wmnet with OS trixie
[06:18:13] (CR) Marostegui: "PCC looks good: https://puppet-compiler.wmflabs.org/output/1243594/5908/" [puppet] - https://gerrit.wikimedia.org/r/1243594 (https://phabricator.wikimedia.org/T285079) (owner: Marostegui)
[06:19:15] (PS1) Marostegui: Revert "dbproxy1023: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1243598
[06:19:49] (CR) Marostegui: [C:+2] Revert "dbproxy1023: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1243598 (owner: Marostegui)
[06:24:42] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:28:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T0700)
[07:23:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:23:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:33:25] (PS1) Arnaudb: gerrit: fix known host for gerrit2002 [puppet] - https://gerrit.wikimedia.org/r/1243633
[07:33:40] (CR) Arnaudb: [C:+2] gerrit: fix known host for gerrit2002 [puppet] - https://gerrit.wikimedia.org/r/1243633 (owner: Arnaudb)
[07:37:26] (CR) Muehlenhoff: [C:+1] admin: rename gerrit system user [puppet] - https://gerrit.wikimedia.org/r/1243188 (https://phabricator.wikimedia.org/T338470) (owner: Dzahn)
[07:37:57] (CR) Muehlenhoff: [C:+1] "Looks good" [software/pywmflib] - https://gerrit.wikimedia.org/r/1243175 (owner: Elukey)
[07:38:07] (CR) Arnaudb: [C:+1] admin: rename gerrit system user [puppet] - https://gerrit.wikimedia.org/r/1243188 (https://phabricator.wikimedia.org/T338470) (owner: Dzahn)
[07:38:40] (CR) Arnaudb: [C:+1] gerrit: cleanup Hiera and tests after gerrit2 renaming [puppet] - https://gerrit.wikimedia.org/r/1243187 (https://phabricator.wikimedia.org/T338470) (owner: Dzahn)
[07:40:07] (CR) Muehlenhoff: "Yes, that's one of the perks of using firewall::service, the DNS resolution is performed by the Puppet server and not locally on the clien" [puppet] - https://gerrit.wikimedia.org/r/1242430 (owner: Muehlenhoff)
[07:40:55] (CR) Filippo Giunchedi: [C:+1] wmcs: infra-tracing-nfs improve requests failures [puppet] - https://gerrit.wikimedia.org/r/1243151 (https://phabricator.wikimedia.org/T399313) (owner: Volans)
[07:41:54] (CR) Muehlenhoff: [C:+2] Remove create_ecdsa_cert [puppet] - https://gerrit.wikimedia.org/r/1243090 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[07:48:00] (CR) Muehlenhoff: [C:+2] Remove various Hiera files only necessary for Puppet 5 [puppet] - https://gerrit.wikimedia.org/r/1243087 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[07:48:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:48:31] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:52:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:53:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:56:32] FIRING: [22x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:57:26] (PS1) Slyngshede: Signup: Rename setting for domain blocking [software/bitu] - https://gerrit.wikimedia.org/r/1243694 (https://phabricator.wikimedia.org/T418201)
[07:58:43] (CR) Slyngshede: [C:+2] Allow blacklisting of domains for signup (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1243007 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede)
[07:59:47] (PS1) Ecarg: turn on custom oTel spans for Wikifunctions [deployment-charts] - https://gerrit.wikimedia.org/r/1243695 (https://phabricator.wikimedia.org/T417750)
[07:59:56] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reimage for host cp2045.codfw.wmnet with OS trixie
[08:00:05] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T0800).
[08:00:06] No Gerrit patches in the queue for this window AFAICS.
[08:02:48] (PS1) DCausse: opensearch-semantic-search: configure cluster capacity [deployment-charts] - https://gerrit.wikimedia.org/r/1243696
[08:03:26] (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1243694 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede)
[08:03:42] (PS3) Muehlenhoff: pcc_update_facts: Rename variables [puppet] - https://gerrit.wikimedia.org/r/1227734 (https://phabricator.wikimedia.org/T365798)
[08:04:48] (PS3) Muehlenhoff: ferm: Remove obsolete OS check [puppet] - https://gerrit.wikimedia.org/r/1243045
[08:05:03] (CR) DCausse: "Based on https://docs.google.com/spreadsheets/d/17Ipli-b1Mlrqx22cihgsiJOKUFDQSREFYQKVQPC5zPo/edit?gid=1128647813#gid=1128647813" [deployment-charts] - https://gerrit.wikimedia.org/r/1243696 (owner: DCausse)
[08:09:38] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243045 (owner: Muehlenhoff)
[08:13:25] (CR) Slyngshede: [C:+1] "It wouldn't appear that we use "--puppet-master" anywhere, so should be safe to rename to --puppet-server" [puppet] - https://gerrit.wikimedia.org/r/1227734 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:14:30] !log slyngshede@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2045.codfw.wmnet with reason: host reimage
[08:17:00] (CR) Slyngshede: [C:+2] Signup: Rename setting for domain blocking [software/bitu] - https://gerrit.wikimedia.org/r/1243694 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede)
[08:17:04] (CR) Elukey: [C:+2] .wmfconfig: remove Buster [software/pywmflib] - https://gerrit.wikimedia.org/r/1243175 (owner: Elukey)
[08:17:23] (PS1) Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.7 [puppet] - https://gerrit.wikimedia.org/r/1243697 (https://phabricator.wikimedia.org/T418344)
[08:18:22] (CR) Arnaudb: [C:+1] "lgtm, welcome back!" [puppet] - https://gerrit.wikimedia.org/r/1243697 (https://phabricator.wikimedia.org/T418344) (owner: Jelto)
[08:19:41] (Merged) jenkins-bot: Signup: Rename setting for domain blocking [software/bitu] - https://gerrit.wikimedia.org/r/1243694 (https://phabricator.wikimedia.org/T418201) (owner: Slyngshede)
[08:20:06] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2045.codfw.wmnet with reason: host reimage
[08:20:08] (CR) Jelto: [C:+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.7 [puppet] - https://gerrit.wikimedia.org/r/1243697 (https://phabricator.wikimedia.org/T418344) (owner: Jelto)
[08:27:12] (CR) Fabfur: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1243258 (owner: BCornwall)
[08:30:07] (PS1) Ryan Kemper: wdqs: Add Blazegraph deadlock auto-remediation [puppet] - https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453)
[08:30:09] (PS1) Ryan Kemper: wdqs: Enable deadlock auto-remediation for codfw [puppet] - https://gerrit.wikimedia.org/r/1243699 (https://phabricator.wikimedia.org/T242453)
[08:30:12] (PS1) Ryan Kemper: wdqs: Enable deadlock auto-remediation for eqiad [puppet] - https://gerrit.wikimedia.org/r/1243700 (https://phabricator.wikimedia.org/T242453)
[08:30:51] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[08:31:01] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243699 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[08:31:27] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243137 (owner: Muehlenhoff)
[08:31:39] (PS3) Muehlenhoff: wmflib::service::probe::tcp_module_options: Remove support for Buster [puppet] - https://gerrit.wikimedia.org/r/1243137
[08:32:29] (CR) CI reject: [V:-1] wdqs: Add Blazegraph deadlock auto-remediation [puppet] - https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[08:33:52] (CR) Muehlenhoff: [C:+2] Reapply "Update two hooks to the variants from the puppetserver module" [puppet] - https://gerrit.wikimedia.org/r/1243107 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:35:46] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11649709 (MatthewVernon) >>! In T414805#11642586, @ShakespeareFan00 wrote: > This change to standardised size has also broken the "...
[08:37:41] (PS2) Ryan Kemper: wdqs: Add Blazegraph deadlock auto-remediation [puppet] - https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453)
[08:37:41] (PS2) Ryan Kemper: wdqs: Enable deadlock auto-remediation for codfw [puppet] - https://gerrit.wikimedia.org/r/1243699 (https://phabricator.wikimedia.org/T242453)
[08:37:41] (PS2) Ryan Kemper: wdqs: Enable deadlock auto-remediation for eqiad [puppet] - https://gerrit.wikimedia.org/r/1243700 (https://phabricator.wikimedia.org/T242453)
[08:39:17] (CR) Slyngshede: [C:+2] Inform about gitlab profile updating quirks [software/bitu] - https://gerrit.wikimedia.org/r/1242389 (https://phabricator.wikimedia.org/T416898) (owner: Slyngshede)
[08:40:18] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[08:40:25] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243699 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[08:40:27] (CR) Slyngshede: [C:+1] admin: add rsilvola to deployment group [puppet] - https://gerrit.wikimedia.org/r/1243196 (https://phabricator.wikimedia.org/T418004) (owner: Dzahn)
[08:42:15] (Merged) jenkins-bot: Inform about gitlab profile updating quirks [software/bitu] - https://gerrit.wikimedia.org/r/1242389 (https://phabricator.wikimedia.org/T416898) (owner: Slyngshede)
[08:44:11] (CR) Brouberol: "This is requiring 768GB of RAM and 96CPU. The ram request/limit is above the 150GB quota (https://grafana-rw.wikimedia.org/d/ca9c0221-4a0d" [deployment-charts] - https://gerrit.wikimedia.org/r/1243696 (owner: DCausse)
[08:45:37] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=gawiki --logwiki=metawiki DroopyDoggy AlterDiegos # T418330
[08:45:42] T418330: Unblock stuck global rename of AlterDiegos - https://phabricator.wikimedia.org/T418330
[08:46:29] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:46:34] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=mediawikiwiki --logwiki=metawiki Egortropeano Fortuna1992 # T418331
[08:46:38] T418331: Unblock stuck global rename of Fortuna1992 - https://phabricator.wikimedia.org/T418331
[08:47:29] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:49:24] (PS3) Ryan Kemper: wdqs: Add Blazegraph deadlock auto-remediation [puppet] - https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453)
[08:49:24] (PS3) Ryan Kemper: wdqs: Enable deadlock auto-remediation for codfw [puppet] - https://gerrit.wikimedia.org/r/1243699 (https://phabricator.wikimedia.org/T242453)
[08:49:24] (PS3) Ryan Kemper: wdqs: Enable deadlock auto-remediation for eqiad [puppet] - https://gerrit.wikimedia.org/r/1243700 (https://phabricator.wikimedia.org/T242453)
[08:49:28] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243137 (owner: Muehlenhoff)
[08:50:44] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[08:50:51] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1243699 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[08:52:23] (CR) Ayounsi: [C:+1] ferm: Remove obsolete OS check [puppet] - https://gerrit.wikimedia.org/r/1243045 (owner: Muehlenhoff)
[08:53:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:55:40] (CR) Arnaudb: [C:+1] backup: adjust gerrit file set after renaming of gerrit2 [puppet] - https://gerrit.wikimedia.org/r/1243183 (https://phabricator.wikimedia.org/T417247) (owner: Dzahn)
[08:56:34] (PS1) Elukey: CHANGELOG: add changelogs for release v3.0.0 [software/pywmflib] - https://gerrit.wikimedia.org/r/1243717
[08:57:38] (CR) Elukey: "I decided to go for a major version since we dropped support for older Python versions and upgraded the deps to match Bullseye's, the mino" [software/pywmflib] - https://gerrit.wikimedia.org/r/1243717 (owner: Elukey)
[08:57:38] (CR) Brouberol: [C:+1] "Looks great!" [puppet] - https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453) (owner: Ryan Kemper)
[08:57:46] (CR) Arnaudb: [C:+2] "I added the comment # pint disable promql/series" [alerts] - https://gerrit.wikimedia.org/r/1243102 (https://phabricator.wikimedia.org/T418084) (owner: Arnaudb)
[08:59:27] (Merged) jenkins-bot: gerrit: limit GerritHAProxyServiceUnavailable scope [alerts] - https://gerrit.wikimedia.org/r/1243102 (https://phabricator.wikimedia.org/T418084) (owner: Arnaudb)
[08:59:56] (CR) Ayounsi: [C:+1] "Great idea, and PCC looks sane. Careful rollout is still needed considering the blast radius." [puppet] - https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: Ssingh)
[09:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[09:01:51] (PS1) Federico Ceratto: site.pp: Add dborch1003 as insetup [puppet] - https://gerrit.wikimedia.org/r/1243719 (https://phabricator.wikimedia.org/T317179)
[09:05:43] (CR) Federico Ceratto: "AFAICT it should be possible to skip the insetup CR, puppet-merge etc by first getting the ipaddr for the new VM and then redeploy it dire" [puppet] - https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: Federico Ceratto)
[09:08:22] (PS1) Slyngshede: P:cache::haproxy set cp2045 to haproxy 3.0 [puppet] - https://gerrit.wikimedia.org/r/1243720 (https://phabricator.wikimedia.org/T418161)
[09:08:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:08:31] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:08:49] (03CR) 10Vgutierrez: "looks good, please see inline comments" [puppet] - 10https://gerrit.wikimedia.org/r/1243258 (owner: 10BCornwall) [09:13:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [09:13:29] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:13:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:13:38] (03CR) 10Muehlenhoff: [C:03+1] site.pp: Add dborch1003 as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1243719 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [09:18:39] (03CR) 10Gehel: "suggestion(non-blocking)" [puppet] - 10https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [09:21:25] FIRING: SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:21:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:21:31] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:22:15] !log push pfw policies - T418305 
[09:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:09] (03CR) 10Fabfur: [C:03+1] P:cache::haproxy set cp2045 to haproxy 3.0 [puppet] - 10https://gerrit.wikimedia.org/r/1243720 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [09:25:34] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11649823 (10Vgutierrez) a:03Vgutierrez [09:26:25] RESOLVED: SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:26:37] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, and 2 others: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11649825 (10elukey) I have just dropped everything from the restricted bucket (codfw endpoint via s3cmd) and flushed the redis databases in codfw and eqiad,... 
[09:27:39] (03CR) 10Muehlenhoff: [C:03+2] ferm: Remove obsolete OS check [puppet] - 10https://gerrit.wikimedia.org/r/1243045 (owner: 10Muehlenhoff) [09:30:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11649836 (10Vgutierrez) waiting for mcollins approval, I've pinged them on Slack cause I've failed to find their phabricator user so far [09:34:19] (03CR) 10Muehlenhoff: [C:03+2] mtail: Use the Debian version of mtail universally [puppet] - 10https://gerrit.wikimedia.org/r/1243048 (owner: 10Muehlenhoff) [09:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:35:06] (03CR) 10Ayounsi: "Overall lgtm." 
[puppet] - 10https://gerrit.wikimedia.org/r/1243137 (owner: 10Muehlenhoff) [09:35:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1243717 (owner: 10Elukey) [09:36:04] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1243594 (https://phabricator.wikimedia.org/T285079) (owner: 10Marostegui) [09:37:00] (03PS1) 10Elukey: docker_registry: remove the /test prefix special handling [puppet] - 10https://gerrit.wikimedia.org/r/1243726 (https://phabricator.wikimedia.org/T394476) [09:37:03] (03PS1) 10Elukey: docker_registry: move the /v2/restricted prefix to s3/apus [puppet] - 10https://gerrit.wikimedia.org/r/1243727 (https://phabricator.wikimedia.org/T412951) [09:37:23] (03PS1) 10Slyngshede: P:cache::haproxy install haproxy from main on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) [09:39:38] 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 3 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11649870 (10elukey) 05Stalled→03Open The new Ceph Reef version running on apus seems to work way better... 
[09:39:49] (03CR) 10CI reject: [V:04-1] P:cache::haproxy install haproxy from main on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [09:39:50] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v3.0.0 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1243717 (owner: 10Elukey) [09:39:51] (03CR) 10Federico Ceratto: [C:03+2] site.pp: Add dborch1003 as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1243719 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [09:41:01] (03CR) 10Muehlenhoff: P:cache::haproxy install haproxy from main on Trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [09:41:16] (03PS2) 10Slyngshede: P:cache::haproxy install haproxy from main on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) [09:41:47] (03PS1) 10Elukey: Upstream release v3.0.0 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1243729 [09:42:00] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v3.0.0 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1243729 (owner: 10Elukey) [09:45:46] (03PS3) 10Slyngshede: P:cache::haproxy install haproxy from main on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) [09:46:15] (03CR) 10Slyngshede: P:cache::haproxy install haproxy from main on Trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [09:48:47] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [09:49:38] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243731 [09:52:05] !log uploaded python3-wmflib_3.0.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia,trixie-wikimedia [09:52:07] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:58] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating records after renaming and moving vlan of some an-worker hosts - btullis@cumin1003" [09:54:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating records after renaming and moving vlan of some an-worker hosts - btullis@cumin1003" [09:54:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:54:10] !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host dborch1003.eqiad.wmnet with OS trixie [09:54:55] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [09:57:09] (03CR) 10Volans: CHANGELOG: add changelogs for release v3.0.0 (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1243717 (owner: 10Elukey) [09:57:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1024.eqiad.wmnet with OS bookworm [10:02:44] !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dborch1003.eqiad.wmnet with reason: host reimage [10:03:14] (03PS4) 10Daniel Kinzler: rest-gateway: use MINUTE limits in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239669 [10:05:39] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [10:05:42] (03PS4) 10Slyngshede: P:cache::haproxy install haproxy from main on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) [10:06:42] (03PS2) 10Daniel Kinzler: rest-gateway: fix x-wmf-ratelimit-policy in access log [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) [10:06:50] (03PS5) 10Slyngshede: P:cache::haproxy install haproxy from main on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) [10:09:08] (03CR) 10Elukey: [C:03+2] "Thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1240850 (owner: 10Brouberol) [10:09:23] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dborch1003.eqiad.wmnet with reason: host reimage [10:09:32] (03PS1) 10Elukey: Revert "setup.py: Pin setuptools < 82.0.0 to make pkg_resources available." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1243734 [10:10:50] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [10:12:05] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v3.0.0 (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1243717 (owner: 10Elukey) [10:12:06] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1024.eqiad.wmnet with reason: host reimage [10:12:12] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v3.0.0 (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1243717 (owner: 10Elukey) [10:13:11] (03CR) 10Slyngshede: P:cache::haproxy install haproxy from main on Trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [10:13:11] (03PS4) 10Daniel Kinzler: rest-gateway: improve readability of tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239972 [10:15:51] (03CR) 10Blake: [C:03+1] "Thanks!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1243734 (owner: 10Elukey) [10:16:13] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: always fetch project name from keystone [puppet] - 10https://gerrit.wikimedia.org/r/1243125 (https://phabricator.wikimedia.org/T418236) (owner: 10Filippo Giunchedi) [10:17:34] (03PS7) 10Daniel Kinzler: rest-gateway: remove support for insecure user ID cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237295 (https://phabricator.wikimedia.org/T405578) [10:18:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1024.eqiad.wmnet with reason: host reimage [10:20:45] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp2045 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [10:20:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp2045 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [10:20:45] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp2045 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [10:20:45] PROBLEM - haproxy process on cp2045 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [10:22:27] (03PS1) 10Filippo Giunchedi: pontoon: fix missing type annotation [puppet] - 10https://gerrit.wikimedia.org/r/1243735 [10:22:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:23:08] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dborch1003.eqiad.wmnet with OS trixie [10:23:16] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: fix 
missing type annotation [puppet] - 10https://gerrit.wikimedia.org/r/1243735 (owner: 10Filippo Giunchedi) [10:24:51] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: disable external_services for minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242428 (https://phabricator.wikimedia.org/T414333) (owner: 10Daniel Kinzler) [10:24:59] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: use MINUTE limits in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239669 (owner: 10Daniel Kinzler) [10:25:10] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: fix x-wmf-ratelimit-policy in access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [10:26:51] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: improve readability of tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239972 (owner: 10Daniel Kinzler) [10:27:01] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: remove support for insecure user ID cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237295 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [10:27:02] (03Merged) 10jenkins-bot: rest-gateway: disable external_services for minikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242428 (https://phabricator.wikimedia.org/T414333) (owner: 10Daniel Kinzler) [10:27:05] (03CR) 10Volans: "Sorry, I had drafted a reply on monday afternoon and then forgot to finish it. Here it is." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [10:27:16] (03Merged) 10jenkins-bot: rest-gateway: use MINUTE limits in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239669 (owner: 10Daniel Kinzler) [10:27:26] (03Merged) 10jenkins-bot: rest-gateway: fix x-wmf-ratelimit-policy in access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240753 (https://phabricator.wikimedia.org/T413186) (owner: 10Daniel Kinzler) [10:28:36] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:29:08] (03Merged) 10jenkins-bot: rest-gateway: improve readability of tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1239972 (owner: 10Daniel Kinzler) [10:29:12] (03Merged) 10jenkins-bot: rest-gateway: remove support for insecure user ID cookies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237295 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [10:29:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:30:25] (03PS3) 10Federico Ceratto: site.pp: Setup dborch1003 [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) [10:30:49] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11649996 (10ayounsi) For asw1-23-ulsfo gNMI/TLS issue I've opened Nokia support case 05482268. --- ` We're currently provisioning two new switches. The first... 
[10:33:16] (03Abandoned) 10Filippo Giunchedi: hieradata: route toolhub probe alerts to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1243042 (https://phabricator.wikimedia.org/T316682) (owner: 10Filippo Giunchedi) [10:34:47] (03PS1) 10Jelto: aptrepo: add gitlab-runner-helper-images to ListShellHook [puppet] - 10https://gerrit.wikimedia.org/r/1243743 (https://phabricator.wikimedia.org/T418344) [10:35:03] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T416726#11650024 (10JMeybohm) Rebuild started, thanks! ` Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md0 : active raid1 sda2[2] sdb... [10:35:45] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [10:36:06] !log btullis@cumin2002 START - Cookbook sre.hosts.provision for host dse-k8s-worker1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:36:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [10:36:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1024.eqiad.wmnet with OS bookworm [10:36:46] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:37:44] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:39:12] !log btullis@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:39:35] PROBLEM - Host an-worker1206 is DOWN: PING CRITICAL - Packet loss = 100% [10:39:51] !log btullis@cumin2002 START - Cookbook 
sre.hosts.provision for host dse-k8s-worker1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:40:01] (03CR) 10JMeybohm: [C:03+1] docker_registry: remove the /test prefix special handling [puppet] - 10https://gerrit.wikimedia.org/r/1243726 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [10:40:19] (03CR) 10JMeybohm: [C:03+1] docker_registry: move the /v2/restricted prefix to s3/apus [puppet] - 10https://gerrit.wikimedia.org/r/1243727 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [10:41:45] !log btullis@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:41:46] (03CR) 10Blake: locking: Add a mechanism for a global Spicerack lock. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [10:43:23] !log btullis@cumin2002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1025.eqiad.wmnet with OS bookworm [10:43:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:43:41] (03CR) 10Clément Goubert: [C:03+1] mw-debug: Pilot new drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242522 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [10:43:43] RECOVERY - Host an-worker1206 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [10:43:47] (03CR) 10Clément Goubert: [C:03+1] mediawiki: Bump mesh.configuration and mesh.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242521 (https://phabricator.wikimedia.org/T364245) (owner: 
10Scott French) [10:43:56] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243748 [10:44:05] (03CR) 10Clément Goubert: [C:03+1] mesh: Support injection of extra env vars into envoy container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242520 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [10:44:33] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:46:15] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [10:46:36] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [10:47:49] (03PS3) 10Tiziano Fogli: thanos/querier (TMP): filter out non local ruler from query configs [puppet] - 10https://gerrit.wikimedia.org/r/1243133 (https://phabricator.wikimedia.org/T412924) [10:47:49] (03PS7) 10Tiziano Fogli: Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) [10:47:49] (03PS8) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [10:51:53] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243133 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [10:53:24] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1243743 (https://phabricator.wikimedia.org/T418344) (owner: 10Jelto) [10:53:29] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, 
wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:53:29] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:53:36] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [10:54:16] (03CR) 10Muehlenhoff: [C:03+2] apt: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1243035 (owner: 10Muehlenhoff) [10:54:55] (03CR) 10Jelto: [C:03+2] aptrepo: add gitlab-runner-helper-images to ListShellHook [puppet] - 10https://gerrit.wikimedia.org/r/1243743 (https://phabricator.wikimedia.org/T418344) (owner: 10Jelto) [10:55:59] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [10:56:30] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:56:30] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:57:54] !log btullis@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1025.eqiad.wmnet with reason: host reimage [10:58:22] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:59:20] (03CR) 10Muehlenhoff: P:cache::haproxy install haproxy from main on Trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [10:59:30] PROBLEM - PyBal backends health check on lvs1020 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [10:59:30] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1100) [11:00:22] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1028.eqiad.wmnet with OS bookworm [11:00:43] (03PS3) 10Fabfur: hiera: test haproxy 3.0 on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1242427 [11:00:44] (03CR) 10Michael Große: [C:03+1] [Growth] Log read failures when JSON schema validation is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242392 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [11:01:21] (03CR) 10Michael Große: [C:03+1] [Growth] Enable wmgGEMentorListJsonSchemaEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242466 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [11:02:52] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:03:05] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
dse-k8s-worker1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:03:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:03:56] !log btullis@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1025.eqiad.wmnet with reason: host reimage [11:04:17] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:05:15] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1242427 (owner: 10Fabfur) [11:05:38] (03CR) 10Michael Große: [C:03+1] feat(DataProvider): Allow logging of read validation failures [extensions/CommunityConfiguration] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243190 (https://phabricator.wikimedia.org/T417893) (owner: 10Urbanecm) [11:06:49] (03PS2) 10DCausse: opensearch-semantic-search: test cluster capacity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243696 [11:08:32] (03CR) 10Tiziano Fogli: "@kherron@wikimedia.org I won’t manage the isolation between the instances through firewall rules, but by configuring each instance to use " [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [11:09:11] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [11:09:16] (03CR) 10Muehlenhoff: [C:03+2] wmflib::service::probe::tcp_module_options: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1243137 (owner: 10Muehlenhoff) [11:13:07] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:13:14] (03PS1) 10Muehlenhoff: Remove role::insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/1243750 (https://phabricator.wikimedia.org/T365798) [11:13:33] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:13:39] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:13:48] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:14:41] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1028.eqiad.wmnet with reason: host reimage [11:14:44] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:14:58] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:15:34] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:15:48] (03CR) 10Vgutierrez: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [11:16:17] FIRING: [24x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:16:31] (03PS1) 10Muehlenhoff: profile::dns::recursor: Unconditionally enable the webserver [puppet] - 10https://gerrit.wikimedia.org/r/1243751 [11:17:32] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:18:07] (03PS1) 10Brouberol: httpd-cas: enable proxy, http_proxy and auth_basic modules [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1243753 (https://phabricator.wikimedia.org/T417990) [11:18:52] (03CR) 10Joal: [C:03+1] "LGTM!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1243753 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [11:19:01] (03PS1) 10Muehlenhoff: Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 [11:19:01] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [11:19:16] (03CR) 10Brouberol: [C:03+2] httpd-cas: enable proxy, http_proxy and auth_basic modules [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1243753 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [11:19:18] (03CR) 10Brouberol: [V:03+2 C:03+2] httpd-cas: enable proxy, http_proxy and auth_basic modules [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1243753 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [11:19:26] (03PS4) 10Tiziano Fogli: thanos/querier (TMP): filter out non local 
ruler from query configs [puppet] - 10https://gerrit.wikimedia.org/r/1243133 (https://phabricator.wikimedia.org/T412924) [11:19:27] (03PS8) 10Tiziano Fogli: Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) [11:19:27] (03PS9) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [11:20:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1028.eqiad.wmnet with reason: host reimage [11:20:48] (03PS2) 10Muehlenhoff: Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 [11:20:59] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [11:21:25] !log btullis@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin2002" [11:22:22] (03PS2) 10Muehlenhoff: Remove role::insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/1243750 (https://phabricator.wikimedia.org/T365798) [11:22:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [11:22:28] (03CR) 10Vgutierrez: [C:03+1] hiera: test haproxy 3.0 on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1242427 (owner: 10Fabfur) [11:22:42] (03CR) 10CI reject: [V:04-1] Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff) [11:23:21] (03CR) 10Pmiazga: [C:03+1] "I know it's already merged, just giving a info it's reviewed and LGTM." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1237295 (https://phabricator.wikimedia.org/T405578) (owner: 10Daniel Kinzler) [11:23:35] !log btullis@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin2002" [11:23:35] !log btullis@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1025.eqiad.wmnet with OS bookworm [11:24:01] (03PS1) 10Muehlenhoff: Remove LXC buster config [puppet] - 10https://gerrit.wikimedia.org/r/1243759 [11:25:47] !log depooling cp7009 to upgrade haproxy (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1242427) (T417253) [11:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:52] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [11:26:09] (03PS4) 10Fabfur: hiera: test haproxy 3.0 on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1242427 (https://phabricator.wikimedia.org/T417253) [11:26:13] (03Abandoned) 10David Caro: Revert "puppetdb: Drop firewall rule for access to Puppet 5 servers" [puppet] - 10https://gerrit.wikimedia.org/r/1243106 (https://phabricator.wikimedia.org/T365798) (owner: 10David Caro) [11:26:17] FIRING: [28x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:17] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7009.* [11:26:43] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:27:09] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:27:21] !log btullis@cumin1003 END (FAIL) 
- Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:27:50] (03CR) 10Fabfur: [C:03+2] hiera: test haproxy 3.0 on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1242427 (https://phabricator.wikimedia.org/T417253) (owner: 10Fabfur) [11:28:34] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:30:45] (03PS1) 10AikoChou: httpbb: fix the revertrisk test [puppet] - 10https://gerrit.wikimedia.org/r/1243768 [11:32:10] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:32:24] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:33:14] (03PS1) 10Daniel Kinzler: rest-gateway: use page/summary for testing shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243770 [11:33:20] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [11:33:31] (03PS2) 10Daniel Kinzler: rest-gateway: use page/summary for testing shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243770 [11:33:34] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:34:16] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [11:34:33] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [11:35:12] (03CR) 10Federico Ceratto: "Rebased after host deploy with insetup" [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [11:35:44] !log 
daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:36:28] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:36:29] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp7009*} and A:cp - 3.0 upgrade () [11:37:02] (03CR) 10Muehlenhoff: [C:03+2] Remove LXC buster config [puppet] - 10https://gerrit.wikimedia.org/r/1243759 (owner: 10Muehlenhoff) [11:37:16] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:37:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:37:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1028.eqiad.wmnet with OS bookworm [11:38:09] (03CR) 10Muehlenhoff: [C:03+2] Remove role::insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/1243750 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [11:38:09] (03PS1) 10David Caro: data: added backup yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1243775 [11:41:27] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:41:35] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp7009*} and A:cp - 3.0 upgrade () [11:41:35] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [11:42:00] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:42:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243751 
(owner: 10Muehlenhoff) [11:47:24] (03PS1) 10Muehlenhoff: etcd::client::globalconfig: Remove inactive check [puppet] - 10https://gerrit.wikimedia.org/r/1243787 [11:48:51] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:48:54] (03CR) 10Clément Goubert: "I'm not sure we want to add Python to all our envoy images to support this though, do we?" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1242462 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [11:49:13] (03PS1) 10Muehlenhoff: openstack: Remove two buster checks [puppet] - 10https://gerrit.wikimedia.org/r/1243788 [11:51:10] (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy install haproxy from main on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [11:51:14] (03PS1) 10Muehlenhoff: uwsgi: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1243792 [11:53:05] (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy install haproxy from main on Trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243728 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [11:53:49] (03CR) 10Clément Goubert: [C:03+2] trafficserver: cleanup redundant lint-related rest gateway routing config [puppet] - 10https://gerrit.wikimedia.org/r/1210631 (owner: 10Aaron Schulz) [11:53:53] (03PS1) 10Muehlenhoff: ldap::client::sssd: Only support socket activation [puppet] - 10https://gerrit.wikimedia.org/r/1243795 [11:54:30] (03CR) 10CI reject: [V:04-1] ldap::client::sssd: Only support socket activation [puppet] - 10https://gerrit.wikimedia.org/r/1243795 (owner: 10Muehlenhoff) [11:57:43] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:59:01] (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy set cp2045 to haproxy 3.0 [puppet] - 
10https://gerrit.wikimedia.org/r/1243720 (https://phabricator.wikimedia.org/T418161) (owner: 10Slyngshede) [12:00:05] mvolz: Your horoscope predicts another Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1200). [12:00:38] (03CR) 10Clément Goubert: [C:03+1] Simplify spec-json-wikimedia route and use meta.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242576 (https://phabricator.wikimedia.org/T418188) (owner: 10Aaron Schulz) [12:01:37] (03PS4) 10Itamar Givon: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) [12:01:46] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp2045 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:01:46] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp2045 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:01:46] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp2045 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:01:46] RECOVERY - haproxy process on cp2045 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [12:01:58] (03PS3) 10Daniel Kinzler: rest-gateway: use page/summary for testing shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243770 [12:03:32] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:04:05] !log 
daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:06:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [12:06:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [12:08:12] <_joe_> here [12:08:15] <_joe_> !ack [12:08:16] 7476 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [12:08:22] (03PS4) 10Daniel Kinzler: rest-gateway: use page/summary for testing shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243770 [12:08:37] (03PS5) 10Daniel Kinzler: rest-gateway: use page/summary for testing shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243770 [12:08:40] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243133 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [12:08:47] (03PS6) 10Daniel Kinzler: rest-gateway: use w/api.php for testing shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243770 [12:08:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1243775 (owner: 10David Caro) [12:09:26] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [12:09:35] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 
10Tiziano Fogli) [12:09:38] (03PS7) 10Daniel Kinzler: rest-gateway: use w/api.php for testing shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243770 [12:09:46] <_joe_> uhm [12:10:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243792 (owner: 10Muehlenhoff) [12:12:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [12:12:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:13:04] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243770 (owner: 10Daniel Kinzler) [12:13:38] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: use w/api.php for testing shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243770 (owner: 10Daniel Kinzler) [12:14:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243788 (owner: 10Muehlenhoff) [12:15:14] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:15:14] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:15:47] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - slyngshede@cumin1003" 
[12:15:56] (03Merged) 10jenkins-bot: rest-gateway: use w/api.php for testing shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243770 (owner: 10Daniel Kinzler) [12:16:46] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:16:54] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:16:59] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243748 (owner: 10PipelineBot) [12:18:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243787 (owner: 10Muehlenhoff) [12:18:51] slyngshede@cumin1003 reimage (PID 1187791) is awaiting input [12:18:53] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade. [12:19:13] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243748 (owner: 10PipelineBot) [12:19:49] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:20:42] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:22:02] Here [12:22:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:22:28] (03CR) 10Dima koushha: [C:03+1] "LGTM, thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [12:23:46] * urbanecm takes it is not a good time to deploy stuff [12:24:04] jouncebot: nowandnext [12:24:04] For the next 0 hour(s) and 35 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1200) [12:24:04] In 1 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1400) [12:25:50] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:26:12] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:26:50] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 118466 bytes in 0.872 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:26:53] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:27:27] RESOLVED: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:29:13] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [12:29:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:30:57] (03CR) 10Marostegui: "yeah let's merge this once the host is installed and ready to get a new role" [puppet] - 
10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [12:31:20] (03PS1) 10Daniel Kinzler: rest-gateway: no limits for wmcs for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243807 [12:32:14] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:32:36] (03PS1) 10Muehlenhoff: autoinstall: Remove buster support [puppet] - 10https://gerrit.wikimedia.org/r/1243808 [12:33:57] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:34:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:34:52] (03CR) 10Urbanecm: [C:03+2] feat(DataProvider): Allow logging of read validation failures [extensions/CommunityConfiguration] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243190 (https://phabricator.wikimedia.org/T417893) (owner: 10Urbanecm) [12:35:54] (03PS2) 10Joal: Add helm chart for turnilo UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240890 (https://phabricator.wikimedia.org/T416118) [12:36:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[12:36:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [12:37:04] (03Merged) 10jenkins-bot: feat(DataProvider): Allow logging of read validation failures [extensions/CommunityConfiguration] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243190 (https://phabricator.wikimedia.org/T417893) (owner: 10Urbanecm) [12:37:13] (03CR) 10Urbanecm: [C:03+2] [Growth] Log read failures when JSON schema validation is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242392 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [12:37:28] Does anyone mind if I deploy citoid? Shouldn't interview with mediawiki stuff [12:37:32] interfere* [12:37:56] * urbanecm has no objections [12:38:14] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:38:18] (03Merged) 10jenkins-bot: [Growth] Log read failures when JSON schema validation is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242392 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [12:38:26] !log repooling cp7009 (T417253) [12:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:32] T417253: Upgrade to HAProxy 3.0 on cache (bullseye) hosts - https://phabricator.wikimedia.org/T417253 [12:38:53] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7009.* [12:38:57] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:39:45] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:39:58] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1243190|feat(DataProvider): Allow logging of read validation failures (T417893)]], [[gerrit:1242392|[Growth] Log read failures when JSON schema validation is enabled (T417422 T417893)]] [12:40:04] T417893: CommunityConfiguration should allow logging of read validation failures - https://phabricator.wikimedia.org/T417893 [12:40:04] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [12:40:13] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:40:23] (03CR) 10Federico Ceratto: "The host is installed, can I get a +1 when you have a sec?" [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [12:41:17] FIRING: [34x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:41:45] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:42:11] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:42:16] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1243190|feat(DataProvider): Allow logging of read validation failures (T417893)]], [[gerrit:1242392|[Growth] Log read failures when JSON schema validation is enabled (T417422 T417893)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:42:52] !log urbanecm@deploy2002 urbanecm: Continuing with sync [12:43:52] (03PS1) 10Urbanecm: [Growth] testwiki: Enable wmgGEMentorListJsonSchemaEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243811 (https://phabricator.wikimedia.org/T417422) [12:43:57] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:44:53] (03CR) 10Muehlenhoff: "(The PCC error is unrelated)" [puppet] - 10https://gerrit.wikimedia.org/r/1243787 (owner: 10Muehlenhoff) [12:44:58] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:45:33] (03CR) 10Muehlenhoff: [C:03+2] autoinstall: Remove buster support [puppet] - 10https://gerrit.wikimedia.org/r/1243808 (owner: 10Muehlenhoff) [12:45:38] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:46:05] (03PS2) 10Urbanecm: [Growth] Enable wmgGEMentorListJsonSchemaEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242466 (https://phabricator.wikimedia.org/T417422) [12:46:17] FIRING: [36x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:46:40] (03CR) 10Urbanecm: [C:03+2] [Growth] testwiki: Enable wmgGEMentorListJsonSchemaEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243811 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [12:46:56] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243190|feat(DataProvider): Allow logging of read validation failures (T417893)]], [[gerrit:1242392|[Growth] Log read failures when JSON schema validation is enabled (T417422 T417893)]] (duration: 06m 57s) [12:47:01] 
T417893: CommunityConfiguration should allow logging of read validation failures - https://phabricator.wikimedia.org/T417893 [12:47:02] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [12:47:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243811 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [12:47:35] (03Merged) 10jenkins-bot: [Growth] testwiki: Enable wmgGEMentorListJsonSchemaEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243811 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [12:47:40] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:47:40] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:48:05] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1243811|[Growth] testwiki: Enable wmgGEMentorListJsonSchemaEnabled (T417422)]] [12:48:58] (03CR) 10Marostegui: [C:03+1] site.pp: Setup dborch1003 [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [12:49:59] (03PS3) 10Joal: Add helm chart for turnilo UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240890 (https://phabricator.wikimedia.org/T416118) [12:49:59] (03PS2) 10Joal: Add turnilo-next helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240891 (https://phabricator.wikimedia.org/T416121) [12:50:20] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1243811|[Growth] testwiki: Enable wmgGEMentorListJsonSchemaEnabled (T417422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:51:17] FIRING: [36x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:40] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:51:40] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:52:40] (03CR) 10Federico Ceratto: [C:03+2] site.pp: Setup dborch1003 [puppet] - 10https://gerrit.wikimedia.org/r/1243134 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [12:53:57] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2013:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:56:03] !log urbanecm@deploy2002 urbanecm: Continuing with sync [12:56:40] (03PS1) 10Muehlenhoff: package_builder: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1243813 [12:57:15] (03CR) 10CI reject: [V:04-1] package_builder: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1243813 (owner: 10Muehlenhoff) [13:00:00] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1243792 (owner: 10Muehlenhoff) [13:00:05] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243811|[Growth] testwiki: Enable wmgGEMentorListJsonSchemaEnabled (T417422)]] (duration: 12m 00s) [13:00:10] T417422: Switch `GrowthMentorList` Community Configuration 
provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [13:00:17] 06SRE, 06serviceops, 10API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#11650536 (10Jdforrester-WMF) Note to self: Same logic for Restbase applies here pending the outcome of T349118#11650527. [13:00:44] 06SRE, 10ChangeProp, 06Data-Engineering, 10EventStreams, and 3 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#11650541 (10Jdforrester-WMF) Note to self: Same logic for Restbase applies here pending the outcome of T349118#11650527. [13:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:01:17] FIRING: [34x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:01:22] (03PS2) 10Muehlenhoff: package_builder: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1243813 [13:01:23] 06SRE, 06serviceops-radar, 13Patch-For-Review: Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704#11650543 (10Jdforrester-WMF) 05Open→03Resolved a:03Jdforrester-WMF [13:02:02] (03CR) 10David Caro: [C:03+2] data: added backup yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1243775 (owner: 10David Caro) [13:03:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242466 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [13:04:33] (03Merged) 
10jenkins-bot: [Growth] Enable wmgGEMentorListJsonSchemaEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242466 (https://phabricator.wikimedia.org/T417422) (owner: 10Urbanecm) [13:05:02] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1242466|[Growth] Enable wmgGEMentorListJsonSchemaEnabled (T417422)]] [13:06:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243813 (owner: 10Muehlenhoff) [13:07:10] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1242466|[Growth] Enable wmgGEMentorListJsonSchemaEnabled (T417422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:07:15] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [13:07:16] (03CR) 10Elukey: locking: Add a mechanism for a global Spicerack lock. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [13:08:35] !log urbanecm@deploy2002 urbanecm: Continuing with sync [13:09:02] (03CR) 10Urbanecm: [Growth] Enable on every new Wikipedia by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239954 (https://phabricator.wikimedia.org/T304052) (owner: 10Urbanecm) [13:09:08] (03PS6) 10Urbanecm: [Growth] Enable on all open Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239949 (https://phabricator.wikimedia.org/T417023) [13:09:13] (03PS4) 10Urbanecm: [Growth] Enable on every new Wikipedia by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239954 (https://phabricator.wikimedia.org/T304052) [13:09:59] (03PS2) 10Muehlenhoff: ldap::client::sssd: Only support socket activation [puppet] - 10https://gerrit.wikimedia.org/r/1243795 [13:10:17] (03CR) 10Urbanecm: [C:03+2] [Growth] Enable on all open Wikipedias [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1239949 (https://phabricator.wikimedia.org/T417023) (owner: 10Urbanecm) [13:11:39] (03Merged) 10jenkins-bot: [Growth] Enable on all open Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1239949 (https://phabricator.wikimedia.org/T417023) (owner: 10Urbanecm) [13:12:04] (03CR) 10Elukey: [C:03+2] Revert "setup.py: Pin setuptools < 82.0.0 to make pkg_resources available." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1243734 (owner: 10Elukey) [13:12:25] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1242466|[Growth] Enable wmgGEMentorListJsonSchemaEnabled (T417422)]] (duration: 07m 24s) [13:12:30] T417422: Switch `GrowthMentorList` Community Configuration provider to JSON schema validation - https://phabricator.wikimedia.org/T417422 [13:14:58] (03CR) 10Blake: locking: Add a mechanism for a global Spicerack lock. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1239368 (https://phabricator.wikimedia.org/T330997) (owner: 10Blake) [13:15:15] (03PS1) 10David Caro: data: removed non-yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1243816 [13:16:02] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1239949|[Growth] Enable on all open Wikipedias (T417023)]] [13:16:06] T417023: Enable GrowthExperiments on all open Wikipedias - https://phabricator.wikimedia.org/T417023 [13:16:17] FIRING: [32x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:16:52] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11650583 (10Vgutierrez) got mcollins approval via Slack, we need #data-engineering approval now (that's @Milimetric / @Ottomata) [13:17:56] !log 
btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [13:18:22] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1239949|[Growth] Enable on all open Wikipedias (T417023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:19:34] (03CR) 10Brouberol: Add helm chart for turnilo UI (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240890 (https://phabricator.wikimedia.org/T416118) (owner: 10Joal) [13:19:38] 06SRE, 06serviceops: Update node 14/16 base images - https://phabricator.wikimedia.org/T318541#11650596 (10Jdforrester-WMF) 05Open→03Declined Nothing left in production uses these images any more; let's just Decline. [13:20:20] !log urbanecm@deploy2002 urbanecm: Continuing with sync [13:21:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243795 (owner: 10Muehlenhoff) [13:21:52] (03CR) 10Brouberol: Add turnilo-next helmfile and values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240891 (https://phabricator.wikimedia.org/T416121) (owner: 10Joal) [13:22:04] !log Starting deployment of the multi-instance Thanos Store Gateway patches for T412924 [13:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:08] T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924 [13:22:10] (03PS5) 10Itamar Givon: Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) [13:22:16] (03CR) 10Muehlenhoff: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1243816 (owner: 10David Caro) [13:22:30] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Eileen McFarland - https://phabricator.wikimedia.org/T418221#11650604 (10EMcFarland-WMF) @Aklapper I linked my 
LDAP account to my Phabricator account, and my 'LDAP User' account is now shown on my Phabricator profile. Thanks for catching that.... [13:22:58] (03CR) 10David Caro: [C:03+2] data: removed non-yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1243816 (owner: 10David Caro) [13:23:21] (03CR) 10Dima koushha: [C:03+1] Add configurations for graphql usage survey and its pipeline tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [13:24:13] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1239949|[Growth] Enable on all open Wikipedias (T417023)]] (duration: 08m 11s) [13:24:18] T417023: Enable GrowthExperiments on all open Wikipedias - https://phabricator.wikimedia.org/T417023 [13:24:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:25:13] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:25:39] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [13:26:30] (03CR) 10Muehlenhoff: [C:03+2] uwsgi: Remove support for Buster [puppet] - 10https://gerrit.wikimedia.org/r/1243792 (owner: 10Muehlenhoff) [13:26:30] (03CR) 10Tiziano Fogli: [C:03+2] thanos/querier (TMP): filter out non local ruler from query configs [puppet] - 10https://gerrit.wikimedia.org/r/1243133 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [13:27:05] tappof: ok to merge your thanos patch along? [13:27:19] yes moritzm thx [13:27:40] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[13:28:45] tappof: merged [13:28:51] moritzm: ack [13:29:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:29:45] (03PS1) 10Muehlenhoff: package_builder: Remove spec file [puppet] - 10https://gerrit.wikimedia.org/r/1243819 [13:30:25] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:33:03] (03PS1) 10Muehlenhoff: matomo: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1243820 [13:33:53] (03CR) 10Marostegui: "dry run?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [13:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:34:28] (03CR) 10Marostegui: "Or confirmation?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [13:35:12] (03PS1) 10Muehlenhoff: kernel_report: Support trixie [puppet] - 10https://gerrit.wikimedia.org/r/1243822 [13:36:05] (03PS2) 10Anzx: zhwiki: remove accountcreator usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243817 (https://phabricator.wikimedia.org/T418089) [13:39:28] (03PS1) 10Muehlenhoff: check_timedatectl: Drop support for old systemd versions [puppet] - 10https://gerrit.wikimedia.org/r/1243824 [13:39:38] (03PS3) 10Anzx: zhwiki: remove accountcreator usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243817 (https://phabricator.wikimedia.org/T418089) [13:39:58] (03PS2) 10AikoChou: httpbb: fix the revertrisk test [puppet] - 10https://gerrit.wikimedia.org/r/1243768 [13:40:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243817 (https://phabricator.wikimedia.org/T418089) (owner: 10Anzx) [13:40:39] (03PS1) 10Muehlenhoff: Drop one role frm insetup_role_report [puppet] - 10https://gerrit.wikimedia.org/r/1243825 [13:41:46] (03CR) 10Hashar: [C:03+1] "Great thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1243183 (https://phabricator.wikimedia.org/T417247) (owner: 10Dzahn) [13:43:33] (03PS1) 10Elukey: setup.py: limit sphinx's version [software/spicerack] - 10https://gerrit.wikimedia.org/r/1243827 [13:43:54] (03PS1) 10Muehlenhoff: Remove profile to build Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1243828 [13:44:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:44:53] (03CR) 10Gkyziridis: [C:03+1] "Thnx for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/1243768 (owner: 10AikoChou) [13:45:03] (03CR) 10Muehlenhoff: [C:03+2] kernel_report: Support trixie [puppet] - 10https://gerrit.wikimedia.org/r/1243822 (owner: 10Muehlenhoff) [13:46:03] (03CR) 10Tiziano Fogli: [C:03+2] Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [13:46:26] (03CR) 10ArielGlenn: [C:03+1] "I read the gdoc, this reflects the agreed changes, good to go" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243807 (owner: 10Daniel Kinzler) [13:47:04] 06SRE: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377 (10derenrich) 03NEW [13:47:16] PROBLEM - orchestrator.wikimedia.org requires authentication on dborch1003 is CRITICAL: connect to address 10.64.0.20 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:47:31] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1026 [13:47:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1026 [13:48:06] !log btullis@cumin1003 END (ERROR) - Cookbook 
sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [13:48:14] (03CR) 10Hashar: [C:03+1] "I don't quite know what this file is used for, but it has indeed been renamed :-]" [puppet] - 10https://gerrit.wikimedia.org/r/1243188 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [13:49:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:49:30] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [13:50:03] (03PS1) 10AikoChou: ml-services: align image version for rr-multilingual model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243829 [13:51:44] (03CR) 10Hashar: [C:03+1] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/1243187 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [13:52:15] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11650735 (10Blake) [13:52:16] PROBLEM - orchestrator.wikimedia.org tls expiry on dborch1003 is CRITICAL: connect to address 10.64.0.20 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:53:43] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM overall, I'll let Cole comment re: hooks" [puppet] - 10https://gerrit.wikimedia.org/r/1243813 (owner: 10Muehlenhoff) [13:54:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:56:04] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for mikez - 
https://phabricator.wikimedia.org/T418098#11650740 (10Milimetric) approved! Welcome to moar data [13:57:08] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1240416 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [13:58:30] (03CR) 10LSobanski: [C:03+1] Drop one role frm insetup_role_report [puppet] - 10https://gerrit.wikimedia.org/r/1243825 (owner: 10Muehlenhoff) [13:59:23] (03CR) 10Volans: [C:03+1] "LGTM for now, thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1243827 (owner: 10Elukey) [13:59:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:59:43] (03PS2) 10Jforrester: wikifunctions: Turn on custom oTel spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243695 (https://phabricator.wikimedia.org/T417750) (owner: 10Ecarg) [13:59:43] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-02-12-145008 to 2026-02-25-131752 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243830 (https://phabricator.wikimedia.org/T417024) [13:59:45] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-02-18-140059 to 2026-02-25-124326 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243831 (https://phabricator.wikimedia.org/T413728) [13:59:47] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1400). [14:00:05] Tran, itamarWMDE, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] o/ [14:00:16] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [14:00:17] o/ [14:00:19] (03CR) 10Gkyziridis: [C:03+1] ml-services: align image version for rr-multilingual model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243829 (owner: 10AikoChou) [14:00:20] hey hey 0/ [14:00:21] o/ I can’t deploy, sorry [14:00:54] I can self-deploy and if anyone's patch is "someone runs spider-pig for you" then I can help with that [14:01:17] FIRING: [30x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:01:55] (03Abandoned) 10Jelto: tcpproxy: raise connection limit from 200 to 400 [puppet] - 10https://gerrit.wikimedia.org/r/1239928 (https://phabricator.wikimedia.org/T417497) (owner: 10Jelto) [14:01:56] I can resched for tomorrow if no deployers are found. [14:01:59] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2356.codfw.wmnet [14:02:01] !log kamila@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2356.codfw.wmnet [14:02:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243820 (owner: 10Muehlenhoff) [14:02:47] Tran: itamarWMDE’s change looks like a spiderpig job to me [14:02:57] I’m not sure about anzx’s, that might need a maintenance script? [14:04:19] I'm going to get started on mine and can do itamarWMDE’s after then [14:04:29] 100% thanks [14:04:31] thanks! 
[14:04:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233674 (https://phabricator.wikimedia.org/T413951) (owner: 10STran) [14:05:21] Tran: please if could i need someone to deploy for me, https://www.mediawiki.org/wiki/Manual:EmptyUserGroup.php/de needed to be run for zhwiki [14:05:24] (03CR) 10Elukey: [C:03+2] setup.py: limit sphinx's version [software/spicerack] - 10https://gerrit.wikimedia.org/r/1243827 (owner: 10Elukey) [14:05:45] (03Merged) 10jenkins-bot: Remove deprecated IRS v2 configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1233674 (https://phabricator.wikimedia.org/T413951) (owner: 10STran) [14:06:14] !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1233674|Remove deprecated IRS v2 configurations (T413951)]] [14:06:19] T413951: Deprecate v1 non emergency flow for IRS - https://phabricator.wikimedia.org/T413951 [14:07:47] anzx: I can help you with the spiderpig config deployment. I can also run the script on prod but can I get the specific commands to run on a sever? I don't do it often and don't know how to run it for zhwiki specifically. [14:09:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:10:00] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Eileen McFarland - https://phabricator.wikimedia.org/T418221#11650779 (10Vgutierrez) SSH has been verified out-of-band [14:10:35] !log stran@deploy2002 stran: Backport for [[gerrit:1233674|Remove deprecated IRS v2 configurations (T413951)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:10:46] Tran: I think it should be: mwscript-k8s --comment=T418089 --follow --sal -- emptyUserGroup zhwiki accountcreator [14:10:47] T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089 [14:10:49] 06SRE, 06Traffic: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11650785 (10Aklapper) What is the exact and full error message? What is the exact User Agent string? [14:10:53] but hard to check when I don’t have access to my shell history [14:11:00] testing my changes now [14:11:06] judging by https://sal.toolforge.org/production?p=0&q=emptyUserGroup*&d= you might also want a --log-reason [14:11:26] (03CR) 10Jelto: "I agree with Daniels comment, I'd also prefer to not vendor the full config file . But this is not a blocker for me if `conf-available` do" [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [14:11:27] anzx: should the group removals by the maintenance script be added to Special:Log (i.e. --log-reason in the maintenance script)? [14:11:46] oh hm testserver sync failed. Retrying [14:12:30] Lucas_WMDE: yes , with reason being [[phab:T418089]] [14:13:10] alright, then add --log-reason=[[phab:T418089]] to the end of the mwscript-k8s command Tran [14:14:16] please hold, is something up with the test servers? I'm seeing logs from spiderpig suggesting it's not up: [14:14:16] `ERRORS: 156 requests attempted to mwdebug-next.discovery.wmnet. 
Errors connecting to 1 host.` [14:14:27] FIRING: [3x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:15:44] no idea :/ I can see https://alerts.wikimedia.org/?q=alertname%3DPyBal%20backends%20health%20check&q=team%3Dsre&q=%40receiver%3Dirc-spam in alerts.w.o but I’ve never been able to understand those alerts [14:15:54] anyone else? [14:16:18] `ERROR: HTTPSConnectionPool(host='mwdebug-next.discovery.wmnet', port=4453): Max retries exceeded with url: / (Caused by ConnectTimeoutError(, 'Connection to mwdebug-next.discovery.wmnet timed out. (connect timeout=10)'))` is the more specific error [14:16:38] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Eileen McFarland - https://phabricator.wikimedia.org/T418221#11650802 (10Vgutierrez) [14:16:52] * urbanecm is taking a look [14:16:56] (03PS1) 10Vgutierrez: admin: Add emc-wmf to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1243835 (https://phabricator.wikimedia.org/T418221) [14:17:05] attempting a retry [14:17:34] (03PS13) 10Federico Ceratto: mysql: update replication source [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) [14:17:42] (03CR) 10CI reject: [V:04-1] admin: Add emc-wmf to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1243835 (https://phabricator.wikimedia.org/T418221) (owner: 10Vgutierrez) [14:18:13] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:18:31] <_joe_> it's one of those days is it [14:18:38] (03PS2) 10Vgutierrez: admin: Add emc-wmf to the deployment group 
[puppet] - 10https://gerrit.wikimedia.org/r/1243835 (https://phabricator.wikimedia.org/T418221) [14:19:02] and i'm fairly sure that alert has something to do with mwdebug-next.discovery.wmnet being unreachable [14:19:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:20:01] i'd suggest to abort the deployment and wait [14:20:23] 👍 canceled my backport [14:20:43] Tran: please don't forget to revert the patch too :)) [14:20:49] yup, was about to ask ^^ [14:20:53] Yes I was going to ask, do I have to manually revert and merge that? [14:21:08] yep [14:21:10] I think SpiderPig might have a button for it as well? [14:21:17] FIRING: [28x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:28] but I think you should also deploy the revert, not just merge it, so that it gets pulled onto the deployment host itself [14:21:31] (03PS1) 10STran: Revert "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243836 [14:21:31] actually, the revert would need to be deployed, to restore mwdebug [14:21:34] (and then it’s okay if it can’t be deployed everywhere) [14:21:38] right, mwdebug too [14:22:04] (03PS2) 10Ayounsi: [WIP] Add depool strategy for rack depool cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1243077 (https://phabricator.wikimedia.org/T327300) [14:22:28] Do I just deploy it and ignore the testserver sync then? Or run spiderpig until it errors? 
[14:22:30] <_joe_> !ack [14:22:31] 7477 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [14:22:36] <_joe_> Tran: please hold on [14:22:44] :+1 [14:23:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1243835 (https://phabricator.wikimedia.org/T418221) (owner: 10Vgutierrez) [14:23:08] 06SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 06MW-Interfaces-Team, and 2 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424#11650826 (10HCoplin-WMF) @daniel & @MSantos -- Is this still a concern? or are... [14:23:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:23:33] (03CR) 10Muehlenhoff: [C:03+2] package_builder: Remove spec file [puppet] - 10https://gerrit.wikimedia.org/r/1243819 (owner: 10Muehlenhoff) [14:24:27] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2011:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:25:35] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [14:25:55] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [14:26:17] FIRING: [28x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:26:17] !log slyngshede@cumin1003 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - slyngshede@cumin1003" [14:26:18] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2045.codfw.wmnet with OS trixie [14:26:35] <_joe_> Tran: you can proceed [14:27:15] Tran: my expectation, once you’re no longer on hold, would be: abort the running scap; merge the revert; try to deploy that (and either let it run through or abort if it encounters the same error) [14:27:24] _joe_: i'm still interminneltly unable to reach mwdebug.discovery.wmnet, which is what scap was erroring with. [14:27:26] but maybe urbanecm has a better opinion (since I can’t see the spiderpig log) :) [14:27:28] is that expected? [14:27:46] <_joe_> urbanecm: can you elaborate a bit? [14:27:56] <_joe_> what is unable to reach mwdebug? [14:28:02] <_joe_> yo from your browser? [14:28:23] no, scap. [14:28:28] or deployment host, to be more precise [14:28:32] <_joe_> uhm no idea tbh [14:28:35] https://www.irccloud.com/pastebin/SZ3d1HKu/ [14:28:59] this is what it fails. and i'm still having timeout when trying to connect to mwdebug.discovery.wmnet:4444 via curl. [14:29:00] !log installing openssl security updates [14:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:34] <_joe_> ah so it's httpbb [14:29:42] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:29:45] <_joe_> claime / jayme can you PTAL? 
[14:30:00] <_joe_> looks like mwdebug was left in a funny state [14:30:08] full logs are at https://spiderpig.wikimedia.org/jobs/1412, if needed [14:30:18] (03CR) 10Muehlenhoff: [C:03+2] Drop one role frm insetup_role_report [puppet] - 10https://gerrit.wikimedia.org/r/1243825 (owner: 10Muehlenhoff) [14:32:08] 👀 [14:35:09] (03PS1) 10Elukey: Apply role to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) [14:35:17] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11650875 (10mikez-WMF) Ah thank you for doing that! I was also confused why I couldn't find her in Phabricator and was going to ask in our 1:1 later today. I appreciate it! [14:37:03] (03CR) 10Elukey: [C:03+2] httpbb: fix the revertrisk test [puppet] - 10https://gerrit.wikimedia.org/r/1243768 (owner: 10AikoChou) [14:37:53] (03PS4) 10Joal: Add helm chart for turnilo UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240890 (https://phabricator.wikimedia.org/T416118) [14:41:07] (03CR) 10Joal: Add helm chart for turnilo UI (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240890 (https://phabricator.wikimedia.org/T416118) (owner: 10Joal) [14:41:17] FIRING: [26x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:42] <_joe_> jayme: I assume you're looking into it? [14:42:02] yeah...sometimes it works, sometimes it does not [14:42:30] (03CR) 10Vgutierrez: [C:03+2] admin: Add emc-wmf to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1243835 (https://phabricator.wikimedia.org/T418221) (owner: 10Vgutierrez) [14:42:31] and a bunch of pybal warnings... 
[14:43:20] (03PS1) 10Awight: Add helm value to optionally allow egress for airflow-wmde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243841 (https://phabricator.wikimedia.org/T417633) [14:43:21] which seems a bit off since the mw-debug pods are okay [14:43:32] (03PS1) 10Brouberol: idp_test: mirror the configuration we have for turnilo on idp [puppet] - 10https://gerrit.wikimedia.org/r/1243842 (https://phabricator.wikimedia.org/T417990) [14:44:15] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243842 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [14:44:19] (03PS1) 10Muehlenhoff: profile::pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) [14:44:51] (03CR) 10CI reject: [V:04-1] profile::pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [14:45:03] (03CR) 10AikoChou: [C:03+2] ml-services: align image version for rr-multilingual model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243829 (owner: 10AikoChou) [14:45:19] <_joe_> ok sorry, there was no communication here [14:46:21] (03PS5) 10Joal: Add helm chart for turnilo UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240890 (https://phabricator.wikimedia.org/T416118) [14:46:26] (03CR) 10Btullis: idp_test: mirror the configuration we have for turnilo on idp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243842 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [14:46:52] (03CR) 10Brouberol: Add helm value to optionally allow egress for airflow-wmde (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243841 (https://phabricator.wikimedia.org/T417633) (owner: 10Awight) [14:46:56] (03Merged) 10jenkins-bot: ml-services: align image version for 
rr-multilingual model on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243829 (owner: 10AikoChou) [14:47:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for Eileen McFarland - https://phabricator.wikimedia.org/T418221#11650928 (10Vgutierrez) 05Open→03Resolved change has been merged, please allow puppet to propagate the change, it could take up to 30 minutes [14:48:36] (03PS3) 10Joal: Add turnilo-next helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240891 (https://phabricator.wikimedia.org/T416121) [14:48:45] (03PS2) 10Awight: Add helm value to optionally allow egress for airflow-wmde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243841 (https://phabricator.wikimedia.org/T417633) [14:48:46] (03CR) 10Joal: Add turnilo-next helmfile and values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240891 (https://phabricator.wikimedia.org/T416121) (owner: 10Joal) [14:49:19] (03CR) 10Awight: Add helm value to optionally allow egress for airflow-wmde (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243841 (https://phabricator.wikimedia.org/T417633) (owner: 10Awight) [14:50:05] (03CR) 10Brouberol: idp_test: mirror the configuration we have for turnilo on idp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243842 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [14:50:15] (03PS2) 10Muehlenhoff: profile::pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) [14:50:40] (03CR) 10Joal: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1243842 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [14:50:51] (03CR) 10CI reject: [V:04-1] profile::pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [14:50:52] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11650944 (10MBinder_WMF) Yep, logged in on a different machine entirely and still get this: {F72410569} [14:50:57] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8144/console" [puppet] - 10https://gerrit.wikimedia.org/r/1243842 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [14:51:17] FIRING: [26x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:38] (03PS2) 10Brouberol: idp_test: mirror the configuration we have for turnilo on idp [puppet] - 10https://gerrit.wikimedia.org/r/1243842 (https://phabricator.wikimedia.org/T417990) [14:51:39] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11650945 (10MBinder_WMF) I can still log into https://idp.wikimedia.org/ [14:51:58] (03CR) 10Btullis: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1243842 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [14:52:13] (03CR) 10Brouberol: [C:03+1] "Nicely done!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1240890 (https://phabricator.wikimedia.org/T416118) (owner: 10Joal) [14:52:27] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8145/co" [puppet] - 10https://gerrit.wikimedia.org/r/1243842 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [14:52:55] !log push pfw policies - T418305 [14:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:03] (03CR) 10Brouberol: Add turnilo-next helmfile and values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240891 (https://phabricator.wikimedia.org/T416121) (owner: 10Joal) [14:54:32] (03CR) 10Eevans: [C:03+2] deployment_server: add linked-artifacts kubeconfig files [puppet] - 10https://gerrit.wikimedia.org/r/1237258 (https://phabricator.wikimedia.org/T414112) (owner: 10Jelto) [14:56:12] (03CR) 10Brouberol: [V:03+1 C:03+2] idp_test: mirror the configuration we have for turnilo on idp [puppet] - 10https://gerrit.wikimedia.org/r/1243842 (https://phabricator.wikimedia.org/T417990) (owner: 10Brouberol) [14:56:17] FIRING: [28x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:35] stepping away for a minute but will still be around to resolve my deploy. The window is ending however. Is there anything I can do in the meanwhile or is it better to wait? [14:57:02] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11650968 (10MatthewVernon) OK, I've tagged #data-engineering, since I think this is their ballpark now. 
Hopefully they can help :) [14:57:24] Tran: can you link the revert here please? [14:57:31] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1243836 [14:57:33] ty [14:57:46] (03CR) 10Brouberol: [C:03+2] Add helm value to optionally allow egress for airflow-wmde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243841 (https://phabricator.wikimedia.org/T417633) (owner: 10Awight) [14:58:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability: RAM upgrade availability for Titan hosts - https://phabricator.wikimedia.org/T416741#11650972 (10herron) [14:58:01] 10ops-codfw, 06SRE, 06DC-Ops: RAM upgrade availability for Titan hosts - https://phabricator.wikimedia.org/T417336#11650971 (10herron) [14:58:18] i'm going to merge it and pull to deployment host manually. then, it's up to jay.me to fix mwdebug. [14:58:26] Tran: i suggest rescheduling for another window, unfortunately :/ [14:58:35] (03CR) 10Urbanecm: [C:03+2] Revert "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243836 (owner: 10STran) [14:59:02] urbanecm: could you link me to what was deployed please? since the mwdebug issues started at 14:10Z [14:59:11] that's fine, cc itamarWMDE and anzx who also had patches this window which probably won't be deployed in that case [14:59:12] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:59:38] (03PS3) 10Muehlenhoff: profile::pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) [14:59:39] jayme: we merged https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1233674, but scap failed with "mwdebug timeouts", so the deployment was aborted. 
[15:00:00] also see https://spiderpig.wikimedia.org/jobs/1412 for logs [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1500) [15:00:07] does that help? [15:00:16] (03Merged) 10jenkins-bot: Revert "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243836 (owner: 10STran) [15:00:20] (03CR) 10CI reject: [V:04-1] profile::pki::multirootca: Adapt firewall config to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1243843 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [15:00:35] (03CR) 10Brouberol: [C:03+2] Add helm chart for turnilo UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240890 (https://phabricator.wikimedia.org/T416118) (owner: 10Joal) [15:00:52] Tran: revert pulled to deploy2002, `/srv/mediawiki-staging` should now be consistent with what's in prod. [15:00:57] Thanks so much! [15:01:12] any time! [15:01:15] not sure yet. I was just correlating the start of errors on the loadbalancers with when you ran the deployment [15:01:17] RESOLVED: [2x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:18] (03PS1) 10Vgutierrez: admin: Add mikez to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1243845 (https://phabricator.wikimedia.org/T418098) [15:02:17] (03PS1) 10Clément Goubert: statsd-exporter: Add 2 replicas for mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243846 (https://phabricator.wikimedia.org/T418383) [15:02:33] thanks urbanecm! 
[15:02:45] np [15:03:03] (FWIW I’ll hopefully pick up a new yubikey tomorrow and then I should be able to deploy again soon) [15:03:16] (03PS1) 10STran: Revert^2 "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243847 [15:03:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243847 (owner: 10STran) [15:03:55] Tran: (nit) can you link that revert^2 to the task too? [15:04:03] (03PS2) 10Volans: .wmfconfig: remove bullseye build [software/spicerack] - 10https://gerrit.wikimedia.org/r/1237932 [15:04:03] (03PS1) 10Volans: Drop support for older Python versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1243848 [15:04:30] (03PS2) 10STran: Revert^2 "Remove deprecated IRS v2 configurations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243847 (https://phabricator.wikimedia.org/T413951) [15:04:52] (03CR) 10Blake: [C:03+1] statsd-exporter: Add 2 replicas for mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243846 (https://phabricator.wikimedia.org/T418383) (owner: 10Clément Goubert) [15:05:16] urbanecm: done, thanks! For some reason I just never noticed that reverts don't carry the bug id [15:05:23] (03CR) 10Volans: .wmfconfig: remove bullseye build (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1237932 (owner: 10Volans) [15:05:27] thanks! 
indeed, it's a bit unfortunate :/ [15:06:38] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-02-12-145008 to 2026-02-25-131752 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243830 (https://phabricator.wikimedia.org/T417024) [15:06:38] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-02-18-140059 to 2026-02-25-124326 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243831 (https://phabricator.wikimedia.org/T413728) [15:06:39] (03PS3) 10Jforrester: wikifunctions: Turn on custom oTel spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243695 (https://phabricator.wikimedia.org/T417750) (owner: 10Ecarg) [15:06:58] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11651035 (10Aklapper) @MBinder_WMF: Please feel also free to [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/s... [15:07:00] 06SRE, 06serviceops, 10API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#11651036 (10Krinkle) [15:07:15] (03PS4) 10Joal: Add turnilo-next helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240891 (https://phabricator.wikimedia.org/T416121) [15:07:24] (03CR) 10Joal: Add turnilo-next helmfile and values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240891 (https://phabricator.wikimedia.org/T416121) (owner: 10Joal) [15:08:07] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2026-02-12-145008 to 2026-02-25-131752 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243830 (https://phabricator.wikimedia.org/T417024) (owner: 10Jforrester) [15:09:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1237932 (owner: 10Volans) [15:09:21] (03PS1) 10Eevans: admin_ng: add namespace for 
linked-artifacts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243850 (https://phabricator.wikimedia.org/T414112) [15:09:27] (03CR) 10Arnaudb: [C:03+1] "note: here the service_account is like a "kubernetes" service account, not a system one, it's perhaps a confusing naming from me! that use" [puppet] - 10https://gerrit.wikimedia.org/r/1243187 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [15:10:14] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-02-12-145008 to 2026-02-25-131752 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243830 (https://phabricator.wikimedia.org/T417024) (owner: 10Jforrester) [15:10:46] 06SRE, 10ChangeProp, 06Data-Engineering, 10EventStreams, and 3 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#11651054 (10Krinkle) 05Open→03Resolved [15:10:54] (03CR) 10Brouberol: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240891 (https://phabricator.wikimedia.org/T416121) (owner: 10Joal) [15:10:56] (03CR) 10Clément Goubert: [C:03+2] statsd-exporter: Add 2 replicas for mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243846 (https://phabricator.wikimedia.org/T418383) (owner: 10Clément Goubert) [15:11:26] (03CR) 10Brouberol: [C:03+2] Add turnilo-next helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240891 (https://phabricator.wikimedia.org/T416121) (owner: 10Joal) [15:12:58] Tran: urbanecm: Lucas_WMDE: There was an issue on our side of things, I apologise. The test (and scap failing) actually prevented a "real" outage here I suppose [15:13:08] (03Merged) 10jenkins-bot: statsd-exporter: Add 2 replicas for mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243846 (https://phabricator.wikimedia.org/T418383) (owner: 10Clément Goubert) [15:14:02] which is good! :) [15:14:09] jayme: does that mean we should wait for the issue to be fixed? 
[15:14:11] or is scap safe to run again? [15:14:43] I just fixed it, you should be good to run scap again [15:14:49] ack, thanks! [15:15:08] trying on the revert [15:15:29] (03CR) 10Scott French: [C:03+1] "Thanks, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1243787 (owner: 10Muehlenhoff) [15:15:35] (03PS1) 10Clément Goubert: shellbox-constraints: Bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243851 [15:15:40] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1243836|Revert "Remove deprecated IRS v2 configurations"]] [15:16:22] urbanecm: lmk how it goes [15:16:30] will do! [15:16:45] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:17:23] (03CR) 10Btullis: [C:03+2] test-kitchen kubernetes chart: New config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242438 (https://phabricator.wikimedia.org/T418088) (owner: 10Santiago Faci) [15:17:31] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:17:32] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:17:43] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:17:47] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:17:55] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:18:06] !log urbanecm@deploy2002 stran, urbanecm: Backport for [[gerrit:1243836|Revert "Remove deprecated IRS v2 configurations"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:18:06] (03PS1) 10Urbanecm: cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243853 [15:18:07] (03PS1) 10Urbanecm: tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243854 [15:18:08] (statsd replica bump, nothing to do directly with mw) [15:18:23] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:18:35] (03PS1) 10Urbanecm: ExperimentManager: remove geForceVariant flag handling [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243855 (https://phabricator.wikimedia.org/T416894) [15:18:37] (03PS1) 10Urbanecm: SiteNoticeGenerator: stop adding per-variant classes [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243856 (https://phabricator.wikimedia.org/T416894) [15:18:39] (03PS1) 10Urbanecm: Experiments: introduce IExperimentManager [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243857 (https://phabricator.wikimedia.org/T375198) [15:18:43] (03PS1) 10Urbanecm: Remove PHPDoc blocks that are 100% identical to the code [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243858 [15:18:45] (03PS1) 10Urbanecm: cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243859 [15:18:51] (03PS1) 10Urbanecm: tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 [15:18:55] (03CR) 10Blake: [C:03+1] shellbox-constraints: Bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243851 (owner: 10Clément Goubert) [15:19:03] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:19:21] !log ecarg@deploy2002 
helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:19:30] (03Merged) 10jenkins-bot: test-kitchen kubernetes chart: New config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1242438 (https://phabricator.wikimedia.org/T418088) (owner: 10Santiago Faci) [15:19:40] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:19:40] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:19:43] (03CR) 10Urbanecm: [C:03+2] cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243853 (owner: 10Urbanecm) [15:19:47] (03CR) 10Urbanecm: [C:03+2] tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243854 (owner: 10Urbanecm) [15:19:55] !log urbanecm@deploy2002 stran, urbanecm: Continuing with sync [15:20:06] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:20:08] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-private-users for maxbinderWMF - https://phabricator.wikimedia.org/T417655#11651112 (10Ottomata) A failure while logging in is more related to CAS (?right?), which IIUC is using LDAP for authorization. Is Max in either the... 
[15:20:15] (03PS1) 10Kamila Součková: Revert "conftool-data: add wikikube-worker2356 to test nokia switches" [puppet] - 10https://gerrit.wikimedia.org/r/1243862 [15:20:50] (03CR) 10CI reject: [V:04-1] Revert "conftool-data: add wikikube-worker2356 to test nokia switches" [puppet] - 10https://gerrit.wikimedia.org/r/1243862 (owner: 10Kamila Součková) [15:20:53] (03CR) 10Blake: [C:03+1] Revert "conftool-data: add wikikube-worker2356 to test nokia switches" [puppet] - 10https://gerrit.wikimedia.org/r/1243862 (owner: 10Kamila Součková) [15:21:07] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-02-18-140059 to 2026-02-25-124326 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243831 (https://phabricator.wikimedia.org/T413728) (owner: 10Jforrester) [15:21:21] (03CR) 10Jelto: "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243850 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [15:21:31] (03CR) 10Jelto: [C:03+1] admin_ng: add namespace for linked-artifacts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243850 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [15:21:34] (03PS2) 10Kamila Součková: Revert "conftool-data: add wikikube-worker2356 to test nokia switches" [puppet] - 10https://gerrit.wikimedia.org/r/1243862 [15:22:25] (03CR) 10Kamila Součková: [C:03+2] Revert "conftool-data: add wikikube-worker2356 to test nokia switches" [puppet] - 10https://gerrit.wikimedia.org/r/1243862 (owner: 10Kamila Součková) [15:22:34] (03CR) 10JMeybohm: [C:03+1] admin_ng: add namespace for linked-artifacts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243850 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [15:23:51] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243836|Revert "Remove deprecated IRS v2 configurations"]] (duration: 08m 11s) [15:24:05] jayme: worked like a charm! thanks for the fix [15:24:09] (curious, what was the cause?) 
[15:24:11] (03CR) 10Bking: [C:03+1] opensearch-semantic-search: test cluster capacity [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243696 (owner: 10DCausse) [15:24:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe2023.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:25:10] (03CR) 10CI reject: [V:04-1] Remove PHPDoc blocks that are 100% identical to the code [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243858 (owner: 10Urbanecm) [15:25:22] urbanecm: one k8s worker had/has issues with BGP peering to a new switch and since it was empty (the worker), the mw-debug pods got scheduled there (most of them) so they were unreachable - although happy [15:25:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2023.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:25:30] (03CR) 10CI reject: [V:04-1] Experiments: introduce IExperimentManager [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243857 (https://phabricator.wikimedia.org/T375198) (owner: 10Urbanecm) [15:26:02] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-02-18-140059 to 2026-02-25-124326 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243831 (https://phabricator.wikimedia.org/T413728) (owner: 10Jforrester) [15:26:17] (03PS14) 10Federico Ceratto: mysql: update replication source [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) [15:26:22] urbanecm: my bad, I checked that the pods were happy but not that they were reachable '^^ [15:26:27] heh, so that's why it was intermittent! thanks for the info. 
[15:26:35] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe2023.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:26:41] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:26:42] 06SRE, 06serviceops, 10API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#11651149 (10Jdforrester-WMF) 05Open→03Resolved a:03Jdforrester-WMF [15:27:14] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:27:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243853 (owner: 10Urbanecm) [15:27:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243854 (owner: 10Urbanecm) [15:28:21] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:28:29] (03PS4) 10Jforrester: wikifunctions: Turn on custom oTel spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243695 (https://phabricator.wikimedia.org/T417750) (owner: 10Ecarg) [15:28:54] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:29:03] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:29:06] (03PS15) 10Federico Ceratto: mysql: update replication source [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) [15:29:28] (03CR) 10Clément Goubert: [C:03+2] shellbox-constraints: Bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243851 (owner: 10Clément Goubert) [15:29:43] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: 
apply [15:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1500) [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1530) [15:30:40] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:30:40] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:31:00] (03CR) 10Federico Ceratto: "Both: I updated the CR description with the safety checks. I just found a bug in how pymysql is exposed from spicerack and added dry-run s" [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [15:31:34] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host pki1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:32:18] (03CR) 10CI reject: [V:04-1] tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [15:32:19] (03CR) 10CI reject: [V:04-1] cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243859 (owner: 10Urbanecm) [15:32:33] (03CR) 10Ecarg: [C:03+2] wikifunctions: Turn on custom oTel spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243695 (https://phabricator.wikimedia.org/T417750) (owner: 10Ecarg) [15:32:41] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:33:23] !log kamila@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host 
wikikube-worker2356.codfw.wmnet [15:33:26] !log kamila@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2356.codfw.wmnet [15:33:51] (03PS2) 10Urbanecm: Experiments: introduce IExperimentManager [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243857 (https://phabricator.wikimedia.org/T375198) [15:33:52] (03PS2) 10Urbanecm: Remove PHPDoc blocks that are 100% identical to the code [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243858 [15:33:52] (03PS2) 10Urbanecm: cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243859 [15:33:55] (03PS2) 10Urbanecm: tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 [15:34:01] (03PS1) 10Urbanecm: LevelingUpManager: stop supporting multiple delay specifications [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243868 (https://phabricator.wikimedia.org/T416894) [15:34:07] (03Merged) 10jenkins-bot: cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243853 (owner: 10Urbanecm) [15:34:29] (03CR) 10Urbanecm: [C:03+2] ExperimentManager: remove geForceVariant flag handling [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243855 (https://phabricator.wikimedia.org/T416894) (owner: 10Urbanecm) [15:34:31] (03CR) 10Urbanecm: [C:03+2] SiteNoticeGenerator: stop adding per-variant classes [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243856 (https://phabricator.wikimedia.org/T416894) (owner: 10Urbanecm) [15:34:45] (03CR) 10Urbanecm: [C:03+2] LevelingUpManager: stop supporting multiple delay specifications [extensions/GrowthExperiments] 
(wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243868 (https://phabricator.wikimedia.org/T416894) (owner: 10Urbanecm) [15:36:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2024 to codfw - jhancock@cumin2002" [15:36:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2024 to codfw - jhancock@cumin2002" [15:36:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:28] (03CR) 10Ssingh: "Let's plan on merging this" [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [15:36:33] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2024 [15:36:39] (03CR) 10Ssingh: "@slyngshede@wikimedia.org: this is ready for review, I assume?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [15:36:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2024 [15:38:21] (03CR) 10CI reject: [V:04-1] tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243854 (owner: 10Urbanecm) [15:38:25] (03Merged) 10jenkins-bot: shellbox-constraints: Bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243851 (owner: 10Clément Goubert) [15:38:34] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:38:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:38:47] (03CR) 10Urbanecm: tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243854 (owner: 10Urbanecm) [15:38:49] (03CR) 10Urbanecm: [C:03+2] tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243854 (owner: 10Urbanecm) [15:39:02] (03CR) 10CI reject: [V:04-1] mysql: update replication source [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [15:39:30] (03CR) 10Majavah: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1241003 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:39:34] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: 
Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:39:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2269:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2269 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:40:02] (03Merged) 10jenkins-bot: wikifunctions: Turn on custom oTel spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243695 (https://phabricator.wikimedia.org/T417750) (owner: 10Ecarg) [15:40:13] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1243853|cleanup: Remove bunch of unnecessary code from ReassignMentees]] [15:40:24] (03CR) 10Andrew McAllister (WMDE): [C:03+1] Add helm value to optionally allow egress for airflow-wmde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243841 (https://phabricator.wikimedia.org/T417633) (owner: 10Awight) [15:40:33] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:40:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pki1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:41:02] (03CR) 10Santiago Faci: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240692 (https://phabricator.wikimedia.org/T417717) (owner: 10Santiago Faci) [15:41:05] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host pki1002.eqiad.wmnet with OS trixie [15:41:11] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:41:13] (03CR) 10CI reject: [V:04-1] Test Kitchen UI: Deploy v1.2.3 
release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240692 (https://phabricator.wikimedia.org/T417717) (owner: 10Santiago Faci) [15:41:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T415786)', diff saved to https://phabricator.wikimedia.org/P89024 and previous config saved to /var/cache/conftool/dbconfig/20260225-154116-marostegui.json [15:41:21] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [15:41:32] (03CR) 10JHathaway: [C:03+2] dmarc: add dmarc records for domains which do not send email [dns] - 10https://gerrit.wikimedia.org/r/1243225 (owner: 10JHathaway) [15:41:38] (03Merged) 10jenkins-bot: ExperimentManager: remove geForceVariant flag handling [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243855 (https://phabricator.wikimedia.org/T416894) (owner: 10Urbanecm) [15:41:42] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:41:44] (03Merged) 10jenkins-bot: SiteNoticeGenerator: stop adding per-variant classes [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243856 (https://phabricator.wikimedia.org/T416894) (owner: 10Urbanecm) [15:41:51] !log jhathaway@dns1004 START - running authdns-update [15:41:58] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [15:42:01] (03CR) 10Vgutierrez: [C:04-1] gerrit: add gerrit-replica service to LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [15:42:38] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1243853|cleanup: Remove bunch of unnecessary code from ReassignMentees]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:42:49] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:42:59] !log urbanecm@deploy2002 urbanecm: Continuing with sync [15:43:01] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:43:17] !log jhathaway@dns1004 END - running authdns-update [15:43:31] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [15:43:47] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:44:00] (03PS32) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [15:44:35] (03CR) 10Vgutierrez: [C:03+1] "looks good assuming you'll follow up with another CR to extend the buckets" [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [15:44:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2269:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2269 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:45:09] jhancock@cumin2002 provision (PID 3437589) is awaiting input [15:45:22] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [15:46:07] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [15:46:56] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243853|cleanup: Remove bunch of unnecessary code from ReassignMentees]] (duration: 06m 43s) [15:47:03] (03CR) 10Urbanecm: [C:03+2] Experiments: introduce IExperimentManager [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243857 (https://phabricator.wikimedia.org/T375198) (owner: 
10Urbanecm) [15:48:16] (03PS16) 10Federico Ceratto: mysql: update replication source [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) [15:49:19] (03CR) 10Urbanecm: [C:03+2] Remove PHPDoc blocks that are 100% identical to the code [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243858 (owner: 10Urbanecm) [15:49:29] (03CR) 10Ahmon Dancy: "Fingers crossed!" [puppet] - 10https://gerrit.wikimedia.org/r/1243727 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [15:49:35] (03CR) 10Urbanecm: [C:03+2] cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243859 (owner: 10Urbanecm) [15:53:08] 06SRE, 06Traffic: Image Rate Limiting Issues For Future Audiences Project - https://phabricator.wikimedia.org/T418377#11651280 (10derenrich) here is the request in full > > method: GET > uri: https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/Emanu-elNYjeh.JPG/250px-Emanu-elNYjeh.JPG > compressionSta... 
[15:53:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243868 (https://phabricator.wikimedia.org/T416894) (owner: 10Urbanecm) [15:53:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243854 (owner: 10Urbanecm) [15:53:35] (03Merged) 10jenkins-bot: LevelingUpManager: stop supporting multiple delay specifications [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243868 (https://phabricator.wikimedia.org/T416894) (owner: 10Urbanecm) [15:53:52] (03PS4) 10Santiago Faci: Test Kitchen UI: Deploy v1.2.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240692 (https://phabricator.wikimedia.org/T417717) [15:54:01] (03CR) 10Urbanecm: [V:03+2 C:03+2] tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243854 (owner: 10Urbanecm) [15:54:05] (03CR) 10Scott French: [C:03+1] "Thanks, Luca!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1243727 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [15:54:42] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1243855|ExperimentManager: remove geForceVariant flag handling (T416894)]], [[gerrit:1243856|SiteNoticeGenerator: stop adding per-variant classes (T416894)]], [[gerrit:1243868|LevelingUpManager: stop supporting multiple delay specifications (T416894)]], [[gerrit:1243854|tests: Introduce MentorRemoverTest]] [15:54:47] T416894: Integrate existing AB testable features with TestKitchen - https://phabricator.wikimedia.org/T416894 [15:55:00] (03CR) 10Santiago Faci: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1240692 (https://phabricator.wikimedia.org/T417717) (owner: 10Santiago Faci) [15:55:04] (03CR) 10Scott French: [C:03+1] docker_registry: remove the /test prefix special handling [puppet] - 10https://gerrit.wikimedia.org/r/1243726 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [15:55:30] (03PS1) 10Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) [15:55:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11651291 (10Jhancock.wm) [15:56:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P89025 and previous config saved to /var/cache/conftool/dbconfig/20260225-155624-marostegui.json [15:57:02] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1243855|ExperimentManager: remove geForceVariant flag handling (T416894)]], [[gerrit:1243856|SiteNoticeGenerator: stop adding per-variant classes (T416894)]], [[gerrit:1243868|LevelingUpManager: stop supporting multiple delay specifications (T416894)]], [[gerrit:1243854|tests: Introduce 
MentorRemoverTest]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:57:14] (03CR) 10Urbanecm: [C:03+2] tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [15:57:23] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:57:25] !log urbanecm@deploy2002 urbanecm: Continuing with sync [15:58:00] (03CR) 10Volans: "What bug? Did you open a task for it?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1238368 (https://phabricator.wikimedia.org/T373436) (owner: 10Federico Ceratto) [15:58:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:58:38] (03CR) 10Bking: [C:03+2] "I'm merging this now, because wdqs is hard down and has been for the greater part of this week." [puppet] - 10https://gerrit.wikimedia.org/r/1243698 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [15:58:57] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [16:00:57] (03PS5) 10BCornwall: ncmonitor: Add ncmonitor sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1243258 [16:01:19] (03CR) 10Vgutierrez: [C:04-2] "you don't need an additional LVS service on high-traffic1 for gerrit-replica.wm.o, just add it to ATS backend.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1240294 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [16:01:24] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243855|ExperimentManager: remove geForceVariant flag handling (T416894)]], [[gerrit:1243856|SiteNoticeGenerator: stop adding per-variant classes (T416894)]], [[gerrit:1243868|LevelingUpManager: stop supporting multiple delay specifications 
(T416894)]], [[gerrit:1243854|tests: Introduce MentorRemoverTest]] (duration: 06m 42s) [16:01:29] T416894: Integrate existing AB testable features with TestKitchen - https://phabricator.wikimedia.org/T416894 [16:01:48] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8146/co" [puppet] - 10https://gerrit.wikimedia.org/r/1243258 (owner: 10BCornwall) [16:02:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:03:11] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2095 to codfw - jhancock@cumin2002" [16:03:13] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:03:13] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:03:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2095 to codfw - jhancock@cumin2002" [16:03:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:03:21] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2095 [16:03:23] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2096 [16:03:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:03:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2095 [16:03:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2096 [16:04:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:04:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:04:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2095.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:06:05] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:07:12] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:07:53] (03CR) 10Elukey: [C:03+2] .wmfconfig: remove bullseye build [software/spicerack] - 10https://gerrit.wikimedia.org/r/1237932 (owner: 10Volans) [16:08:02] (03CR) 10Elukey: [C:03+2] Drop support for older Python versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1243848 (owner: 10Volans) [16:08:11] (03Merged) 10jenkins-bot: Experiments: introduce IExperimentManager [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243857 (https://phabricator.wikimedia.org/T375198) (owner: 10Urbanecm) [16:08:11] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:08:13] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:44] !log cgoubert@deploy2002 helmfile 
[staging-eqiad] START helmfile.d/admin 'apply'. [16:08:44] (03PS1) 10Ayounsi: Add support for POP Nokia switches [homer/public] - 10https://gerrit.wikimedia.org/r/1243875 (https://phabricator.wikimedia.org/T408892) [16:08:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243858 (owner: 10Urbanecm) [16:09:09] (03Merged) 10jenkins-bot: Remove PHPDoc blocks that are 100% identical to the code [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243858 (owner: 10Urbanecm) [16:09:10] (03CR) 10CI reject: [V:04-1] cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243859 (owner: 10Urbanecm) [16:09:11] (03CR) 10CI reject: [V:04-1] tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [16:09:31] (03CR) 10Urbanecm: cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243859 (owner: 10Urbanecm) [16:09:31] (03CR) 10Ssingh: [C:03+1] ats: Set secondary nvme drives for new codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1243184 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [16:09:34] (03CR) 10Urbanecm: [C:03+2] cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243859 (owner: 10Urbanecm) [16:09:40] (03CR) 10Urbanecm: tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [16:09:44] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1243857|Experiments: introduce IExperimentManager (T375198 
T415536)]], [[gerrit:1243858|Remove PHPDoc blocks that are 100% identical to the code]] [16:09:49] (03CR) 10Urbanecm: [C:03+2] tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [16:09:49] T375198: Fully adopt TestKitchen for experiment enrollment - https://phabricator.wikimedia.org/T375198 [16:09:50] T415536: Allow running multiple experiments in GrowthExperiments at the same time - https://phabricator.wikimedia.org/T415536 [16:09:52] jhancock@cumin2002 provision (PID 3451051) is awaiting input [16:10:40] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:10:55] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:11:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P89026 and previous config saved to /var/cache/conftool/dbconfig/20260225-161132-marostegui.json [16:12:03] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1243857|Experiments: introduce IExperimentManager (T375198 T415536)]], [[gerrit:1243858|Remove PHPDoc blocks that are 100% identical to the code]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:12:32] !log urbanecm@deploy2002 urbanecm: Continuing with sync [16:12:39] (03CR) 10BCornwall: [V:03+1 C:03+2] ats: Set secondary nvme drives for new codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1243184 (https://phabricator.wikimedia.org/T401832) (owner: 10BCornwall) [16:12:45] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:12:57] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:14:31] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[16:15:37] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:15:37] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:15:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:15:45] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:15:45] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host pki1002.eqiad.wmnet with OS trixie [16:15:55] (03Merged) 10jenkins-bot: Drop support for older Python versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1243848 (owner: 10Volans) [16:15:55] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:15:58] (03PS2) 10Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) [16:16:28] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243857|Experiments: introduce IExperimentManager (T375198 T415536)]], [[gerrit:1243858|Remove PHPDoc blocks that are 100% identical to the code]] (duration: 06m 44s) [16:16:34] T375198: Fully adopt TestKitchen for experiment enrollment - https://phabricator.wikimedia.org/T375198 [16:16:34] T415536: Allow running multiple experiments in GrowthExperiments at the same time - https://phabricator.wikimedia.org/T415536 [16:17:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243859 (owner: 10Urbanecm) [16:17:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" 
[extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [16:17:46] (03PS2) 10Elukey: Apply role to pki1002 [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) [16:17:46] (03PS1) 10Elukey: profile::installserver: move pki1002 to UEFI in preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1243876 [16:17:48] jhancock@cumin2002 provision (PID 3451051) is awaiting input [16:18:23] jouncebot nowandnext [16:18:23] No deployments scheduled for the next 1 hour(s) and 41 minute(s) [16:18:23] In 1 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1800) [16:19:36] urandom: I had deployments to do on admin_ng so part of your linked-artifacts namespaces are deployed [16:19:37] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - gerritlb6_443: Servers cp1100.eqiad.wmnet, cp1104.eqiad.wmnet, cp1106.eqiad.wmnet, cp1108.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1104.eqiad.wmnet, cp1106.eqiad.wmnet, cp1108.eqiad.wmnet, cp1112.eqiad.wmnet, cp1114.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1100.eqiad.wmnet, cp1106.eqiad.wmnet, cp1108.eqiad.w [16:19:37] 1112.eqiad.wmnet, cp1110.eqiad.wmnet are marked down but pooled: gerritlb_443: Servers cp1106.eqiad.wmnet, cp1108.eqiad.wmnet, cp1112.eqiad.wmnet, cp1110.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:19:39] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - gerritlb6_443: Servers cp1108.eqiad.wmnet, cp1112.eqiad.wmnet, cp1106.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1100.eqiad.wmnet, cp1104.eqiad.wmnet, cp1106.eqiad.wmnet, cp1108.eqiad.wmnet, cp1112.eqiad.wmnet, cp1110.eqiad.wmnet, cp1114.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1100.eqiad.wmnet, 
cp1104.eqiad.w [16:19:39] 1106.eqiad.wmnet, cp1108.eqiad.wmnet, cp1112.eqiad.wmnet are marked down but pooled: gerritlb_443: Servers cp1100.eqiad.wmnet, cp1104.eqiad.wmnet, cp1108.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:19:41] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1104 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:19:43] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - gerritlb6_443: Servers cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2037.codfw.wmnet, cp2031.codfw.wmnet are marked down but pooled: textlb_443: Servers cp2035.codfw.wmnet, cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2037.codfw.wmnet, cp2031.codfw.wmnet, cp2027.codfw.wmnet are marked down but pooled: textlb6_443: Servers cp2035.codfw.wmnet, cp2039.codfw.w [16:19:43] 2029.codfw.wmnet, cp2037.codfw.wmnet, cp2031.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled: wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled: gerritlb_443: Servers cp2035.codfw.wmnet, cp2029.codfw.wmnet, cp2037.codfw.wmnet, cp2027.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:19:43] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - CRITICAL - gerritlb6_443: Servers cp2033.codfw.wmnet, cp2027.codfw.wmnet, cp2039.codfw.wmnet, cp2031.codfw.wmnet are marked down but pooled: textlb_443: Servers cp2035.codfw.wmnet, cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2027.codfw.wmnet, cp2031.codfw.wmnet, cp2041.codfw.wmnet, cp2033.codfw.wmnet, cp2037.codfw.wmnet are marked down but pooled: textlb6_443: Se [16:19:43] 2035.codfw.wmnet, cp2039.codfw.wmnet, cp2041.codfw.wmnet, cp2031.codfw.wmnet, cp2027.codfw.wmnet, cp2037.codfw.wmnet are marked down but pooled: gerritlb_443: 
Servers cp2035.codfw.wmnet, cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2037.codfw.wmnet, cp2031.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:19:44] oh boy [16:19:45] 503 when visiting any Wikimedia site [16:19:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3071 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:19:55] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp5023 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:19:57] FIRING: [11x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:59] wtf [16:20:10] yeehaw [16:20:15] !ack [16:20:15] 7480 (ACKED) [11x] ProbeDown sre (probes/service) [16:20:27] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:20:29] (03CR) 10CI reject: [V:04-1] profile::installserver: move pki1002 to UEFI in preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1243876 (owner: 10Elukey) [16:20:31] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1100 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:20:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:42] (03CR) 10CI reject: [V:04-1] Apply role to pki1002 [puppet] - 
10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [16:20:57] "503 Service Unavailable [16:20:57] No server is available to handle this request. " [16:20:59] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1108 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:20:59] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp1108 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:21:05] ShakespeareFan00: we are aware and handling it [16:21:07] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp5023 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:21:09] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:21:13] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:21:22] !ack [16:21:23] 7482 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [16:21:27] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp5024 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:21:30] FIRING: [16x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 5 unhealthy realservers pooled on lvs5004:3003 - 
https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:21:40] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1243876 (owner: 10Elukey) [16:21:49] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1104 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:21:51] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3071 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:21:57] @urbanecm: Thanks, i would have expected a "Wikimedia" style "Busy" pages rather than a generic 503 though.. [16:22:03] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp5024 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:22:09] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp1108 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:22:17] PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/Debmonitor [16:22:17] PROBLEM - Debmonitor Health Check Expiry on debmonitor.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection. 
https://wikitech.wikimedia.org/wiki/Debmonitor [16:22:17] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp3073 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:22:17] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp3069 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:22:17] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3073 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:22:18] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3066 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:22:21] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp3070 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:22:23] Welp. 
I was about to shout out about doing a backport to fix a bug in test kitchen [16:22:38] It can wait D: [16:22:43] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1100 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:22:47] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3069 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:22:47] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3072 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:22:47] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp5024 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:22:48] phuedx: definitely [16:22:59] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:23:03] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp5018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:23:03] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp5019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:23:19] RECOVERY - Debmonitor Health Check Expiry on debmonitor.wikimedia.org is OK: OK - Certificate *.wikipedia.org will expire on Thu 07 May 2026 09:41:31 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Debmonitor [16:23:19] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3069 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:23:21] RECOVERY - Debmonitor Health Check on debmonitor.wikimedia.org is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 1340 bytes in 4.595 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [16:23:22] FIRING: [4x] JobUnavailable: Reduced availability for job probes/swagger in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:23:40] (03Merged) 10jenkins-bot: cleanup: Remove bunch of unnecessary code from ReassignMentees [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243859 (owner: 10Urbanecm) [16:23:41] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp1112 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:23:41] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:23:49] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:23:50] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:23:51] FIRING: [6x] SwaggerProbeHasFailures: Not all 
openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:24:07] @urbancem : I was also having problems viewing https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1239274 [16:24:09] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp5017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:24:10] previously [16:24:19] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3066 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:24:19] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp3066 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:24:19] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3069 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:24:21] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3067 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:24:21] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp3070 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:24:31] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp5017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:24:43] FIRING: [7x] 
JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:24:43] ShakespeareFan00: also known [16:24:43] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp1112 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:24:47] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3069 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:24:57] FIRING: [18x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:58] FIRING: [6x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:59] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp1108 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:25:08] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:25:17] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:25:19] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3073 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires 
in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:25:19] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp3073 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:25:19] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp3072 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:25:20] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2095.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:25:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5018 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:25:31] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5024 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:25:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2096.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:25:39] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp1102 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:25:41] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp1108 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:25:45] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp3068 is CRITICAL: SSL CRITICAL - 
failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:25:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:25:47] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp3067 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:25:49] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp7003 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:25:51] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp7002 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:25:57] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:25:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:26:01] I'm guessing no, but does anybody need any help from product eng? 
I can stage if it would be useful [16:26:01] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp1106 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:26:07] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp5018 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:26:07] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1108 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:26:13] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp5017 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:26:29] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp3066 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:26:30] FIRING: [32x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 6 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:26:31] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1100 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:26:31] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp5021 is CRITICAL: SSL CRITICAL - 
failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:26:31] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp1108 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:26:41] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp1106 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:26:41] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1104 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:26:44] (03CR) 10Urbanecm: tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [16:26:45] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp5017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:26:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T415786)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260225-162641-marostegui.json [16:26:47] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:26:50] (03CR) 10CI reject: [V:04-1] tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [16:26:51] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp7008 is CRITICAL: SSL CRITICAL - failed 
to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:26:51] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp1102 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:26:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1262.eqiad.wmnet with reason: Maintenance [16:26:53] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5021 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:26:55] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp3067 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:26:59] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5023 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:26:59] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3068 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:27:03] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:27:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1262 (T415786)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260225-162659-marostegui.json [16:27:09] RECOVERY - 
HAProxy HTTPS wikipedia25.org ECDSA on cp1106 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:27:11] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5020 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:27:15] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp5021 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:27:17] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3072 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:27:19] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3070 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:27:37] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp5019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:27:41] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp1112 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:27:50] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1104 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:27:50] 
RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp7008 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:27:51] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp7005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:27:51] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp7005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:27:51] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp7006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:27:51] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp7003 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-04-18 10:56:56 +0000 (expires in 51 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:28:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5024 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:28:05] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:28:07] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5023 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:28:07] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp5021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection 
reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:28:21] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp3073 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:28:23] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3073 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:28:27] PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [16:28:27] PROBLEM - Debmonitor Health Check Expiry on debmonitor.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [16:28:27] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp3069 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:28:43] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:28:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:28:45] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1100 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:28:47] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp3069 is CRITICAL: Return code of 141 is out of bounds https://wikitech.wikimedia.org/wiki/HTTPS [16:28:51] FIRING: [7x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned 
healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:28:51] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp7005 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:28:57] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp1106 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:28:57] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp1112 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:28:59] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp7006 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:01] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5019 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:07] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp5017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:29:21] @urbancem: It wasn't 503 previously, typically it was a "Not Authorised" error previously. 
[16:29:21] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3066 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:23] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3070 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:23] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp3072 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:23] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3072 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:23] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3066 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:25] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [16:29:27] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp5019 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3067 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) 
https://wikitech.wikimedia.org/wiki/HTTPS [16:29:27] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3073 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:35] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp5017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:29:37] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp5024 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:37] jhancock@cumin2002 provision (PID 3460256) is awaiting input [16:29:41] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5017 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:43] FIRING: [8x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:29:47] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:49] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp1108 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:49] RECOVERY - 
HAProxy HTTPS wikipedia25.org ECDSA on cp7005 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:57] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp3069 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:29:57] FIRING: [19x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:29:57] FIRING: [6x] ProbeDown: Service text:80 has failed probes (http_text_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:29:59] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1108 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:30:01] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp5024 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [16:30:01] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp5020 is CRITICAL: Return code of 141 is out of bounds https://wikitech.wikimedia.org/wiki/HTTPS [16:30:07] !ack [16:30:09] 7485 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out) [16:30:17] RECOVERY - Debmonitor Health Check Expiry on debmonitor.wikimedia.org is OK: OK - Certificate *.wikipedia.org will expire on Thu 07 May 2026 09:41:31 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Debmonitor [16:30:17] RECOVERY - Debmonitor Health Check on debmonitor.wikimedia.org is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 1340 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [16:30:19] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp3073 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:30:27] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp5017 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:30:36] "Secure Connection Failed [16:30:36] An error occurred during a connection to en.wikipedia.org. PR_END_OF_FILE_ERROR [16:30:36] Error code: PR_END_OF_FILE_ERROR" [16:30:37] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:30:39] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:30:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:30:45] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:30:47] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp3072 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:30:50] ShakespeareFan00: yeah thanks, known. we are working on it. 
[16:30:55] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5023 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:30:59] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp5024 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:30:59] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:30:59] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1108 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:31:01] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp5020 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:31:05] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5024 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:31:15] sukhe: Just providing additional information .. 
[16:31:21] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp7008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:31:30] FIRING: [32x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 8 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:31:43] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:31:44] (I'm based in the United Kingdom, and typically my Wiki experience is via the Amsterdam hub) [16:31:45] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:31:45] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:31:45] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:31:53] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp5021 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:31:53] (03PS1) 10Fabfur: cache:haproxy: temporary fix [puppet] - 
10https://gerrit.wikimedia.org/r/1243882 [16:31:55] ^ ignore [16:31:57] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp5018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:32:01] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp5017 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:32:07] (03PS1) 10CDanis: haproxy: dont log silentdrop [puppet] - 10https://gerrit.wikimedia.org/r/1243883 [16:32:21] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp7008 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-04-18 10:56:56 +0000 (expires in 51 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:32:25] (03PS1) 10Giuseppe Lavagetto: haproxy: stop current abuse [puppet] - 10https://gerrit.wikimedia.org/r/1243884 [16:32:33] ShakespeareFan00: please move such reports and chatter to #wikimedia-tech [16:32:44] (03CR) 10JHathaway: [C:03+1] cache:haproxy: temporary fix [puppet] - 10https://gerrit.wikimedia.org/r/1243882 (owner: 10Fabfur) [16:32:46] Will do so thanks for the hint .. 
[16:32:55] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5023 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:32:57] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp5018 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:32:59] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5018 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:33:01] PROBLEM - HAProxy HTTPS wikipedia25.org ECDSA on cp5021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:33:22] FIRING: [9x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:25] (03CR) 10Ladsgroup: [C:03+1] haproxy: dont log silentdrop [puppet] - 10https://gerrit.wikimedia.org/r/1243883 (owner: 10CDanis) [16:33:35] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp5021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:0A000126:SSL routines::unexpected eof while reading https://wikitech.wikimedia.org/wiki/HTTPS [16:33:44] 06SRE, 06Commons, 06Infrastructure-Foundations, 10netops, and 2 others: 503 Service Unavailable No server is available to handle this request. 
- https://phabricator.wikimedia.org/T418392#11651446 (10AlexisJazz) [16:33:51] FIRING: [8x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:33:55] (03CR) 10CDanis: [C:03+1] cache:haproxy: temporary fix [puppet] - 10https://gerrit.wikimedia.org/r/1243882 (owner: 10Fabfur) [16:33:59] (03CR) 10CDanis: [C:03+2] cache:haproxy: temporary fix [puppet] - 10https://gerrit.wikimedia.org/r/1243882 (owner: 10Fabfur) [16:34:01] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp5021 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-04-07 07:52:16 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:34:22] (03CR) 10CDanis: [V:03+2 C:03+2] haproxy: dont log silentdrop [puppet] - 10https://gerrit.wikimedia.org/r/1243883 (owner: 10CDanis) [16:34:26] (03CR) 10Scott French: [C:03+1] cache:haproxy: temporary fix [puppet] - 10https://gerrit.wikimedia.org/r/1243882 (owner: 10Fabfur) [16:34:35] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp5021 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:34:42] FIRING: [9x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:57] FIRING: [18x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:34:58] FIRING: [17x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:35:00] (03CR) 10Ladsgroup: [C:03+1] "surrogate +1 from joe" [puppet] - 10https://gerrit.wikimedia.org/r/1243882 (owner: 10Fabfur) [16:35:01] !ack [16:35:02] no value provided for parameter incident and no default available [16:35:02] All incidents are already acked. [16:35:21] (03PS1) 10Federico Ceratto: dborch1003: temporarily disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1243879 (https://phabricator.wikimedia.org/T317179) [16:35:36] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:35:40] !log dancy@deploy2002 Installing scap version "4.242.0" for 2 host(s) [16:36:05] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:36:19] (03CR) 10Marostegui: [C:03+1] dborch1003: temporarily disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1243879 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [16:36:30] FIRING: [22x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 6 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:36:45] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:36:47] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:37:31] 06SRE, 06Commons, 06Infrastructure-Foundations, 10netops, and 2 others: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651469 (10AlexisJazz) [16:37:34] !log dancy@deploy2002 Installation of scap version "4.242.0" completed for 2 hosts [16:38:22] RESOLVED: [7x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:38:51] RESOLVED: [6x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:39:57] RESOLVED: [18x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:39:58] RESOLVED: [15x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:40:12] 06SRE, 06Commons, 06Infrastructure-Foundations, 10netops, and 2 others: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651485 (10Jdforrester-WMF) p:05Triage→03Unbreak! 
[16:40:29] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:40:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242416 (https://phabricator.wikimedia.org/T414476) (owner: 10Itamar Givon) [16:40:58] 06SRE, 06Commons, 06Infrastructure-Foundations, 10netops, and 2 others: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651487 (10JaydenKieran) Can confirm has been affecting en.wikipedia.org and mediawiki.org too, though seems more stabl... [16:41:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:41:30] RESOLVED: [19x] LibericaUnhealthyRealserverPooled: Liberica service gerrit-httpslb6_443 has 6 unhealthy realservers pooled on lvs3008:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:41:47] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:41:47] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:42:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1243876 (owner: 10Elukey) [16:42:20] (03PS1) 10Marostegui: data.yaml: Add 
ssh-key for the bkup token. [puppet] - 10https://gerrit.wikimedia.org/r/1243889 [16:42:56] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 07Wikimedia-production-error: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651504 (10AlexisJazz) [16:43:37] (03CR) 10Federico Ceratto: [C:03+2] dborch1003: temporarily disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1243879 (https://phabricator.wikimedia.org/T317179) (owner: 10Federico Ceratto) [16:44:19] (03PS1) 10Muehlenhoff: Switch pki1002 to nftables variant of insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1243890 (https://phabricator.wikimedia.org/T416664) [16:44:58] (03CR) 10Elukey: [C:03+2] Switch pki1002 to nftables variant of insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1243890 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [16:45:13] (03CR) 10Elukey: [C:03+2] profile::installserver: move pki1002 to UEFI in preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1243876 (owner: 10Elukey) [16:47:45] (03CR) 10Vgutierrez: [C:04-1] cache::haproxy: save x-wmf-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [16:49:47] (03PS1) 10Phuedx: JS SDK: Fix instrument_name field handling [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243892 [16:49:47] (03CR) 10Bking: [C:03+2] wdqs: Enable deadlock auto-remediation for codfw [puppet] - 10https://gerrit.wikimedia.org/r/1243699 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [16:52:15] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 07Wikimedia-production-error: 503 Service Unavailable No server is available to handle this request. 
- https://phabricator.wikimedia.org/T418392#11651544 (10Nemoralis) https://www.wikimediastatus.net/incidents/dgdcls8b0ybt [16:52:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:53:00] (03CR) 10Marostegui: Apply role to pki1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243839 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [16:53:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 07Wikimedia-production-error: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651549 (10AlexisJazz) There was also a 5 minute spike in 50x errors at 14:15. Also between 15:30 and... [16:54:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:56:58] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, and 2 others: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11651556 (10AlexisJazz) [16:58:12] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 07Wikimedia-Incident: 503 Service Unavailable No server is available to handle this request. 
- https://phabricator.wikimedia.org/T418392#11651560 (10AlexisJazz) [17:00:14] (03PS1) 10Jgreen: nsca_frack_cfg.erb remove frqueue2002 and add frqueue2004 [puppet] - 10https://gerrit.wikimedia.org/r/1243894 (https://phabricator.wikimedia.org/T418393) [17:00:25] 10SRE-SLO, 06Abstract Wikipedia team, 06ServiceOps new: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11651565 (10elukey) I tried to create some queries for this specific issue in the Thanos UI: https://w.wiki/Hzuu Ther... [17:00:55] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host pki1002.eqiad.wmnet with OS trixie [17:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [17:01:04] (03CR) 10CI reject: [V:04-1] nsca_frack_cfg.erb remove frqueue2002 and add frqueue2004 [puppet] - 10https://gerrit.wikimedia.org/r/1243894 (https://phabricator.wikimedia.org/T418393) (owner: 10Jgreen) [17:01:57] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:02:09] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [17:02:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:03:01] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[17:06:41] (03PS1) 10Clément Goubert: shellbox-constraints: fix staging ram values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243895 [17:07:45] (03PS3) 10Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) [17:08:13] (03PS3) 10Aqu: Bump Blunderbuss image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243221 (https://phabricator.wikimedia.org/T415874) [17:08:20] (03CR) 10Fabfur: cache::haproxy: save x-wmf-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [17:08:41] (03CR) 10Btullis: [C:03+2] Bump Blunderbuss image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243221 (https://phabricator.wikimedia.org/T415874) (owner: 10Aqu) [17:10:11] (03CR) 10Clément Goubert: [C:03+2] shellbox-constraints: fix staging ram values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243895 (owner: 10Clément Goubert) [17:10:27] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:10:40] (03Merged) 10jenkins-bot: Bump Blunderbuss image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243221 (https://phabricator.wikimedia.org/T415874) (owner: 10Aqu) [17:12:10] (03Merged) 10jenkins-bot: shellbox-constraints: fix staging ram values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243895 (owner: 10Clément Goubert) [17:12:20] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:12:31] !log btullis@cumin1003 START - Cookbook sre.hosts.dhcp for host dse-k8s-worker1026.eqiad.wmnet [17:12:35] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[17:12:39] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [17:12:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host dse-k8s-worker1026.eqiad.wmnet [17:12:56] !log btullis@cumin1003 START - Cookbook sre.hosts.dhcp for host dse-k8s-worker1026.eqiad.wmnet [17:13:00] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [17:13:05] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [17:13:22] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:13:39] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [17:14:43] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:15:18] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:15:22] (03PS10) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [17:15:22] (03PS1) 10Tiziano Fogli: prometheus/resource_config: add resource_title param [puppet] - 10https://gerrit.wikimedia.org/r/1243898 (https://phabricator.wikimedia.org/T412924) [17:15:24] (03PS1) 10Tiziano Fogli: prometheus/ops: monitor thanos store instances 
with resouce_config [puppet] - 10https://gerrit.wikimedia.org/r/1243899 (https://phabricator.wikimedia.org/T412924) [17:15:56] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:15:59] btullis@cumin1003 dhcp (PID 1821375) is awaiting input [17:16:50] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pki1002.eqiad.wmnet with reason: host reimage [17:17:32] (03CR) 10Vgutierrez: cache::haproxy: save x-wmf-ratelimit-class content for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [17:18:12] !log jgreen@cumin1003 START - Cookbook sre.dns.netbox [17:18:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:19:00] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11651679 (10Ahoelzl) Approved. [17:19:13] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 13Patch-For-Review: Requesting access to Superset for mikez - https://phabricator.wikimedia.org/T418098#11651681 (10Ahoelzl) [17:20:29] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Package Confluent Platform 7.5.x / Kafka 3.5 - https://phabricator.wikimedia.org/T416670#11651690 (10elukey) Next steps: * Test https://gerrit.wikimedia.org/r/1239135 in Pontoon, flipping the confluent packages back and forth between Kafka 1.1 and 3.5.... 
[17:21:14] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki1002.eqiad.wmnet with reason: host reimage [17:22:05] !log jgreen@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frqueue2002.frack.codfw.wmnet from DNS for decommissioning - jgreen@cumin1003" [17:22:10] !log jgreen@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frqueue2002.frack.codfw.wmnet from DNS for decommissioning - jgreen@cumin1003" [17:22:10] !log jgreen@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:22:42] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops, 13Patch-For-Review: decommission frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T418393#11651700 (10Jgreen) a:05Jgreen→03None [17:23:51] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:24:48] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:24:57] (03PS2) 10Jgreen: nsca_frack_cfg.erb remove frqueue2002 and add frqueue2004 [puppet] - 10https://gerrit.wikimedia.org/r/1243894 (https://phabricator.wikimedia.org/T418393) [17:25:13] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:25:47] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:26:36] (03PS1) 10Clément Goubert: admin_ng: fix shellbox-constraints resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243904 [17:26:45] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:26:52] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:26:55] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[17:27:23] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:27:33] !log Deployment of the multi-instance Thanos Store Gateway patches for T412924: Initial groundwork completed [17:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:38] T412924: Multi-instance thanos store gateway - https://phabricator.wikimedia.org/T412924 [17:28:17] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:28:22] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:ge-0/0/0 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:28:30] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:28:47] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [17:28:50] (03CR) 10Ottomata: "Just checking that we want to add a new field? I didn't read the whole slack thread, but I wonder if it would be simpler to add this into " [puppet] - 10https://gerrit.wikimedia.org/r/1243870 (https://phabricator.wikimedia.org/T417864) (owner: 10Fabfur) [17:29:06] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:29:09] (03CR) 10BCornwall: [V:03+1] ncmonitor: Add ncmonitor sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1243258 (owner: 10BCornwall) [17:29:28] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [17:29:38] !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. 
[17:29:53] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11651729 (10Ottomata) [17:29:58] !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:30:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11651733 (10Ottomata) [17:30:30] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [17:31:27] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: sync [17:31:38] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: sync [17:32:16] !log aqu@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:32:20] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:33:34] !log aqu@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:35:41] (03CR) 10Clément Goubert: [C:03+2] admin_ng: fix shellbox-constraints resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243904 (owner: 10Clément Goubert) [17:37:11] (03CR) 10Dzahn: [C:03+2] admin: add rsilvola to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1243196 (https://phabricator.wikimedia.org/T418004) (owner: 10Dzahn) [17:38:01] jouncebot nowandnext [17:38:01] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [17:38:01] In 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1800) [17:38:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki1002.eqiad.wmnet with OS trixie [17:40:11] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, and 2 others: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11651799 (10Dzahn) @Rsilvola You have been added to the `deployment` shell user group and... 
[17:40:50] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 06Release-Engineering-Team, and 2 others: Request membership in deployment (and wmf-deployment group) for Rsilvola - https://phabricator.wikimedia.org/T418004#11651802 (10Dzahn) 05Open→03Resolved a:03Dzahn [17:41:28] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11651807 (10Ottomata) Approved! [17:41:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Requesting access to analytics-platform-eng-admins for milimetric - https://phabricator.wikimedia.org/T417906#11651810 (10Ottomata) > maybe all analytics-admins should have admin in all airflow admin groups... [17:42:16] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11651815 (10BTullis) @ayounsi directed me to this ticket after reading: {T418398} I believe that this is also preventing the reimaging of: * `dse-k8s-worker1026` on `lsw1-c2-eqiad` * `dse... 
[17:43:15] (03Merged) 10jenkins-bot: admin_ng: fix shellbox-constraints resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243904 (owner: 10Clément Goubert) [17:43:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:44:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:44:58] (03CR) 10Dzahn: [C:03+2] gerrit: cleanup Hiera and tests after gerrit2 renaming [puppet] - 10https://gerrit.wikimedia.org/r/1243187 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [17:45:09] (03PS1) 10Fabfur: Revert "haproxy: dont log silentdrop" [puppet] - 10https://gerrit.wikimedia.org/r/1243909 [17:45:30] (03CR) 10Hashar: "Kubernetes is not part of the stack, so I am confused." [puppet] - 10https://gerrit.wikimedia.org/r/1243187 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [17:47:22] (03CR) 10Dzahn: [C:03+2] "Yea, guys, this is why I did not touch that. It's the Gerrit-internal user." 
[puppet] - 10https://gerrit.wikimedia.org/r/1243187 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [17:48:22] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:ge-0/0/0 () - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:49:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:51:17] (03PS4) 10Scott French: P:cache::haproxy: add moat-scope requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1243905 [17:51:17] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243905 (owner: 10Scott French) [17:51:40] (03PS1) 10Giuseppe Lavagetto: moat mode [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1243914 [17:52:03] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] moat mode [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1243914 (owner: 10Giuseppe Lavagetto) [17:52:43] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "haproxy moat mode - oblivian@cumin1003" [17:52:46] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: haproxy moat mode - oblivian@cumin1003 [17:53:28] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [17:53:39] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: haproxy moat mode - oblivian@cumin1003 [17:53:41] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "haproxy moat mode - oblivian@cumin1003" [17:57:34] (03CR) 10Hashar: [C:03+1] "Yup that was confusing and I got caught cause I did not bother looking at the rest of manifest to check where that `profile::gerrit::servi" [puppet] - 10https://gerrit.wikimedia.org/r/1243187 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [18:00:05] swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1800). [18:00:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:02:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:03:10] (03CR) 10Bking: [C:03+2] wdqs: Enable deadlock auto-remediation for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1243700 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [18:03:17] (03PS1) 10Phuedx: JS SDK: Added `Instrument#submitClick` for backwards compatibility [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243922 [18:05:48] (03CR) 10Ssingh: [C:03+1] Revert "haproxy: dont log silentdrop" [puppet] - 10https://gerrit.wikimedia.org/r/1243909 (owner: 10Fabfur) [18:07:32] (03CR) 10Jsn.sherman: [C:03+1] Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1240672 
(https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [18:08:29] (03CR) 10Fabfur: [C:03+2] Revert "haproxy: dont log silentdrop" [puppet] - 10https://gerrit.wikimedia.org/r/1243909 (owner: 10Fabfur) [18:09:11] o/ [18:09:31] I'll be deferring the work originally planned for this infra window to another day [18:10:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243892 (owner: 10Phuedx) [18:10:52] (03CR) 10Dzahn: [C:03+2] trafficserver: add map for status.wikimedia.org to miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1240416 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [18:10:59] (03PS2) 10Dzahn: trafficserver: add map for status.wikimedia.org to miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1240416 (https://phabricator.wikimedia.org/T414098) [18:11:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243922 (owner: 10Phuedx) [18:11:34] (03CR) 10Dzahn: "no effect until a DNS change in the future" [puppet] - 10https://gerrit.wikimedia.org/r/1240416 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [18:12:19] (03CR) 10Dzahn: [C:03+2] trafficserver: add map for status.wikimedia.org to miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1240416 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [18:13:27] (03CR) 10Dzahn: [C:03+2] backup: adjust gerrit file set after renaming of gerrit2 [puppet] - 10https://gerrit.wikimedia.org/r/1243183 (https://phabricator.wikimedia.org/T417247) (owner: 10Dzahn) [18:14:37] (03CR) 10Dzahn: [C:03+2] admin: rename gerrit system user [puppet] - 
10https://gerrit.wikimedia.org/r/1243188 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [18:15:48] (03CR) 10Atieno: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242613 (https://phabricator.wikimedia.org/T414470) (owner: 10Aaron Schulz) [18:18:50] (03CR) 10Dzahn: [C:03+2] httpbb/miscweb: add test for status.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1240420 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [18:18:55] (03PS2) 10Dzahn: httpbb/miscweb: add test for status.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1240420 (https://phabricator.wikimedia.org/T414098) [18:22:44] (03PS1) 10DDesouza: Deploy Comparative Reader Research survey on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243929 (https://phabricator.wikimedia.org/T417834) [18:23:42] (03PS1) 10DDesouza: Deploy Comparative Reader Research survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243930 (https://phabricator.wikimedia.org/T417829) [18:26:20] jouncebot: nowandnext [18:26:20] For the next 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1800) [18:26:20] In 0 hour(s) and 33 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1900) [18:28:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [18:28:44] Deployment shellbox-main in shellbox-constraints at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=shellbox-constraints&var-deployment=shellbox-main - ... 
[18:28:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [18:29:12] (03CR) 10Dzahn: [C:03+2] httpbb/miscweb: add test for status.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1240420 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [18:29:29] (03PS2) 10Dzahn: httpbb/miscweb: add tests for wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1240421 (https://phabricator.wikimedia.org/T408592) [18:29:31] (03PS33) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [18:30:21] (03CR) 10CI reject: [V:04-1] prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [18:30:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:31:58] (03PS34) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [18:32:35] (03CR) 10CI reject: [V:04-1] prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [18:32:49] (03CR) 10Herron: [C:03+1] prometheus/ops: monitor thanos store instances with resouce_config [puppet] - 10https://gerrit.wikimedia.org/r/1243899 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [18:33:07] (03CR) 10Herron: [C:03+1] prometheus/resource_config: add resource_title param [puppet] - 10https://gerrit.wikimedia.org/r/1243898 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [18:33:26] (03PS35) 10CDobbins: prometheus: add pooled host check [puppet] - 
10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [18:33:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment shellbox-main in shellbox-constraints at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [18:34:09] (03CR) 10CI reject: [V:04-1] prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [18:35:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:58] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1243859|cleanup: Remove bunch of unnecessary code from ReassignMentees]] [18:39:10] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1243859|cleanup: Remove bunch of unnecessary code from ReassignMentees]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:40:07] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:40:07] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:40:26] !log urbanecm@deploy2002 urbanecm: Continuing with sync [18:40:43] (03PS36) 10CDobbins: prometheus: add pooled host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [18:41:07] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:41:07] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:42:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:42:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:44:24] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243859|cleanup: Remove bunch of unnecessary code from ReassignMentees]] (duration: 07m 26s) [18:45:43] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [18:46:12] (03CR) 10Dzahn: [C:03+2] Revert^2 "Gerrit: Disable auto reloading replication config" [puppet] - 10https://gerrit.wikimedia.org/r/1238043 (https://phabricator.wikimedia.org/T416929) (owner: 10Hashar) [18:47:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:47:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:48:27] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1243889 (owner: 10Marostegui) [18:48:56] (03PS1) 10Scott French: Fix scoping logic for haproxy DSL [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1243934 [18:49:21] 07Puppet, 06collaboration-services, 10Gerrit, 13Patch-For-Review: Gerrit git replication should not break when Puppet changes its config - https://phabricator.wikimedia.org/T416929#11652066 (10Dzahn) > The short fix is to disable configuration autoreloading in the replication plugin. This config change ha... [18:49:54] (03CR) 10Scott French: [V:03+2] "Tested locally at `abe29497c0276ebe4f5ab583dbf5efd09771a5c2`." 
[software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1243934 (owner: 10Scott French) [18:49:57] (03CR) 10Scott French: [V:03+2 C:03+2] Fix scoping logic for haproxy DSL [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1243934 (owner: 10Scott French) [18:51:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:52:48] (03CR) 10Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [18:53:14] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy: Fix scoping logic for haproxy DSL - swfrench@cumin2002" [18:53:16] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Fix scoping logic for haproxy DSL - swfrench@cumin2002 [18:53:17] (03CR) 10CDobbins: [V:03+1] prometheus: add pooled host check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [18:54:05] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Fix scoping logic for haproxy DSL - swfrench@cumin2002 [18:54:06] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy: Fix scoping logic for haproxy DSL - swfrench@cumin2002" [18:59:36] (03CR) 10CDanis: [C:03+1] P:cache::haproxy: add moat-scope requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1243905 (owner: 10Scott French) [18:59:51] (03PS1) 10CDanis: hieradata: pilot use_etcd_moat_scope [puppet] - 10https://gerrit.wikimedia.org/r/1243918 (owner: 10Scott French) 
[18:59:55] (03CR) 10CDanis: [C:03+1] hieradata: pilot use_etcd_moat_scope [puppet] - 10https://gerrit.wikimedia.org/r/1243918 (owner: 10Scott French) [19:00:05] dduvall and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T1900). [19:00:34] o/ [19:01:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:04:51] (03CR) 10Scott French: [C:03+2] P:cache::haproxy: add moat-scope requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1243905 (owner: 10Scott French) [19:05:13] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243939 (https://phabricator.wikimedia.org/T413808) [19:05:15] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243939 (https://phabricator.wikimedia.org/T413808) (owner: 10TrainBranchBot) [19:06:08] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243939 (https://phabricator.wikimedia.org/T413808) (owner: 10TrainBranchBot) [19:06:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:09:15] (03CR) 10Scott French: [C:03+2] "Thanks, Chris!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1243918 (owner: 10Scott French) [19:12:10] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.17 refs T413808 [19:12:14] T413808: 1.46.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T413808 [19:20:29] (03PS1) 10Scott French: hieradata: enable use_etcd_moat_scope globally [puppet] - 10https://gerrit.wikimedia.org/r/1243944 [19:20:52] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243944 (owner: 10Scott French) [19:23:46] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [19:26:29] (03CR) 10CDanis: [C:03+1] hieradata: enable use_etcd_moat_scope globally [puppet] - 10https://gerrit.wikimedia.org/r/1243944 (owner: 10Scott French) [19:28:12] (03CR) 10Scott French: [C:03+2] hieradata: enable use_etcd_moat_scope globally [puppet] - 10https://gerrit.wikimedia.org/r/1243944 (owner: 10Scott French) [19:31:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:32:33] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 07Wikimedia-Incident: 503 Service Unavailable No server is available to handle this request. - https://phabricator.wikimedia.org/T418392#11652162 (10ssingh) This should now be resolved but leaving to the task author to mark this as "Resolved". We... 
[19:33:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:43:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:44:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:45:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:49:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:55:05] (03PS1) 10CDanis: haproxy: earlier drop [puppet] - 10https://gerrit.wikimedia.org/r/1243949 [19:55:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:57:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:00:33] 10ops-esams, 10ops-magru, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 31 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411 (10wiki_willy) 
03NEW [20:03:49] (03PS2) 10CDanis: P:cache::haproxy: Revert temporary fix [puppet] - 10https://gerrit.wikimedia.org/r/1243947 (owner: 10Scott French) [20:07:14] (03CR) 10Scott French: [C:03+1] haproxy: earlier drop [puppet] - 10https://gerrit.wikimedia.org/r/1243949 (owner: 10CDanis) [20:08:37] (03CR) 10Scott French: [C:03+2] P:cache::haproxy: Revert temporary fix [puppet] - 10https://gerrit.wikimedia.org/r/1243947 (owner: 10Scott French) [20:10:22] cdobbins@cumin2002 reimage (PID 3572558) is awaiting input [20:11:07] (03Abandoned) 10Hashar: gerrit: bump Jetty threads [puppet] - 10https://gerrit.wikimedia.org/r/1239872 (https://phabricator.wikimedia.org/T417536) (owner: 10Arnaudb) [20:22:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:24:15] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS trixie [20:25:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:25:22] 10ops-esams, 10ops-magru, 06DC-Ops: Data Required for Energy Efficiency Directive: Due March 31 for DRMRS & May 15 for ESAMS - https://phabricator.wikimedia.org/T418411#11652251 (10ssingh) Thanks @wiki_willy. @BCornwall and @CDobbins will work on it from Traffic on the server and storage capacity. 
[20:25:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:28:14] (03PS1) 10Xcollazo: Clean up list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/1243954 (https://phabricator.wikimedia.org/T415193) [20:30:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:31:44] (03PS1) 10Bking: superset: Disallow scheduling on 1Gbps hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243955 (https://phabricator.wikimedia.org/T418412) [20:36:03] (03CR) 10Bking: [C:04-1] "Throws an error when I try a helmdir deploy:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243955 (https://phabricator.wikimedia.org/T418412) (owner: 10Bking) [20:36:13] (03PS2) 10CDanis: haproxy: earlier drop [puppet] - 10https://gerrit.wikimedia.org/r/1243949 [20:36:16] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243949 (owner: 10CDanis) [20:42:57] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2043.codfw.wmnet with reason: host reimage [20:46:18] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 07Wikimedia-Incident: 503 Service Unavailable No server is available to handle this request. 
- https://phabricator.wikimedia.org/T418392#11652313 (10Aklapper) [20:46:56] (03CR) 10CDanis: [C:03+2] haproxy: earlier drop [puppet] - 10https://gerrit.wikimedia.org/r/1243949 (owner: 10CDanis) [20:46:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243817 (https://phabricator.wikimedia.org/T418089) (owner: 10Anzx) [20:47:16] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [20:47:22] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [20:48:09] (03CR) 10Hashar: "You can't just duplicate the upstream Debian and override it, it does not make sense to me. Specially given Apache2 on Debian already has:" [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [20:48:16] (03CR) 10Hashar: [C:04-1] gerrit: adapt httpd config to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1240197 (https://phabricator.wikimedia.org/T417998) (owner: 10Arnaudb) [20:48:36] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2043.codfw.wmnet with reason: host reimage [20:49:24] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [20:49:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [20:50:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:52:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:54:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:57:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T2100). [21:00:05] cjming and anzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:17] o/ [21:01:04] FIRING: PuppetDisabled: Puppet disabled on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=relforge&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [21:01:07] o/ [21:01:13] i can self-deploy [21:01:21] anzx: do you need a deployer? 
[21:01:31] yes [21:02:28] alrighty - i'll actually start with yours then since it should be quick [21:03:36] cjming: please run maintenance script to empty usergroup https://www.irccloud.com/pastebin/F4xXCHgN [21:03:52] will do - thanks for pasting script [21:04:02] (03PS4) 10Anzx: zhwiki: remove accountcreator usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243817 (https://phabricator.wikimedia.org/T418089) [21:04:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:05:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243817 (https://phabricator.wikimedia.org/T418089) (owner: 10Anzx) [21:06:45] PROBLEM - jenkins_service_running on releases2003 is CRITICAL: PROCS CRITICAL: 2 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [21:06:49] (03Merged) 10jenkins-bot: zhwiki: remove accountcreator usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243817 (https://phabricator.wikimedia.org/T418089) (owner: 10Anzx) [21:07:20] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1243817|zhwiki: remove accountcreator usergroup (T418089)]] [21:07:25] T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089 [21:07:45] RECOVERY - jenkins_service_running on releases2003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar 
/usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [21:09:12] (03PS3) 10Bking: superset: Disallow scheduling on 1Gbps hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243955 (https://phabricator.wikimedia.org/T418412) [21:09:12] (03CR) 10Bking: "Removing my -1. I had to double-set affinity (see change comment) but it appears to work now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243955 (https://phabricator.wikimedia.org/T418412) (owner: 10Bking) [21:09:24] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2043.codfw.wmnet with OS trixie [21:09:36] !log cjming@deploy2002 cjming, anzx: Backport for [[gerrit:1243817|zhwiki: remove accountcreator usergroup (T418089)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:10:01] anzx: testable? good to sync? [21:10:09] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [21:10:53] cjming: looks ok good to sync [21:10:58] !log cjming@deploy2002 cjming, anzx: Continuing with sync [21:11:09] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [21:11:43] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp2044.codfw.wmnet with OS trixie [21:12:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:14:10] (03PS4) 10Ryan Kemper: superset: Disallow scheduling on 1Gbps hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243955 (https://phabricator.wikimedia.org/T418412) (owner: 10Bking) [21:14:54] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243817|zhwiki: remove accountcreator usergroup (T418089)]] (duration: 07m 34s) [21:14:59] T418089: Remove "accountcreator" and allow 
"event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089 [21:15:32] (03CR) 10Ryan Kemper: [C:03+1] "looks good. my only nit was missing newline at eof, which i added" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243955 (https://phabricator.wikimedia.org/T418412) (owner: 10Bking) [21:16:10] !log cjming@deploy2002 mwscript-k8s job started: emptyUserGroup zhwiki accountcreator '--log-reason=[[phab:T418089]]' # T418089 [21:16:29] anzx: should be live - ran script [21:16:46] ok [21:17:26] moving on to next patches in queue [21:18:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243892 (owner: 10Phuedx) [21:19:39] (03Merged) 10jenkins-bot: JS SDK: Fix instrument_name field handling [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243892 (owner: 10Phuedx) [21:19:43] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:20:11] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1243892|JS SDK: Fix instrument_name field handling]] [21:21:52] jouncebot: nowandnext [21:21:52] For the next 0 hour(s) and 38 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T2100) [21:21:52] In 0 hour(s) and 38 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T2200) [21:22:19] i'm running a scap backport now and have one more to go [21:22:20] !log cjming@deploy2002 phuedx, cjming: Backport for [[gerrit:1243892|JS SDK: Fix instrument_name field handling]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:22:21] cjming: Can you ping me when done? [21:22:37] Dreamy_Jazz: sure thing [21:22:41] Thanks [21:22:58] !log cjming@deploy2002 phuedx, cjming: Continuing with sync [21:23:22] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:23:29] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2096.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:25:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:25:31] cjming: Dreamy_Jazz i have one more follow up patch to submit can i add to calendar [21:25:48] anzx: sure - feel free to add [21:25:55] Sure, my change is to private code (so it's not on the calendar) [21:26:04] (03PS3) 10Anzx: zhwiki: remove accountcreator usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243961 (https://phabricator.wikimedia.org/T418089) [21:26:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243961 (https://phabricator.wikimedia.org/T418089) (owner: 10Anzx) [21:26:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2023.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:26:52] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243892|JS SDK: Fix instrument_name field handling]] (duration: 
06m 41s) [21:26:55] cjming: added https://gerrit.wikimedia.org/r/c/1243961/ [21:27:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:27:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243922 (owner: 10Phuedx) [21:28:13] (03CR) 10Herron: [C:03+2] "confirmed via a meet" [puppet] - 10https://gerrit.wikimedia.org/r/1242411 (owner: 10Cwhite) [21:30:02] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2044.codfw.wmnet with reason: host reimage [21:32:06] (03CR) 10BPirkle: [C:04-1] "-1 for visibility, but I just had a question, not an objection" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1242613 (https://phabricator.wikimedia.org/T414470) (owner: 10Aaron Schulz) [21:32:37] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415 (10DTotten-WMF) 03NEW [21:34:00] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dani Totten - https://phabricator.wikimedia.org/T418415#11652411 (10Milimetric) approved! Welcome to data! 
[21:34:03] (03PS2) 10Xcollazo: Clean up list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/1243954 (https://phabricator.wikimedia.org/T415193) [21:34:25] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243954 (https://phabricator.wikimedia.org/T415193) (owner: 10Xcollazo) [21:34:27] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:34:31] (03Merged) 10jenkins-bot: JS SDK: Added `Instrument#submitClick` for backwards compatibility [extensions/TestKitchen] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243922 (owner: 10Phuedx) [21:35:02] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1243922|JS SDK: Added `Instrument#submitClick` for backwards compatibility]] [21:35:28] anzx: ack - when Dreamy_Jazz is done, I'll deploy your 2nd patch [21:35:33] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2044.codfw.wmnet with reason: host reimage [21:35:49] Thanks. I'll start now then [21:36:00] Dreamy_Jazz: wait! [21:36:02] (03CR) 10Herron: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [21:36:03] still finishing up [21:36:14] Oh, apologies [21:36:19] should be done soon tho [21:36:20] Tbf scap would have stopped me anyway :D [21:36:29] Misread your message to say it was my turn [21:37:13] !log cjming@deploy2002 cjming, phuedx: Backport for [[gerrit:1243922|JS SDK: Added `Instrument#submitClick` for backwards compatibility]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [21:37:38] !log cjming@deploy2002 cjming, phuedx: Continuing with sync [21:39:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2023.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:40:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:40:44] (03PS1) 10Kosta Harlan: GetSecurityLogContextHandler: Add IP reputation country code [extensions/IPReputation] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243965 (https://phabricator.wikimedia.org/T415354) [21:41:02] (03PS1) 10Kosta Harlan: GetSecurityLogContextHandler: Add IP reputation country code [extensions/IPReputation] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243966 (https://phabricator.wikimedia.org/T415354) [21:41:31] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243922|JS SDK: Added `Instrument#submitClick` for backwards compatibility]] (duration: 06m 28s) [21:41:51] Dreamy_Jazz: all yours - can you ping me when you're done? [21:41:57] Yes. 
Will do [21:42:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2023.codfw.wmnet with OS bullseye [21:42:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11652422 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-fe2023.codfw.wmnet with OS bullseye [21:42:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/IPReputation] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243965 (https://phabricator.wikimedia.org/T415354) (owner: 10Kosta Harlan) [21:42:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2024.codfw.wmnet with OS bullseye [21:42:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11652423 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-fe2024.codfw.wmnet with OS bullseye [21:42:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/IPReputation] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243966 (https://phabricator.wikimedia.org/T415354) (owner: 10Kosta Harlan) [21:42:58] I've added two patches that can go out at the same time [21:43:02] cc Dreamy_Jazz ^ [21:44:28] or rather cc cjming ^ [21:45:43] (03CR) 10Xcollazo: "PPC looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1243954 (https://phabricator.wikimedia.org/T415193) (owner: 10Xcollazo) [21:46:55] Scap started for my changes [21:46:57] kostajh: np - i can take care of them [21:47:17] (no logs will appear as it's private code changes) [21:47:24] ack [21:48:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:49:13] cjming: ty! [21:49:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:50:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:50:56] (03CR) 10Eevans: cassandra: Java 8 no longer supported (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1242473 (https://phabricator.wikimedia.org/T418010) (owner: 10Eevans) [21:50:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:51:01] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. 
[21:51:17] (03CR) 10Bking: [C:03+2] superset: Disallow scheduling on 1Gbps hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243955 (https://phabricator.wikimedia.org/T418412) (owner: 10Bking) [21:52:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:53:19] (03Merged) 10jenkins-bot: superset: Disallow scheduling on 1Gbps hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1243955 (https://phabricator.wikimedia.org/T418412) (owner: 10Bking) [21:53:47] Probably another 5 mins [21:53:59] cool - gtk [21:56:38] !log clouddumps1001/1002: removing 2 old dump files and renaming one for T417824 [21:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:09] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2044.codfw.wmnet with OS trixie [21:58:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2023.codfw.wmnet with reason: host reimage [21:58:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2024.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:59:44] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T2200) [22:00:24] cjming: Feel free to start +2'ing stuff, I am on the final straight (just waiting for it to deploy everywhere after testing it) [22:00:57] np - hopefully it's ok to bleed into the next window [22:00:57] Just to get gate-and-submit-wmf moving :D [22:01:07] Should be AFAIK [22:02:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:03:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2023.codfw.wmnet with reason: host reimage [22:03:29] Annoyingly the private code changes are now causing errors not on the testservers but on non-testservers [22:03:35] I'll need to deploy again after you are done [22:03:52] bummer - so i'm ok to start? [22:03:57] Scap has finished though for me for the time being [22:04:07] got it - ok - continuing [22:04:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243961 (https://phabricator.wikimedia.org/T418089) (owner: 10Anzx) [22:05:28] (03Merged) 10jenkins-bot: zhwiki: remove accountcreator usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243961 (https://phabricator.wikimedia.org/T418089) (owner: 10Anzx) [22:05:57] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1243961|zhwiki: remove accountcreator usergroup (T418089)]] [22:06:01] T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089 [22:06:31] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [22:06:43] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [22:08:12] !log cjming@deploy2002 cjming, anzx: Backport for [[gerrit:1243961|zhwiki: remove accountcreator usergroup (T418089)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:08:25] checking [22:08:53] cjming: ok to sync [22:09:02] !log cjming@deploy2002 cjming, anzx: Continuing with sync [22:10:30] cjming: I'm ready with my private code fix. 
Could I jump the queue before kostajh? [22:10:48] (03PS1) 10Dzahn: zuul::executor: do not use /etc/zookeeper as cert dir [puppet] - 10https://gerrit.wikimedia.org/r/1243984 (https://phabricator.wikimedia.org/T395938) [22:10:52] (that's fine with me) [22:11:18] (03PS1) 10Eevans: csasandra: add new 'linked_artifacts' role (user) [puppet] - 10https://gerrit.wikimedia.org/r/1243985 (https://phabricator.wikimedia.org/T418420) [22:11:19] Dreamy_Jazz: sure - go for it - lmk when you're done [22:11:41] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [22:11:46] (03CR) 10Dzahn: [C:03+2] zuul::executor: do not use /etc/zookeeper as cert dir [puppet] - 10https://gerrit.wikimedia.org/r/1243984 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:12:21] (03CR) 10Dzahn: [C:03+2] "fixing puppet error on systems not yet in production" [puppet] - 10https://gerrit.wikimedia.org/r/1243984 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:12:56] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243961|zhwiki: remove accountcreator usergroup (T418089)]] (duration: 06m 59s) [22:13:01] T418089: Remove "accountcreator" and allow "event-organizer" to add and remove "event participant" in zhwiki - https://phabricator.wikimedia.org/T418089 [22:13:12] cjming: thanks for deploying [22:13:23] Proceeding with my private code changes now [22:13:31] anzx: ur welcome! [22:13:50] anzx: no follow up script needed for your 2nd patch? 
[22:14:32] (03PS1) 10Eevans: Add (phony) password for linked_artifacts Cassandra role [labs/private] - 10https://gerrit.wikimedia.org/r/1243986 (https://phabricator.wikimedia.org/T418420) [22:14:36] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [22:15:24] cjming: no need [22:16:38] 👍 [22:24:51] (03PS1) 10Medelius: Suggestion Mode: add values for suggestion feedback properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1243990 (https://phabricator.wikimedia.org/T401739) [22:26:29] Private code changes are going to take a bit longer [22:26:57] cjming: Do you mind waiting another 5 or so mins [22:27:02] np [22:27:08] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:27:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:27:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2023.codfw.wmnet with OS bullseye [22:27:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11652515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-fe2023.codfw.wmnet with OS bullseye completed: - ms-fe2023 (**WAR... 
[22:28:46] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [22:29:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11652518 (10Jhancock.wm) [22:30:25] (03CR) 10Dzahn: [C:03+1] "the reasoning that the backend timeout must be higher than the frontend makes sense to me and the default of jetty is 30. related to 6 yea" [puppet] - 10https://gerrit.wikimedia.org/r/1241048 (https://phabricator.wikimedia.org/T246763) (owner: 10Hashar) [22:31:28] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [22:31:47] cjming: Go ahead [22:31:57] great thanks! [22:32:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/IPReputation] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243965 (https://phabricator.wikimedia.org/T415354) (owner: 10Kosta Harlan) [22:32:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/IPReputation] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243966 (https://phabricator.wikimedia.org/T415354) (owner: 10Kosta Harlan) [22:33:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment shellbox-main in shellbox-constraints at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [22:34:28] (03Merged) 10jenkins-bot: GetSecurityLogContextHandler: Add IP reputation country code [extensions/IPReputation] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1243965 (https://phabricator.wikimedia.org/T415354) (owner: 10Kosta Harlan) [22:34:30] (03Merged) 10jenkins-bot: GetSecurityLogContextHandler: Add 
IP reputation country code [extensions/IPReputation] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243966 (https://phabricator.wikimedia.org/T415354) (owner: 10Kosta Harlan) [22:35:02] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1243965|GetSecurityLogContextHandler: Add IP reputation country code (T415354)]], [[gerrit:1243966|GetSecurityLogContextHandler: Add IP reputation country code (T415354)]] [22:35:07] T415354: Record CDN/Backend api and IP reputation values in editattemptsblocked schema - https://phabricator.wikimedia.org/T415354 [22:36:48] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host frdb1008 [22:36:51] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host frdb1008 [22:37:07] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [22:37:12] !log cjming@deploy2002 cjming, kharlan: Backport for [[gerrit:1243965|GetSecurityLogContextHandler: Add IP reputation country code (T415354)]], [[gerrit:1243966|GetSecurityLogContextHandler: Add IP reputation country code (T415354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:37:56] kostajh: not sure if you're still around but i'm going to assume it's ok to sync unless you want to test [22:38:10] cjming: I'll look [22:38:20] cool - standing by [22:39:10] cjming: seems fine [22:39:20] !log cjming@deploy2002 cjming, kharlan: Continuing with sync [22:39:51] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:40:25] FIRING: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:43:05] (03PS2) 10Eevans: cassandra: add new 'linked_artifacts' role (user) [puppet] - 10https://gerrit.wikimedia.org/r/1243985 (https://phabricator.wikimedia.org/T418420) [22:43:13] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243965|GetSecurityLogContextHandler: Add IP reputation country code (T415354)]], [[gerrit:1243966|GetSecurityLogContextHandler: Add IP reputation country code (T415354)]] (duration: 08m 11s) [22:43:17] T415354: Record CDN/Backend api and IP reputation values in editattemptsblocked schema - https://phabricator.wikimedia.org/T415354 [22:43:46] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1243985 (https://phabricator.wikimedia.org/T418420) (owner: 10Eevans) [22:45:25] RESOLVED: SystemdUnitFailed: wdqs-blazegraph-deadlock-check.service on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:47:17] (03PS1) 10Scott French: admin_ng: bump shellbox-constraints resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244008 [22:48:24] (03CR) 10Urbanecm: [C:03+2] tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 
10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [22:48:37] !log end of UTC late backport window [22:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260225T2300) [23:00:14] (03Merged) 10jenkins-bot: tests: Introduce MentorRemoverTest [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1243860 (owner: 10Urbanecm) [23:01:01] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1243860|tests: Introduce MentorRemoverTest]] [23:02:56] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe2024.codfw.wmnet with OS bullseye [23:03:10] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11652597 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-fe2024.codfw.wmnet with OS bullseye executed with errors: - ms-fe... [23:03:16] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1243860|tests: Introduce MentorRemoverTest]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:03:26] (03CR) 10Scott French: [C:03+2] admin_ng: bump shellbox-constraints resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244008 (owner: 10Scott French) [23:04:19] !log urbanecm@deploy2002 urbanecm: Continuing with sync [23:08:13] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1243860|tests: Introduce MentorRemoverTest]] (duration: 07m 12s) [23:09:18] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:09:27] (03PS1) 10Urbanecm: SECURITY: ReassignMentees: Handle hidden users correctly [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244011 (https://phabricator.wikimedia.org/T418222) [23:09:53] (03PS1) 10Urbanecm: SECURITY: ReassignMentees: Handle hidden users correctly [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244012 (https://phabricator.wikimedia.org/T418222) [23:11:03] (03Merged) 10jenkins-bot: admin_ng: bump shellbox-constraints resourcequota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1244008 (owner: 10Scott French) [23:11:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244012 (https://phabricator.wikimedia.org/T418222) (owner: 10Urbanecm) [23:11:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244011 (https://phabricator.wikimedia.org/T418222) (owner: 10Urbanecm) [23:14:19] !log swfrench@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [23:15:18] !log swfrench@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[23:15:43] !log swfrench@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [23:17:43] !log swfrench@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [23:18:57] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [23:20:47] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [23:24:18] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:24:37] (03Merged) 10jenkins-bot: SECURITY: ReassignMentees: Handle hidden users correctly [extensions/GrowthExperiments] (wmf/1.46.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1244012 (https://phabricator.wikimedia.org/T418222) (owner: 10Urbanecm) [23:24:43] (03Merged) 10jenkins-bot: SECURITY: ReassignMentees: Handle hidden users correctly [extensions/GrowthExperiments] (wmf/1.46.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1244011 (https://phabricator.wikimedia.org/T418222) (owner: 10Urbanecm) [23:25:16] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1244012|SECURITY: ReassignMentees: Handle hidden users correctly (T418222)]], [[gerrit:1244011|SECURITY: ReassignMentees: Handle hidden users correctly (T418222)]] [23:27:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:27:34] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1244012|SECURITY: ReassignMentees: Handle hidden users correctly (T418222)]], [[gerrit:1244011|SECURITY: ReassignMentees: Handle hidden users correctly (T418222)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:28:21] !log urbanecm@deploy2002 urbanecm: Continuing with sync
[23:32:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:32:18] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244012|SECURITY: ReassignMentees: Handle hidden users correctly (T418222)]], [[gerrit:1244011|SECURITY: ReassignMentees: Handle hidden users correctly (T418222)]] (duration: 07m 01s)
[23:33:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:35:09] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[23:36:04] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[23:37:31] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:38:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment shellbox-main in shellbox-constraints at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[23:39:41] will be gone soon ^
[23:40:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:40:28] mission succeeded, our servers are no longer executing an infinite loop
[23:40:32] (or not this one, at least)
[23:40:59] that sounds good :)
[23:42:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2175:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2175 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:43:36] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11652717 (VRiley-WMF)
[23:44:14] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11652718 (VRiley-WMF) Running into another error with the provisioning script. Working on this.
[23:50:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:52:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2175:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2175 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:53:44] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment shellbox-main in shellbox-constraints at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[23:53:50] \o/