[00:05:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P79687 and previous config saved to /var/cache/conftool/dbconfig/20250723-000516-fceratto.json
[00:05:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P79688 and previous config saved to /var/cache/conftool/dbconfig/20250723-000558-marostegui.json
[00:08:39] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1171741
[00:08:39] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1171741 (owner: TrainBranchBot)
[00:15:34] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye
[00:16:11] (CR) RLazarus: [C:+1] docker: remove bullseye-backports from sources.list [puppet] - https://gerrit.wikimedia.org/r/1171716 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:17:39] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2006-dev.codfw.wmnet with OS bullseye
[00:17:50] (CR) Scott French: "Thanks, Reuven!"
[puppet] - https://gerrit.wikimedia.org/r/1171716 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:18:12] (CR) Scott French: [C:+2] docker: remove bullseye-backports from sources.list [puppet] - https://gerrit.wikimedia.org/r/1171716 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:20:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P79689 and previous config saved to /var/cache/conftool/dbconfig/20250723-002024-fceratto.json
[00:21:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T399249)', diff saved to https://phabricator.wikimedia.org/P79690 and previous config saved to /var/cache/conftool/dbconfig/20250723-002106-marostegui.json
[00:21:11] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[00:21:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2222.codfw.wmnet with reason: Maintenance
[00:21:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T399249)', diff saved to https://phabricator.wikimedia.org/P79691 and previous config saved to /var/cache/conftool/dbconfig/20250723-002129-marostegui.json
[00:31:13] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1171741 (owner: TrainBranchBot)
[00:33:09] !log ran DISTRIBUTIONS="bullseye" build-base-images on build2001 - T383557
[00:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:15] T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557
[00:35:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T399728)', diff saved to https://phabricator.wikimedia.org/P79692 and previous config saved to /var/cache/conftool/dbconfig/20250723-003535-fceratto.json
[00:35:40] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[00:35:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2202.codfw.wmnet with reason: Maintenance
[00:37:15] (PS1) Scott French: php8.1: rebuild to pick up removal of bullseye-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171747 (https://phabricator.wikimedia.org/T383557)
[00:37:15] (CR) Scott French: [V:+2] "Built locally with docker-pkg." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171747 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:37:28] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage
[00:37:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2203.codfw.wmnet with reason: Maintenance
[00:37:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T399728)', diff saved to https://phabricator.wikimedia.org/P79693 and previous config saved to /var/cache/conftool/dbconfig/20250723-003740-fceratto.json
[00:39:13] (CR) RLazarus: [C:+1] php8.1: rebuild to pick up removal of bullseye-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171747 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:40:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T399728)', diff saved to https://phabricator.wikimedia.org/P79694 and previous config saved to /var/cache/conftool/dbconfig/20250723-004014-fceratto.json
[00:41:48] (CR) Scott French: [V:+2] "Thanks, Reuven!"
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1171747 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:41:57] (CR) Scott French: [V:+2 C:+2] php8.1: rebuild to pick up removal of bullseye-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171747 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:43:51] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage
[00:46:15] !log rebuilt php8.1 production images (8.1.33-1-s2) on build2001 - T383557
[00:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:19] T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557
[00:50:40] SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11026451 (Scott_French) Alright, MediaWiki deployments should no longer be at risk: the php8.1 production images have been rebuilt on `docker-registry.dis...
[00:57:38] (CR) Xcollazo: [C:+1] Disable all dumps timers on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/1170410 (https://phabricator.wikimedia.org/T398438) (owner: Btullis)
[01:07:19] (CR) Scott French: "These are now broken, as bullseye-backports has been archived. Thus, it would be good to get this merged soon."
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[01:12:34] SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11026454 (Scott_French)
[01:21:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T399249)', diff saved to https://phabricator.wikimedia.org/P79695 and previous config saved to /var/cache/conftool/dbconfig/20250723-012120-marostegui.json
[01:21:25] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[01:25:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2216.codfw.wmnet with reason: Maintenance
[01:26:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T399728)', diff saved to https://phabricator.wikimedia.org/P79696 and previous config saved to /var/cache/conftool/dbconfig/20250723-012559-fceratto.json
[01:26:05] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[01:28:43] andrew@cumin1003 reimage (PID 3006722) is awaiting input
[01:29:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T399728)', diff saved to https://phabricator.wikimedia.org/P79697 and previous config saved to /var/cache/conftool/dbconfig/20250723-012944-fceratto.json
[01:30:07] jouncebot: nowandnext
[01:30:07] No deployments scheduled for the next 4 hour(s) and 29 minute(s)
[01:30:08] In 4 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T0600)
[01:31:01] FYI, I'm going to start a noop deployment to pick up new php8.1 production images
[01:32:21] !log swfrench@deploy1003 Started scap sync-world: Test deployment
to verify new php8.1 images - T383557
[01:32:26] T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557
[01:32:34] andrew@cumin1003 reimage (PID 3006722) is awaiting input
[01:36:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P79698 and previous config saved to /var/cache/conftool/dbconfig/20250723-013627-marostegui.json
[01:39:42] (CR) Novem Linguae: zhwiki: Allow local securepoll setup (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: Stang)
[01:44:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P79699 and previous config saved to /var/cache/conftool/dbconfig/20250723-014451-fceratto.json
[01:51:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P79700 and previous config saved to /var/cache/conftool/dbconfig/20250723-015135-marostegui.json
[02:00:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P79701 and previous config saved to /var/cache/conftool/dbconfig/20250723-015959-fceratto.json
[02:04:33] !log swfrench@deploy1003 Finished scap sync-world: Test deployment to verify new php8.1 images - T383557 (duration: 34m 39s)
[02:04:37] T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557
[02:06:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T399249)', diff saved to https://phabricator.wikimedia.org/P79702 and previous config saved to /var/cache/conftool/dbconfig/20250723-020643-marostegui.json
[02:06:48] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[02:09:31] SRE,
Infrastructure-Foundations, serviceops, Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11026493 (Scott_French)
[02:15:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T399728)', diff saved to https://phabricator.wikimedia.org/P79703 and previous config saved to /var/cache/conftool/dbconfig/20250723-021507-fceratto.json
[02:15:12] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[02:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:59:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:00:30] PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 153301 MB (4% inode=99%): /var/lib/hadoop/data/h 152125 MB (4% inode=99%): /var/lib/hadoop/data/b 163605 MB (4% inode=99%): /var/lib/hadoop/data/k 147931 MB (3% inode=99%): /var/lib/hadoop/data/m 148715 MB (3% inode=99%): /var/lib/hadoop/data/f 168500 MB (4% inode=99%): /var/lib/hadoop/data/j 155246 MB (4% inode=99%): /var/lib/hadoop/data
[03:00:30] 6 MB (4% inode=99%): /var/lib/hadoop/data/l 161132 MB (4% inode=99%): /var/lib/hadoop/data/i 146731 MB (3% inode=99%): /var/lib/hadoop/data/g 158114 MB (4% inode=99%): /var/lib/hadoop/data/c 142579 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops
[03:04:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:06:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:10:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:11:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[03:16:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater -
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[03:20:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:24:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[03:24:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[04:32:14] SRE, Traffic, affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11026576 (Audiodude) Thanks again @Scott_French for the extremely helpful analysis! I plan to submit a PR to mwclient to update the docs for that method to indicate whi...
[04:43:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:48:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:51:10] FIRING: BFDdown: BFD session down between cr2-eqiad and fe80::ee38:73ff:fee7:bc68 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:56:10] RESOLVED: BFDdown: BFD session down between cr2-eqiad and fe80::ee38:73ff:fee7:bc68 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:58:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:03:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T0600)
[06:11:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:20:30] PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 159656 MB (4% inode=99%): /var/lib/hadoop/data/h 155636 MB (4% inode=99%): /var/lib/hadoop/data/b 155086 MB (4% inode=99%): /var/lib/hadoop/data/k 160693 MB (4% inode=99%): /var/lib/hadoop/data/m 156577 MB (4% inode=99%): /var/lib/hadoop/data/f 158585 MB (4% inode=99%): /var/lib/hadoop/data/j 153752 MB (4% inode=99%): /var/lib/hadoop/data
[06:20:30] 1 MB (3% inode=99%): /var/lib/hadoop/data/l 152637 MB (4% inode=99%): /var/lib/hadoop/data/i 157805 MB (4% inode=99%): /var/lib/hadoop/data/g 154449 MB (4% inode=99%): /var/lib/hadoop/data/c 156183 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops
[06:39:22] (CR) Filippo Giunchedi: [C:+1] role::titan: install promtool [puppet] - https://gerrit.wikimedia.org/r/1171591 (https://phabricator.wikimedia.org/T349521) (owner: Herron)
[06:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire -
https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:00:05] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:06:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:08] (CR) Cyndywikime: [C:+1] Growth: enable new way of refreshing LinkRecommendations for more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1164287 (https://phabricator.wikimedia.org/T386250) (owner: Michael Große)
[07:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:18:39] (CR) Volans: "replies inline" [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[07:19:29] !log mvernon@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on aqs1012.eqiad.wmnet with reason: wait for eevans
[07:24:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:24:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[07:46:54] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2151.codfw.wmnet with reason: Maintenance
[07:47:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T399728)', diff saved to https://phabricator.wikimedia.org/P79704 and previous config saved to /var/cache/conftool/dbconfig/20250723-074700-fceratto.json
[07:47:06] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[07:49:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T399728)', diff saved to https://phabricator.wikimedia.org/P79705 and previous config saved to /var/cache/conftool/dbconfig/20250723-074945-fceratto.json
[07:51:27] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[07:51:44] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:04:49] (CR) Elukey: "Thanks for the ping! I noticed:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[08:04:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P79706 and previous config saved to /var/cache/conftool/dbconfig/20250723-080453-fceratto.json
[08:08:22] (CR) C.
Scott Ananian: "My plan was to merge this first, and then let the other one ride the train after we were certain everything had settled down." [mediawiki-config] - https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: C. Scott Ananian)
[08:08:32] (PS1) Clément Goubert: thumbor: Lower thumbor_workers, more memory [deployment-charts] - https://gerrit.wikimedia.org/r/1171982 (https://phabricator.wikimedia.org/T392348)
[08:15:41] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:15:55] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:17:50] (CR) Elukey: [C:+1] role::titan: install promtool [puppet] - https://gerrit.wikimedia.org/r/1171591 (https://phabricator.wikimedia.org/T349521) (owner: Herron)
[08:20:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P79707 and previous config saved to /var/cache/conftool/dbconfig/20250723-082000-fceratto.json
[08:20:13] (PS1) Clément Goubert: wmnet: Remove maintenance.eqiad.wmnet record [dns] - https://gerrit.wikimedia.org/r/1171983 (https://phabricator.wikimedia.org/T397017)
[08:27:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1170760 (https://phabricator.wikimedia.org/T399736) (owner: DDesouza)
[08:35:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T399728)', diff saved to https://phabricator.wikimedia.org/P79708 and previous config saved to /var/cache/conftool/dbconfig/20250723-083508-fceratto.json
[08:35:14] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis -
https://phabricator.wikimedia.org/T399728
[08:35:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2158.codfw.wmnet with reason: Maintenance
[08:35:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T399728)', diff saved to https://phabricator.wikimedia.org/P79709 and previous config saved to /var/cache/conftool/dbconfig/20250723-083531-fceratto.json
[08:38:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T399728)', diff saved to https://phabricator.wikimedia.org/P79710 and previous config saved to /var/cache/conftool/dbconfig/20250723-083814-fceratto.json
[08:40:01] (PS1) Elukey: eventrouter: update Build-Depends to golang 1.19 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171985
[08:43:32] (CR) Elukey: "https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1171985" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[08:44:03] (CR) Clément Goubert: [C:+1] eventrouter: update Build-Depends to golang 1.19 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171985 (owner: Elukey)
[08:45:23] (CR) Elukey: [V:+2 C:+2] eventrouter: update Build-Depends to golang 1.19 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171985 (owner: Elukey)
[08:46:12] (CR) Elukey: [V:+2 C:+2] Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[08:46:20] (CR) Elukey: Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[08:47:07] (PS3) Elukey: Remove golang-1.17 and golang-1.18 images
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[08:53:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P79711 and previous config saved to /var/cache/conftool/dbconfig/20250723-085321-fceratto.json
[09:06:47] SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238 (Joe) NEW
[09:08:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P79712 and previous config saved to /var/cache/conftool/dbconfig/20250723-090829-fceratto.json
[09:14:49] seen !log drain cr2-codfw of traffic to execute juniper commands to resolve stats issue T400205
[09:14:50] T400205: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205
[09:15:48] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org
[09:20:09] (PS1) Gkyziridis: ml-services: Configure autoscaling for edit-check model.
[deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162)
[09:23:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T399728)', diff saved to https://phabricator.wikimedia.org/P79714 and previous config saved to /var/cache/conftool/dbconfig/20250723-092336-fceratto.json
[09:23:42] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[09:23:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:23:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2169.codfw.wmnet with reason: Maintenance
[09:24:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T399728)', diff saved to https://phabricator.wikimedia.org/P79715 and previous config saved to /var/cache/conftool/dbconfig/20250723-092359-fceratto.json
[09:26:39] FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:26:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T399728)', diff saved to https://phabricator.wikimedia.org/P79716 and previous config saved to /var/cache/conftool/dbconfig/20250723-092641-fceratto.json
[09:27:55] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:28:01] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:29:43] (CR) Hnowlan: [C:+1] thumbor: Lower thumbor_workers, more memory [deployment-charts] - https://gerrit.wikimedia.org/r/1171982 (https://phabricator.wikimedia.org/T392348) (owner: Clément Goubert)
[09:29:44] (PS2) Gkyziridis: ml-services: Configure autoscaling for edit-check model. [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162)
[09:31:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:33:44] (CR) Arnaudb: [C:+2] Gitlab: switchover between gitlab-replica-a and gitlab-replica-b [puppet] - https://gerrit.wikimedia.org/r/1171539 (https://phabricator.wikimedia.org/T400121) (owner: Arnaudb)
[09:35:11] (CR) Gkyziridis: ml-services: Configure autoscaling for edit-check model. (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[09:37:43] (PS11) Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211)
[09:38:30] (PS1) Hnowlan: thumbor: change haproxy load balancing algorithm to leastconn [deployment-charts] - https://gerrit.wikimedia.org/r/1171996 (https://phabricator.wikimedia.org/T392348)
[09:38:35] (PS3) Gkyziridis: ml-services: Configure autoscaling for edit-check model.
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) [09:40:22] (03CR) 10Clément Goubert: [C:03+2] thumbor: Lower thumbor_workers, more memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171982 (https://phabricator.wikimedia.org/T392348) (owner: 10Clément Goubert) [09:41:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P79717 and previous config saved to /var/cache/conftool/dbconfig/20250723-094149-fceratto.json [09:46:38] jouncebot: nowandnext [09:46:38] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [09:46:38] In 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1000) [09:47:45] (03Merged) 10jenkins-bot: thumbor: Lower thumbor_workers, more memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171982 (https://phabricator.wikimedia.org/T392348) (owner: 10Clément Goubert) [09:49:08] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [09:52:35] (03PS1) 10Elukey: redfish: improve is_uefi for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) [09:54:14] (03PS1) 10Ayounsi: k8s: replace legacy codfw vlans with future legacy eqiad vlans [puppet] - 10https://gerrit.wikimedia.org/r/1172001 [09:54:49] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [09:56:42] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:56:42] (03CR) 10Elukey: "Just realized that we are missing the test for dell, adding it." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [09:56:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P79718 and previous config saved to /var/cache/conftool/dbconfig/20250723-095656-fceratto.json [09:59:02] (03CR) 10Elukey: "Correction, we already have it, all good :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [09:59:43] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [09:59:53] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1000) [10:01:02] (03CR) 10Elukey: [C:03+1] "LGTM, but let's wait somebody from ServiceOps to confirm!" [puppet] - 10https://gerrit.wikimedia.org/r/1172001 (owner: 10Ayounsi) [10:01:13] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:01:46] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[10:01:47] FIRING: HelmReleaseBadStatus: Helm release thumbor/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:01:55] yeah that's me, on it [10:01:57] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [10:06:47] RESOLVED: HelmReleaseBadStatus: Helm release thumbor/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:10:20] (03PS1) 10Ayounsi: BGPPeers nodeSelector: remove old codfw rows, add future eqiad pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172004 (https://phabricator.wikimedia.org/T333948) [10:11:20] (03PS2) 10Ayounsi: BGPPeers nodeSelector: remove old codfw rows, add future eqiad pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172004 (https://phabricator.wikimedia.org/T333948) [10:11:20] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [10:12:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T399728)', diff saved to https://phabricator.wikimedia.org/P79719 and previous config saved to /var/cache/conftool/dbconfig/20250723-101204-fceratto.json [10:12:09] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:12:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2180.codfw.wmnet with reason: Maintenance 
[10:12:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T399728)', diff saved to https://phabricator.wikimedia.org/P79720 and previous config saved to /var/cache/conftool/dbconfig/20250723-101226-fceratto.json [10:13:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T399728)', diff saved to https://phabricator.wikimedia.org/P79721 and previous config saved to /var/cache/conftool/dbconfig/20250723-101358-fceratto.json [10:16:04] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:17:41] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [10:21:20] (03CR) 10Arnaudb: [C:03+2] Gitlab: switchover between gitlab-replica-a and gitlab-replica-b [dns] - 10https://gerrit.wikimedia.org/r/1171537 (https://phabricator.wikimedia.org/T400121) (owner: 10Arnaudb) [10:21:49] !log arnaudb@dns1004 START - running authdns-update [10:22:25] arnaudb@cumin1003 failover (PID 3057135) is awaiting input [10:22:30] (03PS13) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [10:23:09] !log arnaudb@dns1004 END - running authdns-update [10:23:28] !log arnaudb@cumin1003 START - Cookbook sre.dns.wipe-cache 'https://gitlab-replica-a.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors [10:23:32] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'https://gitlab-replica-a.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors [10:24:04] 06SRE, 06Infrastructure-Foundations, 10netops: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205#11027210 (10cmooney) Ok so I drained cr2-codfw of traffic and tried issuing the commands. Commands as supplied by Juniper aren't 100% correct either which is reassuring when medd... 
[10:24:28] (03PS1) 10Jcrespo: mariadb: Upgrade db1171 backup source MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1172008 (https://phabricator.wikimedia.org/T399955) [10:24:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [10:24:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [10:25:22] (03CR) 10Ilias Sarantopoulos: "I'd suggest to decouple the admin_ng changes from the edit_check changes in separate patches as they refer to different deployments" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: 10Gkyziridis) [10:26:39] 07Puppet, 06SRE, 10Beta-Cluster-Infrastructure: Puppet configures kernel.core_pattern |/usr/lib/systemd/systemd-coredump, but systemd-coredump is not installed - https://phabricator.wikimedia.org/T400247 (10Lucas_Werkmeister_WMDE) 03NEW [10:27:50] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.failover (exit_code=0) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org [10:28:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:29:00] (03PS4) 10Gkyziridis: ml-services: Configure autoscaling for edit-check model. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) [10:29:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P79722 and previous config saved to /var/cache/conftool/dbconfig/20250723-102905-fceratto.json [10:29:21] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: upgrade mariadb [10:29:31] 06SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11027254 (10Vgutierrez) > For now, we might also want to check for a mw session token instead. Please correct me if I’m wrong, but in this case, validation is just a matter of checking whether the token is present or not.... [10:30:54] (03PS1) 10Lucas Werkmeister (WMDE): systemd::coredump: Install systemd-coredump iff enabled [puppet] - 10https://gerrit.wikimedia.org/r/1172010 (https://phabricator.wikimedia.org/T400247) [10:31:32] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, probably best to get someone more familiar with it to check too but it's simple enough." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1172004 (https://phabricator.wikimedia.org/T333948) (owner: 10Ayounsi) [10:33:22] (03PS1) 10Kevin Bazira: ml-services: update RRLA and RRML images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) [10:34:25] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade db1171 backup source MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1172008 (https://phabricator.wikimedia.org/T399955) (owner: 10Jcrespo) [10:34:48] (03PS14) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [10:35:01] (03CR) 10Hnowlan: [C:03+2] thumbor: change haproxy load balancing algorithm to leastconn [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171996 (https://phabricator.wikimedia.org/T392348) (owner: 10Hnowlan) [10:36:19] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:36:21] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:36:41] (03Merged) 10jenkins-bot: thumbor: change haproxy load balancing algorithm to leastconn [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171996 (https://phabricator.wikimedia.org/T392348) (owner: 10Hnowlan) [10:37:19] (03PS15) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [10:37:37] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:38:45] (03PS1) 10Jcrespo: mariadb: Upgrade db2198 backup source MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1172013 
(https://phabricator.wikimedia.org/T399955) [10:38:55] (03PS1) 10Volans: redfish: improve iDRAC 10 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172014 [10:39:05] (03CR) 10Lucas Werkmeister (WMDE): "Perhaps we were still using Debian 8 (Jessie) when this puppet class was first written? If I’m reading the Debian archives correctly, the " [puppet] - 10https://gerrit.wikimedia.org/r/1172010 (https://phabricator.wikimedia.org/T400247) (owner: 10Lucas Werkmeister (WMDE)) [10:39:59] (03PS1) 10Jelto: gitlab failover: improve message for API token [cookbooks] - 10https://gerrit.wikimedia.org/r/1172015 (https://phabricator.wikimedia.org/T400121) [10:40:03] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [10:40:10] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:41:04] (03CR) 10Elukey: [C:03+1] redfish: improve iDRAC 10 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172014 (owner: 10Volans) [10:42:59] (03CR) 10Vgutierrez: hcaptcha::proxy: use mtail for nginx- metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [10:43:02] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [10:44:04] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: upgrade mariadb [10:44:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P79723 and previous config saved to /var/cache/conftool/dbconfig/20250723-104412-fceratto.json [10:45:49] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [10:47:39] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:49:13] !log hnowlan@deploy1003 
helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:49:24] (03CR) 10Clément Goubert: [C:03+1] "LGTM, bare metal wikikube in codfw is completely migrated to the new switches. Maybe needs a check for the other clusters?" [puppet] - 10https://gerrit.wikimedia.org/r/1172001 (owner: 10Ayounsi) [10:51:15] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1171.eqiad.wmnet [10:51:16] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1171.eqiad.wmnet [10:51:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:53:22] (03PS1) 10Máté Szabó: Enable wgWikimediaEventsCreateAccountInstrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172016 (https://phabricator.wikimedia.org/T394744) [10:54:01] (03CR) 10Bartosz Wójtowicz: ml-services: update RRLA and RRML images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [10:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:56:23] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [10:56:56] seen !log un-drain cr2-codfw of traffic after executing juniper commands to resolve stats issue T400205 [10:56:57] T400205: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205 [10:57:35] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade db2198 backup source MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1172013 (https://phabricator.wikimedia.org/T399955) (owner: 10Jcrespo) [10:57:39] (03CR) 10Kevin 
Bazira: ml-services: update RRLA and RRML images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [10:58:51] (03CR) 10Clément Goubert: [C:03+1] "LGTM, the "pod" naming, while apparently standard in networking (I read the task!), could get a little confusing wrt to kubernetes, but si" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172004 (https://phabricator.wikimedia.org/T333948) (owner: 10Ayounsi) [10:59:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T399728)', diff saved to https://phabricator.wikimedia.org/P79724 and previous config saved to /var/cache/conftool/dbconfig/20250723-105919-fceratto.json [10:59:24] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:59:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2193.codfw.wmnet with reason: Maintenance [10:59:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T399728)', diff saved to https://phabricator.wikimedia.org/P79725 and previous config saved to /var/cache/conftool/dbconfig/20250723-105941-fceratto.json [11:00:05] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1100). 
[11:02:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T399728)', diff saved to https://phabricator.wikimedia.org/P79726 and previous config saved to /var/cache/conftool/dbconfig/20250723-110217-fceratto.json [11:02:42] (03PS16) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [11:03:11] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:04:50] (03PS17) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [11:04:50] (03CR) 10Lucas Werkmeister (WMDE): "> Perhaps we were still using Debian 8 (Jessie) when this puppet class was first written?" [puppet] - 10https://gerrit.wikimedia.org/r/1172010 (https://phabricator.wikimedia.org/T400247) (owner: 10Lucas Werkmeister (WMDE)) [11:06:29] (03PS1) 10Majavah: team-wmcs: neutron: Stop using min_over_time [alerts] - 10https://gerrit.wikimedia.org/r/1172018 (https://phabricator.wikimedia.org/T399705) [11:06:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:07:09] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2198.codfw.wmnet [11:07:10] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2198.codfw.wmnet [11:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:13:24] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:14:06] (03PS1) 10Jcrespo: dbbackups: Upgrade dbprov1006 and dbprov2006 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1172021 (https://phabricator.wikimedia.org/T394487) [11:14:46] (03CR) 10Jcrespo: [C:03+2] dbbackups: Upgrade dbprov1006 and dbprov2006 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1172021 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [11:17:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P79727 and previous config saved to /var/cache/conftool/dbconfig/20250723-111725-fceratto.json [11:23:40] (03CR) 10Bartosz Wójtowicz: ml-services: update RRLA and RRML images (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [11:32:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P79728 and previous config saved to /var/cache/conftool/dbconfig/20250723-113233-fceratto.json [11:35:37] (03CR) 10Arnaudb: [C:03+1] "looks good to me! 
thanks for the modification" [cookbooks] - 10https://gerrit.wikimedia.org/r/1172015 (https://phabricator.wikimedia.org/T400121) (owner: 10Jelto) [11:45:47] (03PS12) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [11:47:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T399728)', diff saved to https://phabricator.wikimedia.org/P79729 and previous config saved to /var/cache/conftool/dbconfig/20250723-114740-fceratto.json [11:47:45] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:47:56] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2197.codfw.wmnet with reason: Maintenance [11:48:16] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [11:48:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2214.codfw.wmnet with reason: Maintenance [11:48:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T399728)', diff saved to https://phabricator.wikimedia.org/P79730 and previous config saved to /var/cache/conftool/dbconfig/20250723-114853-fceratto.json [11:51:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T399728)', diff saved to https://phabricator.wikimedia.org/P79731 and previous config saved to /var/cache/conftool/dbconfig/20250723-115137-fceratto.json [11:52:56] (03CR) 10Jelto: [C:03+2] gitlab failover: improve message for API token [cookbooks] - 10https://gerrit.wikimedia.org/r/1172015 (https://phabricator.wikimedia.org/T400121) (owner: 10Jelto) [11:54:35] (03PS2) 10Ayounsi: k8s: replace legacy codfw vlans with future 
legacy eqiad vlans [puppet] - 10https://gerrit.wikimedia.org/r/1172001 [11:55:04] PROBLEM - Host an-worker1179 is DOWN: PING CRITICAL - Packet loss = 100% [11:56:55] (03CR) 10Kosta Harlan: [C:03+1] Enable wgWikimediaEventsCreateAccountInstrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172016 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó) [11:57:38] (03CR) 10Clément Goubert: [C:03+1] k8s: replace legacy codfw vlans with future legacy eqiad vlans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1172001 (owner: 10Ayounsi) [11:57:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T400061#11027513 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [11:58:56] PROBLEM - Host cp1106 is DOWN: PING CRITICAL - Packet loss = 100% [11:59:06] (03CR) 10Ayounsi: "The full list of hosts still on the old vlans are there : https://netbox.wikimedia.org/extras/scripts/results/221711/ from a quick look th" [puppet] - 10https://gerrit.wikimedia.org/r/1172001 (owner: 10Ayounsi) [11:59:34] RECOVERY - Host an-worker1179 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [11:59:49] (03Merged) 10jenkins-bot: gitlab failover: improve message for API token [cookbooks] - 10https://gerrit.wikimedia.org/r/1172015 (https://phabricator.wikimedia.org/T400121) (owner: 10Jelto) [12:02:26] RECOVERY - Host cp1106 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [12:03:58] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [12:04:20] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp1106 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [12:05:06] PROBLEM - haproxy process on cp1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy 
https://wikitech.wikimedia.org/wiki/HAProxy [12:05:20] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp1106 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2025-09-15 06:00:30 +0000 (expires in 53 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:05:58] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 48 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:06:06] RECOVERY - haproxy process on cp1106 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [12:06:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P79732 and previous config saved to /var/cache/conftool/dbconfig/20250723-120645-fceratto.json [12:06:45] (03PS1) 10Jelto: Gitlab: switchover from gitlab2002 to gitlab1004 [puppet] - 10https://gerrit.wikimedia.org/r/1172026 (https://phabricator.wikimedia.org/T400252) [12:07:29] (03CR) 10FNegri: [C:03+1] team-wmcs: neutron: Stop using min_over_time [alerts] - 10https://gerrit.wikimedia.org/r/1172018 (https://phabricator.wikimedia.org/T399705) (owner: 10Majavah) [12:08:49] (03CR) 10Arnaudb: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1172026 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [12:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:12:27] (03CR) 10Majavah: [C:03+2] team-wmcs: neutron: Stop using min_over_time [alerts] - 10https://gerrit.wikimedia.org/r/1172018 (https://phabricator.wikimedia.org/T399705) (owner: 10Majavah) [12:12:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link errors: ssw1-d1-codfw <-> ssw1-f1-codfw - https://phabricator.wikimedia.org/T400253 (10cmooney) 03NEW p:05Triage→03Medium [12:14:17] (03Merged) 10jenkins-bot: team-wmcs: neutron: Stop using min_over_time [alerts] - 10https://gerrit.wikimedia.org/r/1172018 (https://phabricator.wikimedia.org/T399705) (owner: 10Majavah) [12:15:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:21:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P79733 and previous config saved to /var/cache/conftool/dbconfig/20250723-122152-fceratto.json [12:23:12] (03CR) 10Jelto: [C:04-1] "thanks for the review! 
This should not be merged before the cookbook run" [puppet] - 10https://gerrit.wikimedia.org/r/1172026 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [12:23:17] 10ops-eqiad, 06SRE, 06DC-Ops: decom cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T400157#11027566 (10Jclark-ctr) 05Open→03Resolved [12:23:50] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:24:30] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:25:44] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:25:45] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:27:03] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:27:20] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:27:35] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:27:54] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:28:37] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:28:39] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:28:45] (03CR) 10Vgutierrez: [C:03+1] haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [12:28:56] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:29:14] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:29:27] !log gmodena@deploy1003 helmfile [staging] DONE 
helmfile.d/services/mw-page-content-change-enrich: apply [12:31:46] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:32:10] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:36:03] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11027638 (10Jclark-ctr) @VRiley-WMF When you get a chance, can you update the ticket with the cable lengths you've come up with? Thanks! [12:36:59] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11027639 (10Jclark-ctr) Received replacement drive. Btullis is off tomorrow should be able to swap tomorrow [12:37:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T399728)', diff saved to https://phabricator.wikimedia.org/P79734 and previous config saved to /var/cache/conftool/dbconfig/20250723-123659-fceratto.json [12:37:05] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:37:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2217.codfw.wmnet with reason: Maintenance [12:37:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T399728)', diff saved to https://phabricator.wikimedia.org/P79735 and previous config saved to /var/cache/conftool/dbconfig/20250723-123722-fceratto.json [12:40:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T399728)', diff saved to https://phabricator.wikimedia.org/P79736 and previous config saved to /var/cache/conftool/dbconfig/20250723-124003-fceratto.json [12:45:43] (03CR) 10Herron: [C:03+2] role::titan: install promtool [puppet] - 10https://gerrit.wikimedia.org/r/1171591 
(https://phabricator.wikimedia.org/T349521) (owner: 10Herron) [12:48:53] (03PS1) 10Jelto: Gitlab: switchover from gitlab2002 to gitlab1004 [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) [12:51:07] (03CR) 10Arnaudb: [C:03+2] Gitlab: switchover from gitlab2002 to gitlab1004 [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [12:51:19] (03CR) 10Arnaudb: Gitlab: switchover from gitlab2002 to gitlab1004 [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [12:52:39] (03CR) 10Arnaudb: [C:03+1] "lgtm, sorry for the accidental +2" [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [12:52:45] (03CR) 10Jelto: [C:04-1] "This should not be merged before the cookbook run" [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [12:55:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P79737 and previous config saved to /var/cache/conftool/dbconfig/20250723-125510-fceratto.json [12:57:38] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:57:50] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:58:44] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:00:27] (03CR) 10Ayounsi: [C:03+1] "LGTM, you also need to add 64613 to config/sites.yaml codfw: -> customers:" [homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037) (owner: 10Cathal Mooney) [13:00:43] * Lucas_WMDE also sees no gerrit patches in the deployment calendar [13:01:49] jouncebot: refresh [13:01:50] I refreshed my knowledge about deployments. [13:01:56] jouncebot: nowandnext [13:01:56] For the next 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1300) [13:01:56] In 1 hour(s) and 28 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1430) [13:02:00] Excellent. [13:04:51] (03PS13) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [13:06:50] (03PS14) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [13:07:31] (03CR) 10JHathaway: [C:03+1] redfish: improve is_uefi for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:10:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P79738 and previous config saved to /var/cache/conftool/dbconfig/20250723-131018-fceratto.json [13:10:51] (03PS14) 10Fabfur: haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [13:11:25] FIRING: [2x] SystemdUnitFailed: 
docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:45] (03CR) 10Vgutierrez: [C:03+1] haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [13:14:56] (03CR) 10Fabfur: [C:03+2] haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [13:15:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:16:07] (03CR) 10Vgutierrez: [C:03+1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [13:18:22] (03CR) 10Clément Goubert: [C:03+1] "I don't see any host from kubernetes clusters." [puppet] - 10https://gerrit.wikimedia.org/r/1172001 (owner: 10Ayounsi) [13:20:20] (03CR) 10Effie Mouzeli: [C:03+2] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [13:21:02] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11027750 (10elukey) @DLynch Hi! 
Gentle ping :) [13:22:40] (03PS1) 10Federico Ceratto: Add wmfmariadbpy package generation [puppet] - 10https://gerrit.wikimedia.org/r/1172025 (https://phabricator.wikimedia.org/T397305) [13:25:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T399728)', diff saved to https://phabricator.wikimedia.org/P79739 and previous config saved to /var/cache/conftool/dbconfig/20250723-132525-fceratto.json [13:25:30] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:25:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2224.codfw.wmnet with reason: Maintenance [13:25:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T399728)', diff saved to https://phabricator.wikimedia.org/P79740 and previous config saved to /var/cache/conftool/dbconfig/20250723-132548-fceratto.json [13:26:18] (03PS3) 10Federico Ceratto: Add MariaDB test-s8 section VMs [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) [13:26:18] (03CR) 10Federico Ceratto: "Prepare deployment of test DB hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: 10Federico Ceratto) [13:27:36] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [13:28:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T399728)', diff saved to https://phabricator.wikimedia.org/P79741 and previous config saved to 
/var/cache/conftool/dbconfig/20250723-132831-fceratto.json [13:32:54] jouncebot: nowandnext [13:32:54] For the next 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1300) [13:32:54] In 0 hour(s) and 57 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1430) [13:33:19] ayounsi@cumin1003 netbox (PID 3084578) is awaiting input [13:34:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172016 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó) [13:34:54] (03PS1) 10Vgutierrez: site,lvs,cumin: Stop using lvs1013 as liberica canary instance [puppet] - 10https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) [13:35:03] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ssw1-d1-eqiad mgmt - ayounsi@cumin1003" [13:35:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ssw1-d1-eqiad mgmt - ayounsi@cumin1003" [13:35:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:35:44] (03Merged) 10jenkins-bot: Enable wgWikimediaEventsCreateAccountInstrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172016 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó) [13:35:50] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: PyEz "ignore_warnings" does not work for port-block speed change warning - https://phabricator.wikimedia.org/T400261 (10cmooney) 03NEW p:05Triage→03Medium [13:35:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:36:21] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1172016|Enable wgWikimediaEventsCreateAccountInstrumentation (T394744)]] [13:36:26] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744 [13:37:41] (03CR) 10CI reject: [V:04-1] site,lvs,cumin: Stop using lvs1013 as liberica canary instance [puppet] - 10https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) (owner: 10Vgutierrez) [13:38:35] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1172016|Enable wgWikimediaEventsCreateAccountInstrumentation (T394744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:39:21] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2030.codfw.wmnet with OS bookworm [13:40:03] !log mszabo@deploy1003 mszabo: Continuing with sync [13:42:46] (03PS1) 10Cathal Mooney: JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) [13:43:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P79743 and previous config saved to /var/cache/conftool/dbconfig/20250723-134338-fceratto.json [13:44:58] (03PS2) 10Cathal Mooney: JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) [13:45:53] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172016|Enable wgWikimediaEventsCreateAccountInstrumentation (T394744)]] (duration: 09m 
31s) [13:45:58] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744 [13:47:23] (03CR) 10Ayounsi: [C:03+1] "lgtm!" [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney) [13:48:30] 10SRE-SLO, 10Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071#11027868 (10herron) This morning I've done: ` herron@prometheus1005:~/tmp/backfill/tonecheck$ time promtool tsdb create-blocks-from rules --start=2025-07-01T00:00:00Z --end=2025-07-02T00:00... [13:50:54] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#11027873 (10jijiki) Hey folks, I ran into this issue myself, having CI failing my patches over and over again. ` py2-pep8: skipped because could... [13:52:05] (03CR) 10Kevin Bazira: ml-services: update RRLA and RRML images (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira) [13:56:41] 06SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11027886 (10Joe) >>! In T400238#11027254, @Vgutierrez wrote: >> For now, we might also want to check for a mw session token instead. > Please correct me if I’m wrong, but in this case, validation is just a matter of checki... 
[13:58:31] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2030.codfw.wmnet with reason: host reimage [13:58:39] (03CR) 10CI reject: [V:04-1] JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney) [13:58:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P79745 and previous config saved to /var/cache/conftool/dbconfig/20250723-135846-fceratto.json [13:59:08] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:59:50] 06SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11027907 (10Vgutierrez) it's not uncommon to have several keys in place at any given point in time, it should be fine in terms of performance as long as we keep it under control [14:01:53] swfrench-wmf, urandom, all smooth [14:02:21] elukey@cumin1003 provision (PID 3087619) is awaiting input [14:02:33] thanks, XioNoX! [14:03:09] thanks! 
[14:03:09] (03CR) 10Vgutierrez: [C:03+1] "runbooks need some work but this can be merged (please fix the commit message typo)" [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [14:03:15] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:03:31] (03PS18) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [14:03:32] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2030.codfw.wmnet with reason: host reimage [14:04:03] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:04:33] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:05:16] (03PS19) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [14:05:39] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:05:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:06:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:08:19] (03PS20) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [14:10:17] 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11027982 (10Samwalton9-WMF) Novem is a productive and capable volunteer developer and I think he can be trusted with this access. [14:13:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T399728)', diff saved to https://phabricator.wikimedia.org/P79746 and previous config saved to /var/cache/conftool/dbconfig/20250723-141353-fceratto.json [14:14:00] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:16:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:17:18] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:18:48] (03PS2) 10Elukey: redfish: improve is_uefi for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) [14:19:02] (03CR) 10Elukey: redfish: improve is_uefi for Supermicro (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [14:19:54] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [14:21:36] (03CR) 10Scott French: [C:03+1] "Good catch, and thank you for doing that!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [14:21:54] 10SRE-SLO, 10EditCheck, 10Lift-Wing, 06Machine-Learning-Team, 10Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11028049 (10elukey) [14:22:39] (03PS3) 10Cathal Mooney: JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) [14:25:53] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:26:34] 06SRE, 10Hiddenparma, 06Traffic: Browser behaviour detection at the edge - https://phabricator.wikimedia.org/T400270 (10Joe) 03NEW [14:26:58] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:27:11] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:30:06] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1430) [14:31:50] (03PS1) 10Bking: mw-content-history-reconcile-enrich: increase jobmanager.memory.off-heap.size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172053 (https://phabricator.wikimedia.org/T395984) [14:34:21] (03CR) 10CI reject: [V:04-1] JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney) [14:37:48] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2030.codfw.wmnet with OS bookworm [14:38:02] (03CR) 10Elukey: [C:03+2] Remove golang-1.17 and golang-1.18 images 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [14:38:12] (03CR) 10Elukey: [V:03+2 C:03+2] Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [14:38:48] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2031.codfw.wmnet with OS bookworm [14:40:09] (03PS11) 10Fabfur: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) [14:40:20] (03CR) 10Fabfur: "ack, added some extra info to that page too" [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [14:40:24] 06SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11028148 (10Vgutierrez) @Tgr would it be possible to perform some lightweight validation of current MediaWiki session tokens? For example, checking whether the token has a specific length, or whether it's valid base64 / ba... [14:41:52] (03CR) 10Fabfur: [C:03+2] traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [14:43:26] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11028165 (10Jhancock.wm) @Marostegui lemme know when you want to do es2036 [14:43:42] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028166 (10dancy) Thanks for the fixes @Scott_French ! 
[14:46:34] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:46:50] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [14:48:29] (03PS2) 10Vgutierrez: site,lvs,cumin: Stop using lvs1013 as liberica canary instance [puppet] - 10https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) [14:48:39] (03CR) 10Gmodena: [C:03+1] mw-content-history-reconcile-enrich: increase jobmanager.memory.off-heap.size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172053 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking) [14:49:02] (03PS10) 10Tiziano Fogli: nrpe wrapper: add wrapper to be invoked a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) [14:49:02] (03CR) 10Tiziano Fogli: "This patch is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [14:52:56] (03CR) 10Herron: [C:03+1] nrpe wrapper: add wrapper to be invoked a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [14:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:02] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028177 (10Scott_French) [14:58:28] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2031.codfw.wmnet with reason: host reimage [15:02:17] 06SRE, 10Hiddenparma, 06Traffic: Browser behaviour detection at the edge - 
https://phabricator.wikimedia.org/T400270#11028212 (10Joe) [15:03:13] (03PS1) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [15:03:23] (03CR) 10Fabfur: [C:03+1] "LGTM, I would just mention in the commit message the change in the test but it's a [nit]" [puppet] - 10https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) (owner: 10Vgutierrez) [15:03:34] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2031.codfw.wmnet with reason: host reimage [15:04:09] (03PS2) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:40] (03PS3) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [15:10:04] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275 (10RobH) 03NEW [15:10:28] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11028250 (10RobH) [15:11:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes 
(http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:11:42] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11028268 (10RobH) a:03Jgreen @Jgreen, As discussed in IRC, I'm assigning this over to you to double-check the assumed hostnames and update the racking details as you see f... [15:12:54] 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277 (10ttaylor) 03NEW [15:12:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:13:05] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11028283 (10Jhancock.wm) quick update on one of the last things in this list. Cyrus One is still working on getting us a badge reader for the door. I opened a ticket with them on the 15th. Th... 
[15:13:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2148.codfw.wmnet with reason: Maintenance [15:13:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T399728)', diff saved to https://phabricator.wikimedia.org/P79749 and previous config saved to /var/cache/conftool/dbconfig/20250723-151325-fceratto.json [15:13:31] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:14:56] 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11028287 (10calbon) I approve this request. [15:15:02] 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11028288 (10ttaylor) I probably have some of these perms/group memberships but not all of them, and I have a new ssh key for this purpose. 
[15:16:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:16:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T399728)', diff saved to https://phabricator.wikimedia.org/P79750 and previous config saved to /var/cache/conftool/dbconfig/20250723-151630-fceratto.json [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:19] !log restarted haproxykafka on cp3071 due to unavailability [15:17:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:21:31] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11028308 (10MatthewVernon) I don't think there's anything more I can do here, I'm afraid. 
[15:30:00] (03PS2) 10JHathaway: reposync: don't enforce ownership after init [puppet] - 10https://gerrit.wikimedia.org/r/993797 [15:30:40] (03PS1) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) [15:31:24] 06SRE, 06FR-donorrelations: Custom URL for survey pop-up - https://phabricator.wikimedia.org/T400278 (10EBrill-WMF) 03NEW [15:31:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P79751 and previous config saved to /var/cache/conftool/dbconfig/20250723-153137-fceratto.json [15:31:44] (03PS2) 10Stevemunene: dse-k8s: deploy etcd service [puppet] - 10https://gerrit.wikimedia.org/r/1171584 (https://phabricator.wikimedia.org/T397293) [15:31:50] (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [15:33:13] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frpig2001 pay-lvs2001 pay-lvs2002 - https://phabricator.wikimedia.org/T397868#11028369 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:33:48] (03PS4) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [15:36:20] (03PS5) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [15:37:04] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6402/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:37:40] (03PS2) 10Fabfur: 
haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) [15:38:49] (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [15:38:50] (03PS6) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [15:39:39] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6403/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:39:49] (03PS2) 10Bking: mw-content-history-reconcile-enrich: increase jobmanager.memory.off-heap.size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172053 (https://phabricator.wikimedia.org/T397330) [15:41:20] (03PS7) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [15:42:05] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6404/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:45:37] (03PS3) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) [15:46:29] (03CR) 10Bking: [C:03+2] mw-content-history-reconcile-enrich: increase jobmanager.memory.off-heap.size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172053 (https://phabricator.wikimedia.org/T397330) (owner: 10Bking) [15:46:45] !log fceratto@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P79752 and previous config saved to /var/cache/conftool/dbconfig/20250723-154645-fceratto.json [15:47:26] (03PS8) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [15:47:51] (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [15:48:11] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6405/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:49:54] 06SRE, 06Infrastructure-Foundations, 06serviceops: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028456 (10Ottomata) Should we perhaps use `latest` tag for Gitlab CI images? I suppose other things could break if the base image is silently upgraded between different pipeli... [15:50:30] (03PS4) 10Cathal Mooney: JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) [15:51:24] (03CR) 10Ottomata: [C:03+2] eventbus: register with team-data-engineering. [alerts] - 10https://gerrit.wikimedia.org/r/1168119 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena) [15:51:33] (03CR) 10Elukey: [C:03+2] redfish: improve is_uefi for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [15:51:34] (03CR) 10Ottomata: [C:03+2] eventgate: alert on traffic deviation. 
[alerts] - 10https://gerrit.wikimedia.org/r/1167620 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena) [15:51:49] (03PS4) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) [15:52:07] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:52:34] (03Merged) 10jenkins-bot: eventbus: register with team-data-engineering. [alerts] - 10https://gerrit.wikimedia.org/r/1168119 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena) [15:53:42] (03Merged) 10jenkins-bot: eventgate: alert on traffic deviation. [alerts] - 10https://gerrit.wikimedia.org/r/1167620 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena) [15:53:46] (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [15:58:04] (03CR) 10Vgutierrez: haproxykafka: fixed missing site in dashboard link (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [15:59:55] (03Merged) 10jenkins-bot: redfish: improve is_uefi for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [15:59:55] (03PS9) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [16:00:15] (03PS2) 10Volans: redfish: improve iDRAC 10 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172014 [16:00:38] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6406/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 
(https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [16:01:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T399728)', diff saved to https://phabricator.wikimedia.org/P79753 and previous config saved to /var/cache/conftool/dbconfig/20250723-160152-fceratto.json [16:01:54] (03CR) 10CI reject: [V:04-1] JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney) [16:01:58] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:02:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2175.codfw.wmnet with reason: Maintenance [16:02:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T399728)', diff saved to https://phabricator.wikimedia.org/P79754 and previous config saved to /var/cache/conftool/dbconfig/20250723-160215-fceratto.json [16:03:08] (03PS5) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) [16:03:52] (03CR) 10Subramanya Sastry: "Works for me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian) [16:04:17] (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [16:04:22] (03CR) 10Subramanya Sastry: [C:03+1] Enable the "Report Visual Bug" feature of Extension:ParserMigration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. 
Scott Ananian) [16:05:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T399728)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250723-160516-fceratto.json [16:05:37] (03PS10) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [16:06:01] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028543 (10Scott_French) @Ottomata - So, image build workflows in CI that use the `latest` tag would still have been affected by this, but they would have... [16:06:07] (03PS6) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) [16:06:22] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6407/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [16:06:42] (03PS1) 10Ahmon Dancy: cli.py: The mode/action argument is required [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172064 [16:07:19] (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [16:07:56] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11028557 (10wiki_willy) a:05Papaul→03Jhancock.wm Hi @Jhancock.wm - since @Papaul is out on sabbatical, can you take a look at this one? It's related... 
[16:09:17] (03CR) 10CI reject: [V:04-1] cli.py: The mode/action argument is required [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172064 (owner: 10Ahmon Dancy) [16:11:13] (03CR) 10Volans: [C:03+2] redfish: improve iDRAC 10 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172014 (owner: 10Volans) [16:11:19] (03PS11) 10CDobbins: dnsrecursor: remove hardcoded values and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [16:12:03] (03PS7) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) [16:13:12] (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [16:14:31] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6408/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [16:17:01] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028586 (10Scott_French) [16:20:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P79755 and previous config saved to /var/cache/conftool/dbconfig/20250723-162028-fceratto.json [16:24:47] (03PS12) 10CDobbins: dnsrecursor: remove hardcoded values and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [16:26:03] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 
10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [16:29:22] (03PS1) 10Ahmon Dancy: tox.ini: Pass --diff to black [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172071 [16:29:39] 06SRE, 06Data-Engineering: WE 5.4 FY 25/26: Improve automata detection at the edge and pass it to the refinery pipeline - https://phabricator.wikimedia.org/T396562#11028642 (10Milimetric) Data Engineering is ready to do or help with this work whenever you need. [16:30:14] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2031.codfw.wmnet with OS bookworm [16:31:11] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2032.codfw.wmnet with OS bookworm [16:31:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:32:54] (03PS2) 10Ahmon Dancy: cli.py: The mode/action argument is required [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172064 [16:35:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P79756 and previous config saved to /var/cache/conftool/dbconfig/20250723-163536-fceratto.json [16:35:49] (03PS13) 10CDobbins: dnsrecursor: remove hardcoded values and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [16:36:27] RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:37:04] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [16:38:56] (03PS1) 10Ahmon Dancy: cli.py: Improve UX when config file does not exist [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172072 [16:40:28] (03PS5) 10Cathal Mooney: JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) [16:42:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:24] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [16:50:37] (03PS14) 10CDobbins: dnsrecursor: remove hardcoded values and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) [16:50:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T399728)', diff saved to https://phabricator.wikimedia.org/P79757 and previous config saved to /var/cache/conftool/dbconfig/20250723-165043-fceratto.json [16:50:49] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:50:59] !log fceratto@cumin1002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2189.codfw.wmnet with reason: Maintenance [16:51:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T399728)', diff saved to https://phabricator.wikimedia.org/P79758 and previous config saved to /var/cache/conftool/dbconfig/20250723-165106-fceratto.json [16:51:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:52:01] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [16:53:31] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2032.codfw.wmnet with reason: host reimage [16:54:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T399728)', diff saved to https://phabricator.wikimedia.org/P79759 and previous config saved to /var/cache/conftool/dbconfig/20250723-165407-fceratto.json [16:56:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:57:05] (03PS27) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [16:57:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes 
(http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:58:13] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [16:58:38] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2032.codfw.wmnet with reason: host reimage [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1700) [17:04:42] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [17:04:50] I didn't get a chance to explicitly schedule it today, but I'll be deploying mediawiki shortly to pick up an image builder change [17:08:28] !log swfrench@deploy1003 Started scap sync-world: Deploy to remove php-ldap from debug images [17:09:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P79761 and previous config saved to /var/cache/conftool/dbconfig/20250723-170915-fceratto.json [17:10:26] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028782 (10Ottomata) FTR, went with versioned tag for repeatability. 
[17:10:36] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11028781 (10Krinkle) a:03Krinkle [17:11:09] !log swfrench@deploy1003 Finished scap sync-world: Deploy to remove php-ldap from debug images (duration: 03m 29s) [17:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:12:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11028786 (10elukey) @Jclark-ctr we have found a workaround for provisioning and reimage that seems to have worked for ml-serve1012, I'll have to do more tests so for th... [17:17:31] 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T399916#11028841 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:17:48] (03PS7) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [17:22:41] !log deleted tags for docker-registry.discovery.wmnet/mediawiki-httpd-bookworm - T378128 [17:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:46] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:23:59] !log deleted tags for docker-registry.discovery.wmnet/httpd-fcgi-bookworm - T378128 [17:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to 
https://phabricator.wikimedia.org/P79762 and previous config saved to /var/cache/conftool/dbconfig/20250723-172423-fceratto.json [17:25:00] !log deleted tags for docker-registry.discovery.wmnet/httpd-bookworm - T378128 [17:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:39] PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 153346 MB (4% inode=99%): /var/lib/hadoop/data/m 156622 MB (4% inode=99%): /var/lib/hadoop/data/d 148261 MB (3% inode=99%): /var/lib/hadoop/data/b 153087 MB (4% inode=99%): /var/lib/hadoop/data/e 156701 MB (4% inode=99%): /var/lib/hadoop/data/g 157130 MB (4% inode=99%): /var/lib/hadoop/data/f 157566 MB (4% inode=99%): /var/lib/hadoop/data [17:27:39] 7 MB (4% inode=99%): /var/lib/hadoop/data/i 156062 MB (4% inode=99%): /var/lib/hadoop/data/j 158423 MB (4% inode=99%): /var/lib/hadoop/data/l 159446 MB (4% inode=99%): /var/lib/hadoop/data/c 153552 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops [17:30:31] PROBLEM - Disk space on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 173357 MB (4% inode=99%): /var/lib/hadoop/data/d 175314 MB (4% inode=99%): /var/lib/hadoop/data/j 167486 MB (4% inode=99%): /var/lib/hadoop/data/f 177228 MB (4% inode=99%): /var/lib/hadoop/data/g 186771 MB (4% inode=99%): /var/lib/hadoop/data/i 173587 MB (4% inode=99%): /var/lib/hadoop/data/b 186891 MB (4% inode=99%): /var/lib/hadoop/data [17:30:31] 0 MB (4% inode=99%): 
/var/lib/hadoop/data/e 182198 MB (4% inode=99%): /var/lib/hadoop/data/h 149613 MB (3% inode=99%): /var/lib/hadoop/data/k 170259 MB (4% inode=99%): /var/lib/hadoop/data/m 184454 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1128&var-datasource=eqiad+prometheus/ops [17:36:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:37:02] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 63516 [17:37:52] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 63516 [17:39:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T399728)', diff saved to https://phabricator.wikimedia.org/P79763 and previous config saved to /var/cache/conftool/dbconfig/20250723-173930-fceratto.json [17:39:36] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:39:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2197.codfw.wmnet with reason: Maintenance [17:40:37] (03PS1) 10Ottomata: eventgate-*-external - bump to 1.16.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172079 (https://phabricator.wikimedia.org/T376026) [17:40:57] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2207.codfw.wmnet with reason: Maintenance [17:41:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T399728)', diff saved to https://phabricator.wikimedia.org/P79764 and previous 
config saved to /var/cache/conftool/dbconfig/20250723-174104-fceratto.json [17:41:13] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 8309 [17:42:34] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8309 [17:43:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:44:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T399728)', diff saved to https://phabricator.wikimedia.org/P79765 and previous config saved to /var/cache/conftool/dbconfig/20250723-174405-fceratto.json [17:48:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:49:29] (03PS2) 10Cathal Mooney: Add ASN mapping and import policy for dse-k8s codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037) [17:49:40] (03CR) 10Cathal Mooney: "Ah good spot thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037) (owner: 10Cathal Mooney) [17:51:57] (03CR) 10Cathal Mooney: [C:03+2] Add ASN mapping and import policy for dse-k8s codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037) (owner: 10Cathal Mooney) [17:52:28] (03Merged) 10jenkins-bot: Add ASN mapping and import policy for dse-k8s codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037) (owner: 10Cathal Mooney) [17:58:20] (03PS1) 10Dzahn: Copied the global build ARGs from upstream docker file: [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172080 (https://phabricator.wikimedia.org/T268199) [17:59:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P79766 and previous config saved to /var/cache/conftool/dbconfig/20250723-175912-fceratto.json [17:59:44] (03CR) 10Dzahn: [C:03+2] "this was copying global build ARGs from upstream docker file." [container/codesearch] - 10https://gerrit.wikimedia.org/r/1171715 (owner: 10Dzahn) [18:00:04] dduvall and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1800). 
[18:00:53] (03Abandoned) 10Dzahn: use /sbin/tini as entrypoint [container/codesearch] - 10https://gerrit.wikimedia.org/r/1171630 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:01:12] (03CR) 10Dzahn: [C:03+2] Copied the global build ARGs from upstream docker file: [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172080 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:01:26] (03Merged) 10jenkins-bot: Copied the global build ARGs from upstream docker file: [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172080 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:03:52] 06SRE, 10SRE-Access-Requests, 06SRE Observability: Logstash access - https://phabricator.wikimedia.org/T400288 (10HCoplin-WMF) 03NEW [18:06:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:07:39] PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 155867 MB (4% inode=99%): /var/lib/hadoop/data/m 153377 MB (4% inode=99%): /var/lib/hadoop/data/d 151023 MB (4% inode=99%): /var/lib/hadoop/data/b 150224 MB (4% inode=99%): /var/lib/hadoop/data/e 152440 MB (4% inode=99%): /var/lib/hadoop/data/g 152904 MB (4% inode=99%): /var/lib/hadoop/data/f 153871 MB (4% inode=99%): /var/lib/hadoop/data [18:07:39] 4 MB (4% inode=99%): /var/lib/hadoop/data/i 152325 MB (4% inode=99%): /var/lib/hadoop/data/j 156451 MB (4% inode=99%): /var/lib/hadoop/data/l 154182 MB (4% inode=99%): /var/lib/hadoop/data/c 149200 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops [18:11:15] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh 
cabling task - https://phabricator.wikimedia.org/T396065#11029024 (10VRiley-WMF) So, looking at this, I believe the cable lengths would be the following. @Jclark-ctr would you agree? | Connection Type | Est. Length | Quantity... [18:11:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:11:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:14:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P79767 and previous config saved to /var/cache/conftool/dbconfig/20250723-181420-fceratto.json [18:18:10] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2032.codfw.wmnet with OS bookworm [18:19:02] 06SRE, 10SRE-Access-Requests, 06SRE Observability: Logstash access for HCoplin - https://phabricator.wikimedia.org/T400288#11029050 (10Dzahn) [18:21:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:21:46] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11029054 (10herron) >>! 
In T349521#9706188, @fgiunchedi wrote: > Following up from a chat yesterday: > > The idea of creating backfilled blocks... [18:21:46] (03PS2) 10Ottomata: eventgate-*-external - bump to 1.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172079 (https://phabricator.wikimedia.org/T376026) [18:21:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:22:09] (03PS1) 10Kosta Harlan: AuthManager: Move temp account login to continueAuthentication [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172082 (https://phabricator.wikimedia.org/T398270) [18:26:43] (03CR) 10Jforrester: [C:03+1] "Neat!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170616 (owner: 10Krinkle) [18:27:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T399728)', diff saved to https://phabricator.wikimedia.org/P79768 and previous config saved to /var/cache/conftool/dbconfig/20250723-182928-fceratto.json [18:29:33] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [18:29:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2225.codfw.wmnet with reason: Maintenance [18:29:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2225 
(T399728)', diff saved to https://phabricator.wikimedia.org/P79769 and previous config saved to /var/cache/conftool/dbconfig/20250723-182951-fceratto.json [18:32:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T399728)', diff saved to https://phabricator.wikimedia.org/P79770 and previous config saved to /var/cache/conftool/dbconfig/20250723-183254-fceratto.json [18:38:14] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11029120 (10Jclark-ctr) @VRiley-WMF I think that's a good start for cabling some might be short some might be a little long, but keep in mind that you lose over a meter in drop length from the... [18:39:04] (03CR) 10Ottomata: [C:03+2] eventgate-*-external - bump to 1.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172079 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata) [18:39:37] (03CR) 10Ottomata: "Old patch, should we abandon?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [18:40:52] dduvall: dancy hello again, is the train clear? :) [18:41:16] ottomata: it is not. rolling now :) [18:41:22] k will wait, ty! 
[18:41:33] (03Merged) 10jenkins-bot: eventgate-*-external - bump to 1.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172079 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata) [18:41:39] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172086 (https://phabricator.wikimedia.org/T396372) [18:41:41] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172086 (https://phabricator.wikimedia.org/T396372) (owner: 10TrainBranchBot) [18:42:32] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172086 (https://phabricator.wikimedia.org/T396372) (owner: 10TrainBranchBot) [18:42:32] gonna get ahead and just do staging instances and some testing [18:42:46] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [18:42:51] sounds good [18:43:17] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [18:47:25] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [18:47:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:47:55] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [18:48:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P79771 and previous config saved to /var/cache/conftool/dbconfig/20250723-184801-fceratto.json [18:50:09] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.11 refs 
T396372 [18:50:14] T396372: 1.45.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T396372 [18:51:52] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad [18:52:36] !log depool eqiad in preparation for rolling restart T399162 [18:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:40] T399162: Regression: Cirrus exact string regexp search for insource:/"u.a."/ has stopped working - https://phabricator.wikimedia.org/T399162 [18:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:55:31] ottomata: all clear! [18:57:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:58:23] ty! 
[18:58:48] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [18:59:29] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [18:59:57] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [19:00:25] !log deploying eventgate-analytics-external and eventgate-logging-external to get meta.dt logic change - T376026 [19:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:30] T376026: Update event-producing tools to overwrite `meta.dt` - https://phabricator.wikimedia.org/T376026 [19:01:12] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [19:01:28] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [19:01:32] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [19:02:03] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20250723 [19:02:19] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [19:03:05] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [19:03:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P79772 and previous config saved to /var/cache/conftool/dbconfig/20250723-190309-fceratto.json [19:04:30] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [19:06:03] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [19:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet 
CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:11:23] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release 20250723 [19:12:02] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029224 (10Jdforrester-WMF) >>! In T383557#11026158, @Scott_French wrote: > I'm no longer seeing any references to bullseye-backports in puppet, so I belie... [19:12:19] ottomata: let me know when you’re done please, as I’d like to deploy a MediaWiki patch [19:14:08] !log bking@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [19:14:13] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [19:14:35] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [19:16:02] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20250723 [19:16:58] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029230 (10Scott_French) @Jdforrester-WMF - Basically, the rebuilds would need to start at the first image that depends on `docker-registry.discovery.wmnet... 
[19:18:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T399728)', diff saved to https://phabricator.wikimedia.org/P79773 and previous config saved to /var/cache/conftool/dbconfig/20250723-191817-fceratto.json [19:18:24] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [19:18:28] (03PS1) 10RLazarus: deployment_server: Fix argparse double-dash handling in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1172090 (https://phabricator.wikimedia.org/T341553) [19:18:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2226.codfw.wmnet with reason: Maintenance [19:18:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2226 (T399728)', diff saved to https://phabricator.wikimedia.org/P79774 and previous config saved to /var/cache/conftool/dbconfig/20250723-191841-fceratto.json [19:19:04] (03CR) 10BCornwall: [C:03+1] wmnet: Remove maintenance.eqiad.wmnet record [dns] - 10https://gerrit.wikimedia.org/r/1171983 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [19:20:11] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029245 (10dancy) @Jdforrester-WMF I'll do the docker-pkg stuff and pass it by you for review. 
[19:20:34] !log bking@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [19:20:39] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [19:21:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T399728)', diff saved to https://phabricator.wikimedia.org/P79775 and previous config saved to /var/cache/conftool/dbconfig/20250723-192136-fceratto.json [19:21:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:24:01] jouncebot: nowandnext [19:24:02] For the next 0 hour(s) and 35 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1800) [19:24:02] In 0 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T2000) [19:24:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release 20250723 [19:25:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172082 (https://phabricator.wikimedia.org/T398270) (owner: 10Kosta Harlan) [19:26:36] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release 20250723 [19:28:44] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 
nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [19:28:48] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [19:29:40] (03Merged) 10jenkins-bot: AuthManager: Move temp account login to continueAuthentication [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172082 (https://phabricator.wikimedia.org/T398270) (owner: 10Kosta Harlan) [19:29:48] !log gitlab-runner* - apt-get upgrade - upgrading gitlab-runner, libgnutls30, ca-certificates [19:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:05] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1172082|AuthManager: Move temp account login to continueAuthentication (T398270)]] [19:30:10] T398270: Temp Account persists after logging in and out - https://phabricator.wikimedia.org/T398270 [19:32:18] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1172082|AuthManager: Move temp account login to continueAuthentication (T398270)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:33:47] (03CR) 10RLazarus: "This has the side effect that" [puppet] - 10https://gerrit.wikimedia.org/r/1172090 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [19:34:23] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029288 (10dancy) docker-registry.wikimedia.org/python3-devel:latest is another image that needs a rebuild. 
[19:34:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:36:06] !log kharlan@deploy1003 kharlan: Continuing with sync [19:36:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P79776 and previous config saved to /var/cache/conftool/dbconfig/20250723-193644-fceratto.json [19:41:03] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1023.eqiad.wmnet with OS bookworm [19:41:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian) [19:41:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian) [19:41:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. 
Scott Ananian) [19:41:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian) [19:41:44] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172082|AuthManager: Move temp account login to continueAuthentication (T398270)]] (duration: 11m 39s) [19:41:50] Done deploying [19:41:50] T398270: Temp Account persists after logging in and out - https://phabricator.wikimedia.org/T398270 [19:49:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:51:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P79777 and previous config saved to /var/cache/conftool/dbconfig/20250723-195152-fceratto.json [19:53:28] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: redfish-test [19:53:56] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [19:53:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:57:25] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve1012.eqiad.wmnet with reason: redfish-test [19:57:55] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1023.eqiad.wmnet with reason: host reimage [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T2000). [20:00:05] danisztls and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] o/ [20:02:17] o/ [20:02:21] I can deploy but I need a few minutes [20:02:39] my patches can be deployed together. i can spiderpig. [20:02:47] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1023.eqiad.wmnet with reason: host reimage [20:07:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T399728)', diff saved to https://phabricator.wikimedia.org/P79778 and previous config saved to /var/cache/conftool/dbconfig/20250723-200659-fceratto.json [20:07:07] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [20:07:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2238.codfw.wmnet with reason: Maintenance [20:07:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2238 (T399728)', diff saved to https://phabricator.wikimedia.org/P79779 and previous config saved to /var/cache/conftool/dbconfig/20250723-200722-fceratto.json [20:10:26] !log fceratto@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2238 (T399728)', diff saved to https://phabricator.wikimedia.org/P79780 and previous config saved to /var/cache/conftool/dbconfig/20250723-201025-fceratto.json [20:11:16] is anyone deploying right now? i'm going to start spiderpigging my config patches if not. [20:12:33] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172106 (https://phabricator.wikimedia.org/T390007) [20:12:41] (03CR) 10CI reject: [V:04-1] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172106 (https://phabricator.wikimedia.org/T390007) (owner: 10DDesouza) [20:12:57] I'm not doing anything yet, you can go for it as far as I'm concerned [20:13:25] Also I guess I could have deployed from my phone while eating lunch now that we have Spiderpig, but maybe better that I didn't :) [20:16:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1075-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:18:26] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1003.eqiad.wmnet with reason: redfish-test [20:19:56] ok, i'm going for it. [20:20:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian) [20:20:46] RoanKattouw: that's the sort of stress test that proves value though. 
;) [20:21:03] (03PS2) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172106 (https://phabricator.wikimedia.org/T390007) [20:21:12] (03Merged) 10jenkins-bot: Enable the "Report Visual Bug" feature of Extension:ParserMigration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian) [20:21:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian) [20:21:35] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1170549|Enable the "Report Visual Bug" feature of Extension:ParserMigration (T365371)]] [20:21:40] T365371: ParserMigration: Add "report visual bug" link - https://phabricator.wikimedia.org/T365371 [20:23:43] !log cscott@deploy1003 cscott: Backport for [[gerrit:1170549|Enable the "Report Visual Bug" feature of Extension:ParserMigration (T365371)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:24:52] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172106 (https://phabricator.wikimedia.org/T390007) (owner: 10DDesouza) [20:25:28] (03PS1) 10C. 
Scott Ananian: Create "report visual bug" dialog [extensions/ParserMigration] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172108 (https://phabricator.wikimedia.org/T365371) [20:25:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P79781 and previous config saved to /var/cache/conftool/dbconfig/20250723-202533-fceratto.json [20:25:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ParserMigration] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172108 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian) [20:26:53] !log cscott@deploy1003 cscott: Continuing with sync [20:26:56] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172106 (https://phabricator.wikimedia.org/T390007) (owner: 10DDesouza) [20:28:44] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [20:29:11] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [20:29:12] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [20:29:48] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [20:29:49] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [20:29:53] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11029439 (10bking) @Jhancock.wm following up on our IRC discussion yesterday, I've already spent hours troublesho... [20:30:15] RoanKattouw: the hard part to do from your phone is X-Wikimedia-Debug, I expect. 
[20:30:35] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [20:30:59] cscott: Yeah but if someone else requested the patch you can make them test it :) [20:32:08] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170549|Enable the "Report Visual Bug" feature of Extension:ParserMigration (T365371)]] (duration: 10m 32s) [20:32:13] T365371: ParserMigration: Add "report visual bug" link - https://phabricator.wikimedia.org/T365371 [20:32:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:32:40] i've got one more, hang on [20:33:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/ParserMigration] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172108 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian) [20:33:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian) [20:33:59] (03Merged) 10jenkins-bot: Disable ParserMigration indicator and user notice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian) [20:34:11] (03Merged) 10jenkins-bot: Create "report visual bug" dialog [extensions/ParserMigration] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172108 (https://phabricator.wikimedia.org/T365371) (owner: 10C. 
Scott Ananian) [20:34:35] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1172108|Create "report visual bug" dialog (T365371)]], [[gerrit:1165094|Disable ParserMigration indicator and user notice (T363484 T363472)]] [20:34:44] T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484 [20:34:45] T363472: MinT MVP: Support gradual deployments - https://phabricator.wikimedia.org/T363472 [20:36:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.326s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:37:03] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:37:09] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:37:41] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1023.eqiad.wmnet with OS bookworm [20:38:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:39:17] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:40:02] (03CR) 10Dzahn: [C:03+1] Gitlab: switchover from gitlab2002 to gitlab1004 [dns] - 
10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [20:40:09] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:40:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P79783 and previous config saved to /var/cache/conftool/dbconfig/20250723-204041-fceratto.json [20:41:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.047s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:43:14] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172110 [20:43:48] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Homer: PyEz "ignore_warnings" does not work for port-block speed change warning - https://phabricator.wikimedia.org/T400261#11029477 (10cmooney) [20:44:17] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:45:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:47:11] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire 
on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:47:36] RoanKattouw: can you deploy mine? [20:47:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:48:01] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.220 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:50:28] sorry i forgot that one of my backlogs touches localization and so it will Take Forever to rebuild the container images [20:50:48] i should have let danisztls slip in ahead of me [20:51:32] cscott: no problem, I can wait [20:55:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T399728)', diff saved to https://phabricator.wikimedia.org/P79784 and previous config saved to /var/cache/conftool/dbconfig/20250723-205548-fceratto.json [20:55:54] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [20:58:49] !log cscott@deploy1003 cscott: Backport for [[gerrit:1172108|Create "report visual bug" dialog (T365371)]], [[gerrit:1165094|Disable ParserMigration indicator and user notice (T363484 T363472)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:58:56] T365371: ParserMigration: Add "report visual bug" link - https://phabricator.wikimedia.org/T365371 [20:58:57] T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484 [20:58:57] T363472: MinT MVP: Support gradual deployments - https://phabricator.wikimedia.org/T363472 [21:00:05] (03PS1) 10Xcollazo: analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T2100) [21:00:32] (03CR) 10CI reject: [V:04-1] analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [21:01:20] (03CR) 10Cathal Mooney: [C:03+1] "<3" [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 (owner: 10Volans) [21:02:44] !log cscott@deploy1003 cscott: Continuing with sync [21:02:57] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:04:54] (03CR) 10Xcollazo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [21:06:11] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029488 (10Scott_French) [21:06:52] (03CR) 10Xcollazo: "Hmm.. tests are failing but the logs don't say why." 
[puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo) [21:08:17] (03CR) 10Volans: [C:03+2] setup.py: pin prospector [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 (owner: 10Volans) [21:11:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1075-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:15:32] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172108|Create "report visual bug" dialog (T365371)]], [[gerrit:1165094|Disable ParserMigration indicator and user notice (T363484 T363472)]] (duration: 40m 57s) [21:15:40] T365371: ParserMigration: Add "report visual bug" link - https://phabricator.wikimedia.org/T365371 [21:15:41] T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484 [21:15:41] T363472: MinT MVP: Support gradual deployments - https://phabricator.wikimedia.org/T363472 [21:21:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:22:34] (03Merged) 10jenkins-bot: setup.py: pin prospector [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 (owner: 10Volans) [21:24:20] (03CR) 10JHathaway: "looks good overall, just a few questions and ideas" [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [21:27:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170760 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [21:37:33] I'm done, sorry didn't immediately say that here. [21:37:49] RoanKattouw were you going to do danisztls' patch? [21:39:09] cscott: I was an hour ago but I am busy now, sorry [21:39:31] danisztls: if you're available to test, i'm happy to run spiderpig for you. [21:43:48] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven! Also wow, I definitely learned something from the sleuthing you did on why `REMAINDER` isn't documented." [puppet] - 10https://gerrit.wikimedia.org/r/1172090 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:50:30] (03CR) 10Scott French: [C:03+1] "Thanks, Ahmon!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172071 (owner: 10Ahmon Dancy) [21:52:08] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:52:46] (03CR) 10Scott French: [C:03+1] "Thanks, Ahmon!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172064 (owner: 10Ahmon Dancy) [21:53:26] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#11029549 (10nisrael) Great thank you Jesse!
Just want to confirm, am I safe to instruct our rep at DMarcian to restart our free trial? [21:54:43] (03CR) 10Scott French: [C:03+1] "Thanks, Ahmon!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172072 (owner: 10Ahmon Dancy) [21:55:04] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#11029551 (10jhathaway) >>! In T394788#11029549, @nisrael wrote: > Great thank you Jesse! Just want to confirm, am I safe to instruct our re... [21:55:29] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1022 - vriley@cumin1002" [21:55:33] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1022 - vriley@cumin1002" [21:55:34] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:56:05] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host clouddb1022 [21:57:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host clouddb1022 [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T2200) [22:05:13] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:14:04] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10Mail: Access Request to DMarcDigests - 
https://phabricator.wikimedia.org/T399976#11029580 (10jhathaway) @nisrael I sent you an invite, let me know if you can get in. [22:14:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11029581 (10VRiley-WMF) [22:16:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:16:31] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1100 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:33] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1081 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:36] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:36] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:37] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:37] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:40] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:40] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:41] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:41] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1119 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:42] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:44] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029582 (10Scott_French) [22:16:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:49] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:49] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:01] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:01] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:01] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:17:03] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:03] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:03] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:03] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:03] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:04] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1118 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:13] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1068 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:13] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:13] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:13] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:28] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:28] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:30] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:45] ^^ on it [22:17:52] eqiad is depooled so no user impact [22:18:12] thanks, inflatador! [22:18:25] saw your depool earlier, but was just about to ask :) [22:23:32] (03PS2) 10Ahmon Dancy: tox.ini: Pass --diff to black [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172071 [22:24:09] (03CR) 10Ahmon Dancy: tox.ini: Pass --diff to black (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172071 (owner: 10Ahmon Dancy) [22:25:11] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1094 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 59, [22:25:11] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2246, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:13] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1093 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 63, [22:25:13] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 4116, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:15] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1113 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: 
False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 63, [22:25:15] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 4774, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:15] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1123 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 63, [22:25:15] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 4934, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:15] swfrench-wmf np. 
we really need to figure out why quorum is an issue with this one particular cluster ;( [22:25:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1076 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [22:25:19] _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 939, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1071 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [22:25:19] _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 941, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1085 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [22:25:20] _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 950, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:20] RECOVERY - OpenSearch 
health check for shards on 9200 on cirrussearch1096 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [22:25:21] _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 961, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:21] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1098 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [22:25:22] _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 961, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:22] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1075 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [22:25:23] _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 966, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1087 is OK: OK - elasticsearch status production-search-eqiad: 
cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [22:25:24] _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 966, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:24] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1092 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [22:25:25] _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 963, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:25] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1079 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [22:25:26] _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 974, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:25:26] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1073 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, 
discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [22:25:27] _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 1002, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:39] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1078 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3944, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 417, delayed_unassigned_shards: 0, number_of_pendi [22:26:39] : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 61228, active_shards_percent_as_number: 90.43797294198578 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:39] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1088 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3944, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 417, delayed_unassigned_shards: 0, number_of_pendi [22:26:39] : 3, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 61237, active_shards_percent_as_number: 90.43797294198578 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:39] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1091 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, 
active_shards: 3944, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 417, delayed_unassigned_shards: 0, number_of_pendi [22:26:40] : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 61256, active_shards_percent_as_number: 90.43797294198578 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:40] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1102 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3944, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 417, delayed_unassigned_shards: 0, number_of_pendi [22:26:41] : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 61253, active_shards_percent_as_number: 90.43797294198578 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:41] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1074 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3944, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 417, delayed_unassigned_shards: 0, number_of_pendi [22:26:42] : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 61271, active_shards_percent_as_number: 90.43797294198578 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:42] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1099 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4090, relocating_shards: 0, 
initializing_shards: 2, unassigned_shards: 269, delayed_unassigned_shards: 0, number_of_pendi [22:26:42] (03CR) 10RLazarus: [C:03+2] deployment_server: Fix argparse double-dash handling in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1172090 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [22:26:43] : 59, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 64005, active_shards_percent_as_number: 93.7858289383169 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:43] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1125 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4090, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 269, delayed_unassigned_shards: 0, number_of_pendi [22:26:44] : 59, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 64018, active_shards_percent_as_number: 93.7858289383169 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1072 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4314, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 42, delayed_unassigned_shards: 0, number_of_pendin [22:26:45] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68059, active_shards_percent_as_number: 98.92226553542766 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1112 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, 
status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4314, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 42, delayed_unassigned_shards: 0, number_of_pendin [22:26:46] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68066, active_shards_percent_as_number: 98.92226553542766 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:46] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1077 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4314, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 42, delayed_unassigned_shards: 0, number_of_pendin [22:26:47] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68081, active_shards_percent_as_number: 98.92226553542766 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:47] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1103 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4314, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 42, delayed_unassigned_shards: 0, number_of_pendin [22:26:48] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68101, active_shards_percent_as_number: 98.92226553542766 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1086 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 
55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:26:53] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1108 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:26:53] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:55] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1095 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:26:55] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:55] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1101 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 
4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:26:55] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:55] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1124 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:26:56] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:56] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1115 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:26:57] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:57] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1084 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 
0, number_of_pending_ [22:26:58] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:58] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1069 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:26:59] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:26:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1111 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:27:00] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:27:00] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1118 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:27:01] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 
0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:27:07] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1068 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:27:07] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:27:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1100 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:27:23] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:27:25] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1114 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:27:25] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 
https://wikitech.wikimedia.org/wiki/Search%23Administration [22:27:27] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1081 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:27:27] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:27:27] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1120 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:27:27] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:27:27] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1090 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:27:28] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:27:28] RECOVERY - OpenSearch health 
check for shards on 9200 on cirrussearch1110 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_ [22:38:57] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:01] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:01] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:03] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:05] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:05] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:05] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:05] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:05] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:06] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:17] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:17] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:17] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:17] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1068 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:28] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:28] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:28] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:28] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:31] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1100 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:33] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:36] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:36] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:37] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:37] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:38] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:40] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:49] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:49] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:39:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:43:55] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1086 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1404, active_shards: 4280, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 81, delayed_unassigned_shards: 81, number_of_pending_ [22:43:55] 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1274, active_shards_percent_as_number: 98.14262783765192 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:43:55] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1122 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1404, active_shards: 4280, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 81, delayed_unassigned_shards: 81, number_of_pending_ [22:43:55] 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1528, active_shards_percent_as_number: 98.14262783765192 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:43:55] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1108 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1404, active_shards: 4280, relocating_shards: 8, initializing_shards: 0, 
unassigned_shards: 81, delayed_unassigned_shards: 81, number_of_pending_ [22:43:55] 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2046, active_shards_percent_as_number: 98.14262783765192 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:43:57] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1084 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin [22:43:57] 46, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3837, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:43:57] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1115 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin [22:43:57] 46, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3846, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:43:57] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1124 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 
162, number_of_pendin [22:43:58] 46, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3842, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:43:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1095 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin [22:43:59] 46, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3843, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:43:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1101 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin [22:44:00] 46, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3847, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:00] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1111 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin [22:44:01] 48, 
number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 5691, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:01] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1069 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin [22:44:02] 48, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 5717, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:09] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1123 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:09] ks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:09] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1093 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:09] ks: 0, number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:09] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1068 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:09] ks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:09] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1094 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:10] ks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:15] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1113 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:15] ks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1087 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:19] ks: 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 262, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1092 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:19] ks: 4, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 274, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1075 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:19] ks: 4, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 274, active_shards_percent_as_number: 
94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:20] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1076 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:20] ks: 4, number_of_in_flight_fetch: 330, task_max_waiting_in_queue_millis: 286, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:20] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1073 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:21] ks: 4, number_of_in_flight_fetch: 330, task_max_waiting_in_queue_millis: 286, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:21] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1098 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:22] ks: 4, number_of_in_flight_fetch: 495, task_max_waiting_in_queue_millis: 295, active_shards_percent_as_number: 94.47374455400137 
https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:22] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1079 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:23] ks: 4, number_of_in_flight_fetch: 880, task_max_waiting_in_queue_millis: 307, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1096 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:24] ks: 4, number_of_in_flight_fetch: 880, task_max_waiting_in_queue_millis: 307, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:24] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1071 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:25] ks: 4, number_of_in_flight_fetch: 1045, task_max_waiting_in_queue_millis: 321, active_shards_percent_as_number: 94.47374455400137 
https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:25] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1085 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:26] ks: 5, number_of_in_flight_fetch: 1265, task_max_waiting_in_queue_millis: 336, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:26] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1100 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen [22:44:27] ks: 39, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3728, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:44:27] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1114 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4176, relocating_shards: 7, initializing_shards: 4, unassigned_shards: 181, delayed_unassigned_shards: 121, number_of_pen [22:53:43] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster 
search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [22:53:48] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [22:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:58:48] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029630 (10Scott_French) [23:01:49] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f048ae0e1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [23:01:49] dia.org/wiki/Search%23Administration [23:01:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:01:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:01:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:01:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:01] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:01] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:05] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:08] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1118 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:08] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:14] ^^ we're still testing this, I think we have a root cause now [23:02:19] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:19] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:19] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:19] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1068 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:28] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:28] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:30] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:31] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1100 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:33] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1081 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:36] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:36] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:37] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:37] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:40] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:40] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:41] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:41] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1119 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:42] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:51] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:51] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:05:13] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1094 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 56, [23:05:13] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1811, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [23:05:13] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1068 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [23:05:13] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2788, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [23:05:15] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1093 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [23:05:15] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3232, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [23:05:15] RECOVERY - OpenSearch health check 
for shards on 9200 on cirrussearch1123 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [23:05:15] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3949, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [23:05:17] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1113 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 53, number_of_data_nodes: 53, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 4 [23:05:17] _of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 6873, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [23:05:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1076 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [23:05:19] _of_in_flight_fetch: 54, task_max_waiting_in_queue_millis: 1260, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [23:05:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1079 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: 
production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [23:05:19] _of_in_flight_fetch: 54, task_max_waiting_in_queue_millis: 1255, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [23:05:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1073 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [23:05:19] _of_in_flight_fetch: 54, task_max_waiting_in_queue_millis: 1261, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:39] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1088 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3878, relocating_shards: 0, initializing_shards: 37, unassigned_shards: 446, delayed_unassigned_shards: 0, number_of_pend [23:06:39] s: 63, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 76567, active_shards_percent_as_number: 88.92455858747994 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:39] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1091 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, 
number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3878, relocating_shards: 0, initializing_shards: 37, unassigned_shards: 446, delayed_unassigned_shards: 0, number_of_pend [23:06:39] s: 63, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 76575, active_shards_percent_as_number: 88.92455858747994 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:39] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1078 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3878, relocating_shards: 0, initializing_shards: 37, unassigned_shards: 446, delayed_unassigned_shards: 0, number_of_pend [23:06:39] s: 63, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 76596, active_shards_percent_as_number: 88.92455858747994 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:39] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1102 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3878, relocating_shards: 0, initializing_shards: 37, unassigned_shards: 446, delayed_unassigned_shards: 0, number_of_pend [23:06:40] s: 22, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 76597, active_shards_percent_as_number: 88.92455858747994 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:40] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1074 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, 
discovered_master: True, active_primary_shards: 1405, active_shards: 3899, relocating_shards: 0, initializing_shards: 16, unassigned_shards: 446, delayed_unassigned_shards: 0, number_of_pend [23:06:41] s: 22, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 76612, active_shards_percent_as_number: 89.40609951845907 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:43] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1099 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4062, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 292, delayed_unassigned_shards: 0, number_of_pendi [23:06:43] : 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 80124, active_shards_percent_as_number: 93.14377436367806 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:43] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1125 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4062, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 292, delayed_unassigned_shards: 0, number_of_pendi [23:06:43] : 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 80136, active_shards_percent_as_number: 93.14377436367806 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1072 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, 
active_primary_shards: 1405, active_shards: 4165, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pendi [23:06:45] : 22, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 83002, active_shards_percent_as_number: 95.50561797752809 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1112 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4165, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pendi [23:06:45] : 24, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 83009, active_shards_percent_as_number: 95.50561797752809 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1077 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4165, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pendi [23:06:46] : 23, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 83006, active_shards_percent_as_number: 95.50561797752809 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:46] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1103 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, 
active_shards: 4165, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pendi [23:06:47] : 31, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 83023, active_shards_percent_as_number: 95.50561797752809 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:49] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1122 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4282, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 69, delayed_unassigned_shards: 0, number_of_pendi [23:06:49] : 12, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 86669, active_shards_percent_as_number: 98.18848887869754 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1086 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4333, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 24, delayed_unassigned_shards: 0, number_of_pendin [23:06:53] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.35794542536117 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1108 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4333, relocating_shards: 0, 
initializing_shards: 4, unassigned_shards: 24, delayed_unassigned_shards: 0, number_of_pendin [23:06:53] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.35794542536117 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1069 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending [23:06:59] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1101 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending [23:06:59] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1095 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, 
delayed_unassigned_shards: 0, number_of_pending [23:06:59] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1115 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending [23:07:00] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:00] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1084 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending [23:07:01] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:01] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1124 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending [23:07:02] 1, 
number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:02] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1111 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending [23:07:03] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:03] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1118 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending [23:07:04] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1096 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:07:19] 0, number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1071 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:07:19] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1085 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:07:20] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:20] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1087 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:07:21] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:21] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1092 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:07:22] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:22] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1075 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:07:23] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1098 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:07:24] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 
https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:24] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1100 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:07:25] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:25] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1114 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:07:26] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:07:27] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1121 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:07:27] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration 
[23:07:27] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1090 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [23:08:57] !log bking@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 55 hosts with reason: testing cluster quorum [23:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:14:38] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad [23:15:03] !log pool cirrussearch eqiad, will resume investigations tomorrow T400160 [23:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:08] T400160: Investigate eqiad cluster quorum failure issues - https://phabricator.wikimedia.org/T400160 [23:16:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:38:05] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172124 [23:38:05] (03CR) 10TrainBranchBot: 
[C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1172124 (owner: TrainBranchBot) [23:39:44] dzahn@cumin2002 dzahn: The backup on gitlab2002 is complete, ready to proceed with upgrade. [23:42:44] dzahn@cumin2002 upgrade (PID 1166963) is awaiting input [23:46:09] !log ryankemper@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=search,name=codfw [23:46:35] !log [Cirrus] Depooled codfw in anticipation of rolling restart. Hopefully minimal noise on this one :) [23:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:58] !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - ryankemper@cumin1002 - T397227 [23:49:03] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [23:51:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:53:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1172124 (owner: TrainBranchBot) [23:54:09] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: security release 20250723 [23:54:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:55:03] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:55:11] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:56:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:56:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:59:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed