[00:05:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P79687 and previous config saved to /var/cache/conftool/dbconfig/20250723-000516-fceratto.json
[00:05:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P79688 and previous config saved to /var/cache/conftool/dbconfig/20250723-000558-marostegui.json
[00:08:39] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1171741
[00:08:39] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1171741 (owner: TrainBranchBot)
[00:15:34] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye
[00:16:11] <wikibugs>	 (CR) RLazarus: [C:+1] docker: remove bullseye-backports from sources.list [puppet] - https://gerrit.wikimedia.org/r/1171716 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:17:39] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2006-dev.codfw.wmnet with OS bullseye
[00:17:50] <wikibugs>	 (CR) Scott French: "Thanks, Reuven!" [puppet] - https://gerrit.wikimedia.org/r/1171716 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:18:12] <wikibugs>	 (CR) Scott French: [C:+2] docker: remove bullseye-backports from sources.list [puppet] - https://gerrit.wikimedia.org/r/1171716 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:20:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P79689 and previous config saved to /var/cache/conftool/dbconfig/20250723-002024-fceratto.json
[00:21:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T399249)', diff saved to https://phabricator.wikimedia.org/P79690 and previous config saved to /var/cache/conftool/dbconfig/20250723-002106-marostegui.json
[00:21:11] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[00:21:22] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2222.codfw.wmnet with reason: Maintenance
[00:21:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T399249)', diff saved to https://phabricator.wikimedia.org/P79691 and previous config saved to /var/cache/conftool/dbconfig/20250723-002129-marostegui.json
[00:31:13] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1171741 (owner: TrainBranchBot)
[00:33:09] <swfrench-wmf>	 !log ran DISTRIBUTIONS="bullseye" build-base-images on build2001 - T383557
[00:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:15] <stashbot>	 T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557
[00:35:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T399728)', diff saved to https://phabricator.wikimedia.org/P79692 and previous config saved to /var/cache/conftool/dbconfig/20250723-003535-fceratto.json
[00:35:40] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[00:35:51] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2202.codfw.wmnet with reason: Maintenance
[00:37:15] <wikibugs>	 (PS1) Scott French: php8.1: rebuild to pick up removal of bullseye-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171747 (https://phabricator.wikimedia.org/T383557)
[00:37:15] <wikibugs>	 (CR) Scott French: [V:+2] "Built locally with docker-pkg." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171747 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:37:28] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage
[00:37:33] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2203.codfw.wmnet with reason: Maintenance
[00:37:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T399728)', diff saved to https://phabricator.wikimedia.org/P79693 and previous config saved to /var/cache/conftool/dbconfig/20250723-003740-fceratto.json
[00:39:13] <wikibugs>	 (CR) RLazarus: [C:+1] php8.1: rebuild to pick up removal of bullseye-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171747 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:40:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T399728)', diff saved to https://phabricator.wikimedia.org/P79694 and previous config saved to /var/cache/conftool/dbconfig/20250723-004014-fceratto.json
[00:41:48] <wikibugs>	 (CR) Scott French: [V:+2] "Thanks, Reuven!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171747 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:41:57] <wikibugs>	 (CR) Scott French: [V:+2 C:+2] php8.1: rebuild to pick up removal of bullseye-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171747 (https://phabricator.wikimedia.org/T383557) (owner: Scott French)
[00:43:51] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage
[00:46:15] <swfrench-wmf>	 !log rebuilt php8.1 production images (8.1.33-1-s2) on build2001 - T383557
[00:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:19] <stashbot>	 T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557
[00:50:40] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11026451 (Scott_French) Alright, MediaWiki deployments should no longer be at risk: the php8.1 production images have been rebuilt on `docker-registry.dis...
[00:57:38] <wikibugs>	 (CR) Xcollazo: [C:+1] Disable all dumps timers on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/1170410 (https://phabricator.wikimedia.org/T398438) (owner: Btullis)
[01:07:19] <wikibugs>	 (CR) Scott French: "These are now broken, as bullseye-backports has been archived. Thus, it would be good to get this merged soon." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[01:12:34] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11026454 (Scott_French)
[01:21:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T399249)', diff saved to https://phabricator.wikimedia.org/P79695 and previous config saved to /var/cache/conftool/dbconfig/20250723-012120-marostegui.json
[01:21:25] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[01:25:52] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2216.codfw.wmnet with reason: Maintenance
[01:26:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T399728)', diff saved to https://phabricator.wikimedia.org/P79696 and previous config saved to /var/cache/conftool/dbconfig/20250723-012559-fceratto.json
[01:26:05] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[01:28:43] <logmsgbot>	 andrew@cumin1003 reimage (PID 3006722) is awaiting input
[01:29:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T399728)', diff saved to https://phabricator.wikimedia.org/P79697 and previous config saved to /var/cache/conftool/dbconfig/20250723-012944-fceratto.json
[01:30:07] <swfrench-wmf>	 jouncebot: nowandnext
[01:30:07] <jouncebot>	 No deployments scheduled for the next 4 hour(s) and 29 minute(s)
[01:30:08] <jouncebot>	 In 4 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T0600)
[01:31:01] <swfrench-wmf>	 FYI, I'm going to start a noop deployment to pick up new php8.1 production images
[01:32:21] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Test deployment to verify new php8.1 images - T383557
[01:32:26] <stashbot>	 T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557
[01:32:34] <logmsgbot>	 andrew@cumin1003 reimage (PID 3006722) is awaiting input
[01:36:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P79698 and previous config saved to /var/cache/conftool/dbconfig/20250723-013627-marostegui.json
[01:39:42] <wikibugs>	 (CR) Novem Linguae: zhwiki: Allow local securepoll setup (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: Stang)
[01:44:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P79699 and previous config saved to /var/cache/conftool/dbconfig/20250723-014451-fceratto.json
[01:51:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P79700 and previous config saved to /var/cache/conftool/dbconfig/20250723-015135-marostegui.json
[02:00:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P79701 and previous config saved to /var/cache/conftool/dbconfig/20250723-015959-fceratto.json
[02:04:33] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Test deployment to verify new php8.1 images - T383557 (duration: 34m 39s)
[02:04:37] <stashbot>	 T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557
[02:06:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T399249)', diff saved to https://phabricator.wikimedia.org/P79702 and previous config saved to /var/cache/conftool/dbconfig/20250723-020643-marostegui.json
[02:06:48] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[02:09:31] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11026493 (Scott_French)
[02:15:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T399728)', diff saved to https://phabricator.wikimedia.org/P79703 and previous config saved to /var/cache/conftool/dbconfig/20250723-021507-fceratto.json
[02:15:12] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[02:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:59:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:00:30] <icinga-wm>	 PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 153301 MB (4% inode=99%): /var/lib/hadoop/data/h 152125 MB (4% inode=99%): /var/lib/hadoop/data/b 163605 MB (4% inode=99%): /var/lib/hadoop/data/k 147931 MB (3% inode=99%): /var/lib/hadoop/data/m 148715 MB (3% inode=99%): /var/lib/hadoop/data/f 168500 MB (4% inode=99%): /var/lib/hadoop/data/j 155246 MB (4% inode=99%): /var/lib/hadoop/data
[03:00:30] <icinga-wm>	 6 MB (4% inode=99%): /var/lib/hadoop/data/l 161132 MB (4% inode=99%): /var/lib/hadoop/data/i 146731 MB (3% inode=99%): /var/lib/hadoop/data/g 158114 MB (4% inode=99%): /var/lib/hadoop/data/c 142579 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops
[03:04:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:06:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:10:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:11:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[03:16:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[03:20:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:24:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[03:24:59] <jinxer-wm>	 Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[04:32:14] <wikibugs>	 SRE, Traffic, affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11026576 (Audiodude) Thanks again @Scott_French for the extremely helpful analysis! I plan to submit a PR to mwclient to update the docs for that method to indicate whi...
[04:43:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:48:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:51:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqiad and fe80::ee38:73ff:fee7:bc68 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:56:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqiad and fe80::ee38:73ff:fee7:bc68 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:58:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:03:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T0600)
[06:11:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:20:30] <icinga-wm>	 PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 159656 MB (4% inode=99%): /var/lib/hadoop/data/h 155636 MB (4% inode=99%): /var/lib/hadoop/data/b 155086 MB (4% inode=99%): /var/lib/hadoop/data/k 160693 MB (4% inode=99%): /var/lib/hadoop/data/m 156577 MB (4% inode=99%): /var/lib/hadoop/data/f 158585 MB (4% inode=99%): /var/lib/hadoop/data/j 153752 MB (4% inode=99%): /var/lib/hadoop/data
[06:20:30] <icinga-wm>	 1 MB (3% inode=99%): /var/lib/hadoop/data/l 152637 MB (4% inode=99%): /var/lib/hadoop/data/i 157805 MB (4% inode=99%): /var/lib/hadoop/data/g 154449 MB (4% inode=99%): /var/lib/hadoop/data/c 156183 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops
[06:39:22] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] role::titan: install promtool [puppet] - https://gerrit.wikimedia.org/r/1171591 (https://phabricator.wikimedia.org/T349521) (owner: Herron)
[06:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:06:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:08] <wikibugs>	 (CR) Cyndywikime: [C:+1] Growth: enable new way of refreshing LinkRecommendations for more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1164287 (https://phabricator.wikimedia.org/T386250) (owner: Michael Große)
[07:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:18:39] <wikibugs>	 (CR) Volans: "replies inline" [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[07:19:29] <logmsgbot>	 !log mvernon@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on aqs1012.eqiad.wmnet with reason: wait for eevans
[07:24:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:24:59] <jinxer-wm>	 Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[07:46:54] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2151.codfw.wmnet with reason: Maintenance
[07:47:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T399728)', diff saved to https://phabricator.wikimedia.org/P79704 and previous config saved to /var/cache/conftool/dbconfig/20250723-074700-fceratto.json
[07:47:06] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[07:49:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T399728)', diff saved to https://phabricator.wikimedia.org/P79705 and previous config saved to /var/cache/conftool/dbconfig/20250723-074945-fceratto.json
[07:51:27] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[07:51:44] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:04:49] <wikibugs>	 (CR) Elukey: "Thanks for the ping! I noticed:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[08:04:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P79706 and previous config saved to /var/cache/conftool/dbconfig/20250723-080453-fceratto.json
[08:08:22] <wikibugs>	 (CR) C. Scott Ananian: "My plan was to merge this first, and then let the other one ride the train after we were certain everything had settled down." [mediawiki-config] - https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: C. Scott Ananian)
[08:08:32] <wikibugs>	 (PS1) Clément Goubert: thumbor: Lower thumbor_workers, more memory [deployment-charts] - https://gerrit.wikimedia.org/r/1171982 (https://phabricator.wikimedia.org/T392348)
[08:15:41] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:15:55] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:17:50] <wikibugs>	 (CR) Elukey: [C:+1] role::titan: install promtool [puppet] - https://gerrit.wikimedia.org/r/1171591 (https://phabricator.wikimedia.org/T349521) (owner: Herron)
[08:20:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P79707 and previous config saved to /var/cache/conftool/dbconfig/20250723-082000-fceratto.json
[08:20:13] <wikibugs>	 (PS1) Clément Goubert: wmnet: Remove maintenance.eqiad.wmnet record [dns] - https://gerrit.wikimedia.org/r/1171983 (https://phabricator.wikimedia.org/T397017)
[08:27:40] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1170760 (https://phabricator.wikimedia.org/T399736) (owner: DDesouza)
[08:35:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T399728)', diff saved to https://phabricator.wikimedia.org/P79708 and previous config saved to /var/cache/conftool/dbconfig/20250723-083508-fceratto.json
[08:35:14] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[08:35:24] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2158.codfw.wmnet with reason: Maintenance
[08:35:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T399728)', diff saved to https://phabricator.wikimedia.org/P79709 and previous config saved to /var/cache/conftool/dbconfig/20250723-083531-fceratto.json
[08:38:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T399728)', diff saved to https://phabricator.wikimedia.org/P79710 and previous config saved to /var/cache/conftool/dbconfig/20250723-083814-fceratto.json
[08:40:01] <wikibugs>	 (PS1) Elukey: eventrouter: update Build-Depends to golang 1.19 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171985
[08:43:32] <wikibugs>	 (CR) Elukey: "https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1171985" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[08:44:03] <wikibugs>	 (CR) Clément Goubert: [C:+1] eventrouter: update Build-Depends to golang 1.19 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171985 (owner: Elukey)
[08:45:23] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] eventrouter: update Build-Depends to golang 1.19 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1171985 (owner: Elukey)
[08:46:12] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[08:46:20] <wikibugs>	 (CR) Elukey: Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[08:47:07] <wikibugs>	 (PS3) Elukey: Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[08:53:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P79711 and previous config saved to /var/cache/conftool/dbconfig/20250723-085321-fceratto.json
[09:06:47] <wikibugs>	 SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238 (Joe) NEW
[09:08:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P79712 and previous config saved to /var/cache/conftool/dbconfig/20250723-090829-fceratto.json
[09:14:49] <topranks>	 seen !log drain cr2-codfw of traffic to execute juniper commands to resolve stats issue T400205
[09:14:50] <stashbot>	 T400205: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205
[09:15:48] <logmsgbot>	 !log arnaudb@cumin1003 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org
[09:20:09] <wikibugs>	 (PS1) Gkyziridis: ml-services: Configure autoscaling for edit-check model. [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162)
[09:23:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T399728)', diff saved to https://phabricator.wikimedia.org/P79714 and previous config saved to /var/cache/conftool/dbconfig/20250723-092336-fceratto.json
[09:23:42] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[09:23:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:23:53] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2169.codfw.wmnet with reason: Maintenance
[09:24:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T399728)', diff saved to https://phabricator.wikimedia.org/P79715 and previous config saved to /var/cache/conftool/dbconfig/20250723-092359-fceratto.json
[09:26:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:26:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T399728)', diff saved to https://phabricator.wikimedia.org/P79716 and previous config saved to /var/cache/conftool/dbconfig/20250723-092641-fceratto.json
[09:27:55] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:28:01] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:29:43] <wikibugs>	 (CR) Hnowlan: [C:+1] thumbor: Lower thumbor_workers, more memory [deployment-charts] - https://gerrit.wikimedia.org/r/1171982 (https://phabricator.wikimedia.org/T392348) (owner: Clément Goubert)
[09:29:44] <wikibugs>	 (PS2) Gkyziridis: ml-services: Configure autoscaling for edit-check model. [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162)
[09:31:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:33:44] <wikibugs>	 (CR) Arnaudb: [C:+2] Gitlab: switchover between gitlab-replica-a and gitlab-replica-b [puppet] - https://gerrit.wikimedia.org/r/1171539 (https://phabricator.wikimedia.org/T400121) (owner: Arnaudb)
[09:35:11] <wikibugs>	 (CR) Gkyziridis: ml-services: Configure autoscaling for edit-check model. (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[09:37:43] <wikibugs>	 (PS11) Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211)
[09:38:30] <wikibugs>	 (PS1) Hnowlan: thumbor: change haproxy load balancing algorithm to leastconn [deployment-charts] - https://gerrit.wikimedia.org/r/1171996 (https://phabricator.wikimedia.org/T392348)
[09:38:35] <wikibugs>	 (PS3) Gkyziridis: ml-services: Configure autoscaling for edit-check model. [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162)
[09:40:22] <wikibugs>	 (CR) Clément Goubert: [C:+2] thumbor: Lower thumbor_workers, more memory [deployment-charts] - https://gerrit.wikimedia.org/r/1171982 (https://phabricator.wikimedia.org/T392348) (owner: Clément Goubert)
[09:41:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P79717 and previous config saved to /var/cache/conftool/dbconfig/20250723-094149-fceratto.json
[09:46:38] <claime>	 jouncebot: nowandnext
[09:46:38] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[09:46:38] <jouncebot>	 In 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1000)
[09:47:45] <wikibugs>	 (Merged) jenkins-bot: thumbor: Lower thumbor_workers, more memory [deployment-charts] - https://gerrit.wikimedia.org/r/1171982 (https://phabricator.wikimedia.org/T392348) (owner: Clément Goubert)
[09:49:08] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[09:52:35] <wikibugs>	 (PS1) Elukey: redfish: improve is_uefi for Supermicro [software/spicerack] - https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948)
[09:54:14] <wikibugs>	 (PS1) Ayounsi: k8s: replace legacy codfw vlans with future legacy eqiad vlans [puppet] - https://gerrit.wikimedia.org/r/1172001
[09:54:49] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:56:42] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:56:42] <wikibugs>	 (CR) Elukey: "Just realized that we are missing the test for dell, adding it." [software/spicerack] - https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: Elukey)
[09:56:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P79718 and previous config saved to /var/cache/conftool/dbconfig/20250723-095656-fceratto.json
[09:59:02] <wikibugs>	 (CR) Elukey: "Correction, we already have it, all good :)" [software/spicerack] - https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: Elukey)
[09:59:43] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[09:59:53] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1000)
[10:01:02] <wikibugs>	 (CR) Elukey: [C:+1] "LGTM, but let's wait somebody from ServiceOps to confirm!" [puppet] - https://gerrit.wikimedia.org/r/1172001 (owner: Ayounsi)
[10:01:13] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[10:01:46] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:01:47] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release thumbor/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:01:55] <claime>	 yeah that's me, on it
[10:01:57] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[10:06:47] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release thumbor/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:10:20] <wikibugs>	 (PS1) Ayounsi: BGPPeers nodeSelector: remove old codfw rows, add future eqiad pods [deployment-charts] - https://gerrit.wikimedia.org/r/1172004 (https://phabricator.wikimedia.org/T333948)
[10:11:20] <wikibugs>	 (PS2) Ayounsi: BGPPeers nodeSelector: remove old codfw rows, add future eqiad pods [deployment-charts] - https://gerrit.wikimedia.org/r/1172004 (https://phabricator.wikimedia.org/T333948)
[10:11:20] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: Elukey)
[10:12:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T399728)', diff saved to https://phabricator.wikimedia.org/P79719 and previous config saved to /var/cache/conftool/dbconfig/20250723-101204-fceratto.json
[10:12:09] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[10:12:19] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2180.codfw.wmnet with reason: Maintenance
[10:12:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T399728)', diff saved to https://phabricator.wikimedia.org/P79720 and previous config saved to /var/cache/conftool/dbconfig/20250723-101226-fceratto.json
[10:13:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T399728)', diff saved to https://phabricator.wikimedia.org/P79721 and previous config saved to /var/cache/conftool/dbconfig/20250723-101358-fceratto.json
[10:16:04] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[10:17:41] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[10:21:20] <wikibugs>	 (CR) Arnaudb: [C:+2] Gitlab: switchover between gitlab-replica-a and gitlab-replica-b [dns] - https://gerrit.wikimedia.org/r/1171537 (https://phabricator.wikimedia.org/T400121) (owner: Arnaudb)
[10:21:49] <logmsgbot>	 !log arnaudb@dns1004 START - running authdns-update
[10:22:25] <logmsgbot>	 arnaudb@cumin1003 failover (PID 3057135) is awaiting input
[10:22:30] <wikibugs>	 (PS13) Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[10:23:09] <logmsgbot>	 !log arnaudb@dns1004 END - running authdns-update
[10:23:28] <logmsgbot>	 !log arnaudb@cumin1003 START - Cookbook sre.dns.wipe-cache 'https://gitlab-replica-a.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors
[10:23:32] <logmsgbot>	 !log arnaudb@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'https://gitlab-replica-a.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors
[10:24:04] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205#11027210 (cmooney) Ok so I drained cr2-codfw of traffic and tried issuing the commands.  Commands as supplied by Juniper aren't 100% correct either which is reassuring when medd...
[10:24:28] <wikibugs>	 (PS1) Jcrespo: mariadb: Upgrade db1171 backup source MariaDB package to 10.11 [puppet] - https://gerrit.wikimedia.org/r/1172008 (https://phabricator.wikimedia.org/T399955)
[10:24:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[10:25:22] <wikibugs>	 (CR) Ilias Sarantopoulos: "I'd suggest to decouple the admin_ng changes from the edit_check changes in separate patches as they refer to different deployments" [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162) (owner: Gkyziridis)
[10:26:39] <wikibugs>	 Puppet, SRE, Beta-Cluster-Infrastructure: Puppet configures kernel.core_pattern |/usr/lib/systemd/systemd-coredump, but systemd-coredump is not installed - https://phabricator.wikimedia.org/T400247 (Lucas_Werkmeister_WMDE) NEW
[10:27:50] <logmsgbot>	 !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.failover (exit_code=0) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org
[10:28:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:29:00] <wikibugs>	 (PS4) Gkyziridis: ml-services: Configure autoscaling for edit-check model. [deployment-charts] - https://gerrit.wikimedia.org/r/1171991 (https://phabricator.wikimedia.org/T400162)
[10:29:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P79722 and previous config saved to /var/cache/conftool/dbconfig/20250723-102905-fceratto.json
[10:29:21] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: upgrade mariadb
[10:29:31] <wikibugs>	 SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11027254 (Vgutierrez) > For now, we might also want to check for a mw session token instead. Please correct me if I’m wrong, but in this case, validation is just a matter of checking whether the token is present or not....
[10:30:54] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): systemd::coredump: Install systemd-coredump iff enabled [puppet] - https://gerrit.wikimedia.org/r/1172010 (https://phabricator.wikimedia.org/T400247)
[10:31:32] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM, probably best to get someone more familiar with it to check too but it's simple enough." [deployment-charts] - https://gerrit.wikimedia.org/r/1172004 (https://phabricator.wikimedia.org/T333948) (owner: Ayounsi)
[10:33:22] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update RRLA and RRML images [deployment-charts] - https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437)
[10:34:25] <wikibugs>	 (CR) Jcrespo: [C:+2] mariadb: Upgrade db1171 backup source MariaDB package to 10.11 [puppet] - https://gerrit.wikimedia.org/r/1172008 (https://phabricator.wikimedia.org/T399955) (owner: Jcrespo)
[10:34:48] <wikibugs>	 (PS14) Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[10:35:01] <wikibugs>	 (CR) Hnowlan: [C:+2] thumbor: change haproxy load balancing algorithm to leastconn [deployment-charts] - https://gerrit.wikimedia.org/r/1171996 (https://phabricator.wikimedia.org/T392348) (owner: Hnowlan)
[10:36:19] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:36:21] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:36:41] <wikibugs>	 (Merged) jenkins-bot: thumbor: change haproxy load balancing algorithm to leastconn [deployment-charts] - https://gerrit.wikimedia.org/r/1171996 (https://phabricator.wikimedia.org/T392348) (owner: Hnowlan)
[10:37:19] <wikibugs>	 (PS15) Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[10:37:37] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:38:45] <wikibugs>	 (PS1) Jcrespo: mariadb: Upgrade db2198 backup source MariaDB package to 10.11 [puppet] - https://gerrit.wikimedia.org/r/1172013 (https://phabricator.wikimedia.org/T399955)
[10:38:55] <wikibugs>	 (PS1) Volans: redfish: improve iDRAC 10 support [software/spicerack] - https://gerrit.wikimedia.org/r/1172014
[10:39:05] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "Perhaps we were still using Debian 8 (Jessie) when this puppet class was first written? If I’m reading the Debian archives correctly, the " [puppet] - https://gerrit.wikimedia.org/r/1172010 (https://phabricator.wikimedia.org/T400247) (owner: Lucas Werkmeister (WMDE))
[10:39:59] <wikibugs>	 (PS1) Jelto: gitlab failover: improve message for API token [cookbooks] - https://gerrit.wikimedia.org/r/1172015 (https://phabricator.wikimedia.org/T400121)
[10:40:03] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[10:40:10] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[10:41:04] <wikibugs>	 (CR) Elukey: [C:+1] redfish: improve iDRAC 10 support [software/spicerack] - https://gerrit.wikimedia.org/r/1172014 (owner: Volans)
[10:42:59] <wikibugs>	 (CR) Vgutierrez: hcaptcha::proxy: use mtail for nginx- metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli)
[10:43:02] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[10:44:04] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: upgrade mariadb
[10:44:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P79723 and previous config saved to /var/cache/conftool/dbconfig/20250723-104412-fceratto.json
[10:45:49] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[10:47:39] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[10:49:13] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[10:49:24] <wikibugs>	 (CR) Clément Goubert: [C:+1] "LGTM, bare metal wikikube in codfw is completely migrated to the new switches. Maybe needs a check for the other clusters?" [puppet] - https://gerrit.wikimedia.org/r/1172001 (owner: Ayounsi)
[10:51:15] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1171.eqiad.wmnet
[10:51:16] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1171.eqiad.wmnet
[10:51:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[10:53:22] <wikibugs>	 (PS1) Máté Szabó: Enable wgWikimediaEventsCreateAccountInstrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/1172016 (https://phabricator.wikimedia.org/T394744)
[10:54:01] <wikibugs>	 (CR) Bartosz Wójtowicz: ml-services: update RRLA and RRML images (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: Kevin Bazira)
[10:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:56:23] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[10:56:56] <topranks>	 seen !log un-drain cr2-codfw of traffic after executing juniper commands to resolve stats issue T400205
[10:56:57] <stashbot>	 T400205: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205
[10:57:35] <wikibugs>	 (CR) Jcrespo: [C:+2] mariadb: Upgrade db2198 backup source MariaDB package to 10.11 [puppet] - https://gerrit.wikimedia.org/r/1172013 (https://phabricator.wikimedia.org/T399955) (owner: Jcrespo)
[10:57:39] <wikibugs>	 (CR) Kevin Bazira: ml-services: update RRLA and RRML images (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: Kevin Bazira)
[10:58:51] <wikibugs>	 (CR) Clément Goubert: [C:+1] "LGTM, the "pod" naming, while apparently standard in networking (I read the task!), could get a little confusing wrt to kubernetes, but si" [deployment-charts] - https://gerrit.wikimedia.org/r/1172004 (https://phabricator.wikimedia.org/T333948) (owner: Ayounsi)
[10:59:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T399728)', diff saved to https://phabricator.wikimedia.org/P79724 and previous config saved to /var/cache/conftool/dbconfig/20250723-105919-fceratto.json
[10:59:24] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[10:59:35] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2193.codfw.wmnet with reason: Maintenance
[10:59:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T399728)', diff saved to https://phabricator.wikimedia.org/P79725 and previous config saved to /var/cache/conftool/dbconfig/20250723-105941-fceratto.json
[11:00:05] <jouncebot>	 mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1100).
[11:02:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T399728)', diff saved to https://phabricator.wikimedia.org/P79726 and previous config saved to /var/cache/conftool/dbconfig/20250723-110217-fceratto.json
[11:02:42] <wikibugs>	 (PS16) Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[11:03:11] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:04:50] <wikibugs>	 (PS17) Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[11:04:50] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "> Perhaps we were still using Debian 8 (Jessie) when this puppet class was first written?" [puppet] - https://gerrit.wikimedia.org/r/1172010 (https://phabricator.wikimedia.org/T400247) (owner: Lucas Werkmeister (WMDE))
[11:06:29] <wikibugs>	 (PS1) Majavah: team-wmcs: neutron: Stop using min_over_time [alerts] - https://gerrit.wikimedia.org/r/1172018 (https://phabricator.wikimedia.org/T399705)
[11:06:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:07:09] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2198.codfw.wmnet
[11:07:10] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2198.codfw.wmnet
[11:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:13:24] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:14:06] <wikibugs>	 (PS1) Jcrespo: dbbackups: Upgrade dbprov1006 and dbprov2006 to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1172021 (https://phabricator.wikimedia.org/T394487)
[11:14:46] <wikibugs>	 (CR) Jcrespo: [C:+2] dbbackups: Upgrade dbprov1006 and dbprov2006 to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1172021 (https://phabricator.wikimedia.org/T394487) (owner: Jcrespo)
[11:17:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P79727 and previous config saved to /var/cache/conftool/dbconfig/20250723-111725-fceratto.json
[11:23:40] <wikibugs>	 (CR) Bartosz Wójtowicz: ml-services: update RRLA and RRML images (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: Kevin Bazira)
[11:32:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P79728 and previous config saved to /var/cache/conftool/dbconfig/20250723-113233-fceratto.json
[11:35:37] <wikibugs>	 (CR) Arnaudb: [C:+1] "looks good to me! thanks for the modification" [cookbooks] - https://gerrit.wikimedia.org/r/1172015 (https://phabricator.wikimedia.org/T400121) (owner: Jelto)
[11:45:47] <wikibugs>	 (PS12) Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211)
[11:47:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T399728)', diff saved to https://phabricator.wikimedia.org/P79729 and previous config saved to /var/cache/conftool/dbconfig/20250723-114740-fceratto.json
[11:47:45] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[11:47:56] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2197.codfw.wmnet with reason: Maintenance
[11:48:16] <wikibugs>	 (CR) CI reject: [V:-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli)
[11:48:46] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2214.codfw.wmnet with reason: Maintenance
[11:48:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T399728)', diff saved to https://phabricator.wikimedia.org/P79730 and previous config saved to /var/cache/conftool/dbconfig/20250723-114853-fceratto.json
[11:51:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T399728)', diff saved to https://phabricator.wikimedia.org/P79731 and previous config saved to /var/cache/conftool/dbconfig/20250723-115137-fceratto.json
[11:52:56] <wikibugs>	 (CR) Jelto: [C:+2] gitlab failover: improve message for API token [cookbooks] - https://gerrit.wikimedia.org/r/1172015 (https://phabricator.wikimedia.org/T400121) (owner: Jelto)
[11:54:35] <wikibugs>	 (PS2) Ayounsi: k8s: replace legacy codfw vlans with future legacy eqiad vlans [puppet] - https://gerrit.wikimedia.org/r/1172001
[11:55:04] <icinga-wm>	 PROBLEM - Host an-worker1179 is DOWN: PING CRITICAL - Packet loss = 100%
[11:56:55] <wikibugs>	 (CR) Kosta Harlan: [C:+1] Enable wgWikimediaEventsCreateAccountInstrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/1172016 (https://phabricator.wikimedia.org/T394744) (owner: Máté Szabó)
[11:57:38] <wikibugs>	 (CR) Clément Goubert: [C:+1] k8s: replace legacy codfw vlans with future legacy eqiad vlans (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1172001 (owner: Ayounsi)
[11:57:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T400061#11027513 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[11:58:56] <icinga-wm>	 PROBLEM - Host cp1106 is DOWN: PING CRITICAL - Packet loss = 100%
[11:59:06] <wikibugs>	 (CR) Ayounsi: "The full list of hosts still on the old vlans are there : https://netbox.wikimedia.org/extras/scripts/results/221711/ from a quick look th" [puppet] - https://gerrit.wikimedia.org/r/1172001 (owner: Ayounsi)
[11:59:34] <icinga-wm>	 RECOVERY - Host an-worker1179 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[11:59:49] <wikibugs>	 (Merged) jenkins-bot: gitlab failover: improve message for API token [cookbooks] - https://gerrit.wikimedia.org/r/1172015 (https://phabricator.wikimedia.org/T400121) (owner: Jelto)
[12:02:26] <icinga-wm>	 RECOVERY - Host cp1106 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[12:03:58] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[12:04:20] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp1106 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[12:05:06] <icinga-wm>	 PROBLEM - haproxy process on cp1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[12:05:20] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp1106 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2025-09-15 06:00:30 +0000 (expires in 53 days) https://wikitech.wikimedia.org/wiki/HTTPS
[12:05:58] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-09-09 23:59:37 +0000 (expires in 48 days) https://wikitech.wikimedia.org/wiki/HTTPS
[12:06:06] <icinga-wm>	 RECOVERY - haproxy process on cp1106 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[12:06:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P79732 and previous config saved to /var/cache/conftool/dbconfig/20250723-120645-fceratto.json
[12:06:45] <wikibugs>	 (PS1) Jelto: Gitlab: switchover from gitlab2002 to gitlab1004 [puppet] - https://gerrit.wikimedia.org/r/1172026 (https://phabricator.wikimedia.org/T400252)
[12:07:29] <wikibugs>	 (CR) FNegri: [C:+1] team-wmcs: neutron: Stop using min_over_time [alerts] - https://gerrit.wikimedia.org/r/1172018 (https://phabricator.wikimedia.org/T399705) (owner: Majavah)
[12:08:49] <wikibugs>	 (CR) Arnaudb: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1172026 (https://phabricator.wikimedia.org/T400252) (owner: Jelto)
[12:11:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:12:27] <wikibugs>	 (CR) Majavah: [C:+2] team-wmcs: neutron: Stop using min_over_time [alerts] - https://gerrit.wikimedia.org/r/1172018 (https://phabricator.wikimedia.org/T399705) (owner: Majavah)
[12:12:51] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Link errors: ssw1-d1-codfw <-> ssw1-f1-codfw - https://phabricator.wikimedia.org/T400253 (cmooney) NEW p:Triage→Medium
[12:14:17] <wikibugs>	 (Merged) jenkins-bot: team-wmcs: neutron: Stop using min_over_time [alerts] - https://gerrit.wikimedia.org/r/1172018 (https://phabricator.wikimedia.org/T399705) (owner: Majavah)
[12:15:04] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:21:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P79733 and previous config saved to /var/cache/conftool/dbconfig/20250723-122152-fceratto.json
[12:23:12] <wikibugs>	 (03CR) 10Jelto: [C:04-1] "thanks for the review! This should not be merged before the cookbook run" [puppet] - 10https://gerrit.wikimedia.org/r/1172026 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto)
[12:23:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops: decom cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T400157#11027566 (Jclark-ctr) Open→Resolved
[12:23:50] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:24:30] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:25:44] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:25:45] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:27:03] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:27:20] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:27:35] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:27:54] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:28:37] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:28:39] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:28:45] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur)
[12:28:56] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:29:14] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:29:27] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:31:46] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:32:10] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:36:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11027638 (10Jclark-ctr) @VRiley-WMF  When you get a chance, can you update the ticket with the cable lengths you've come up with? Thanks!
[12:36:59] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11027639 (10Jclark-ctr) Received replacement drive. Btullis is off tomorrow should be able to swap tomorrow
[12:37:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T399728)', diff saved to https://phabricator.wikimedia.org/P79734 and previous config saved to /var/cache/conftool/dbconfig/20250723-123659-fceratto.json
[12:37:05] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[12:37:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2217.codfw.wmnet with reason: Maintenance
[12:37:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T399728)', diff saved to https://phabricator.wikimedia.org/P79735 and previous config saved to /var/cache/conftool/dbconfig/20250723-123722-fceratto.json
[12:40:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T399728)', diff saved to https://phabricator.wikimedia.org/P79736 and previous config saved to /var/cache/conftool/dbconfig/20250723-124003-fceratto.json
[12:45:43] <wikibugs>	 (03CR) 10Herron: [C:03+2] role::titan: install promtool [puppet] - 10https://gerrit.wikimedia.org/r/1171591 (https://phabricator.wikimedia.org/T349521) (owner: 10Herron)
[12:48:53] <wikibugs>	 (03PS1) 10Jelto: Gitlab: switchover from gitlab2002 to gitlab1004 [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252)
[12:51:07] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] Gitlab: switchover from gitlab2002 to gitlab1004 [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto)
[12:51:19] <wikibugs>	 (03CR) 10Arnaudb: Gitlab: switchover from gitlab2002 to gitlab1004 [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto)
[12:52:39] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "lgtm, sorry for the accidental +2" [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto)
[12:52:45] <wikibugs>	 (03CR) 10Jelto: [C:04-1] "This should not be merged before the cookbook run" [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto)
[12:55:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P79737 and previous config saved to /var/cache/conftool/dbconfig/20250723-125510-fceratto.json
[12:57:38] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[12:57:50] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[12:58:44] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:07] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:00:27] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "LGTM, you also need to add 64613 to config/sites.yaml codfw: -> customers:" [homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037) (owner: 10Cathal Mooney)
[13:00:43] * Lucas_WMDE also sees no gerrit patches in the deployment calendar
[13:01:49] <James_F>	 jouncebot: refresh
[13:01:50] <jouncebot>	 I refreshed my knowledge about deployments.
[13:01:56] <James_F>	 jouncebot: nowandnext
[13:01:56] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1300)
[13:01:56] <jouncebot>	 In 1 hour(s) and 28 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1430)
[13:02:00] <James_F>	 Excellent.
[13:04:51] <wikibugs>	 (03PS13) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211)
[13:06:50] <wikibugs>	 (03PS14) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211)
[13:07:31] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] redfish: improve is_uefi for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey)
[13:10:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P79738 and previous config saved to /var/cache/conftool/dbconfig/20250723-131018-fceratto.json
[13:10:51] <wikibugs>	 (03PS14) 10Fabfur: haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941)
[13:11:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:45] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur)
[13:14:56] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur)
[13:15:04] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:16:07] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli)
[13:18:22] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "I don't see any host from kubernetes clusters." [puppet] - 10https://gerrit.wikimedia.org/r/1172001 (owner: 10Ayounsi)
[13:20:20] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli)
[13:21:02] <wikibugs>	 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11027750 (10elukey) @DLynch Hi! Gentle ping :)
[13:22:40] <wikibugs>	 (03PS1) 10Federico Ceratto: Add wmfmariadbpy package generation [puppet] - 10https://gerrit.wikimedia.org/r/1172025 (https://phabricator.wikimedia.org/T397305)
[13:25:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T399728)', diff saved to https://phabricator.wikimedia.org/P79739 and previous config saved to /var/cache/conftool/dbconfig/20250723-132525-fceratto.json
[13:25:30] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[13:25:41] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2224.codfw.wmnet with reason: Maintenance
[13:25:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T399728)', diff saved to https://phabricator.wikimedia.org/P79740 and previous config saved to /var/cache/conftool/dbconfig/20250723-132548-fceratto.json
[13:26:18] <wikibugs>	 (03PS3) 10Federico Ceratto: Add MariaDB test-s8 section VMs [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087)
[13:26:18] <wikibugs>	 (03CR) 10Federico Ceratto: "Prepare deployment of test DB hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) (owner: 10Federico Ceratto)
[13:27:36] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[13:28:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:28:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T399728)', diff saved to https://phabricator.wikimedia.org/P79741 and previous config saved to /var/cache/conftool/dbconfig/20250723-132831-fceratto.json
[13:32:54] <mszabo>	 jouncebot: nowandnext
[13:32:54] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1300)
[13:32:54] <jouncebot>	 In 0 hour(s) and 57 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1430)
[13:33:19] <logmsgbot>	 ayounsi@cumin1003 netbox (PID 3084578) is awaiting input
[13:34:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172016 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó)
[13:34:54] <wikibugs>	 (03PS1) 10Vgutierrez: site,lvs,cumin: Stop using lvs1013 as liberica canary instance [puppet] - 10https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259)
[13:35:03] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ssw1-d1-eqiad mgmt - ayounsi@cumin1003"
[13:35:08] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ssw1-d1-eqiad mgmt - ayounsi@cumin1003"
[13:35:08] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:35:44] <wikibugs>	 (03Merged) 10jenkins-bot: Enable wgWikimediaEventsCreateAccountInstrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172016 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó)
[13:35:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Homer: PyEz "ignore_warnings" does not work for port-block speed change warning - https://phabricator.wikimedia.org/T400261 (cmooney) NEW p:Triage→Medium
[13:35:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[13:36:21] <logmsgbot>	 !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1172016|Enable wgWikimediaEventsCreateAccountInstrumentation (T394744)]]
[13:36:26] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[13:37:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] site,lvs,cumin: Stop using lvs1013 as liberica canary instance [puppet] - 10https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) (owner: 10Vgutierrez)
[13:38:35] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1172016|Enable wgWikimediaEventsCreateAccountInstrumentation (T394744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:39:21] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2030.codfw.wmnet with OS bookworm
[13:40:03] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Continuing with sync
[13:42:46] <wikibugs>	 (03PS1) 10Cathal Mooney: JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261)
[13:43:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P79743 and previous config saved to /var/cache/conftool/dbconfig/20250723-134338-fceratto.json
[13:44:58] <wikibugs>	 (03PS2) 10Cathal Mooney: JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261)
[13:45:53] <logmsgbot>	 !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172016|Enable wgWikimediaEventsCreateAccountInstrumentation (T394744)]] (duration: 09m 31s)
[13:45:58] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[13:47:23] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "lgtm!" [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney)
[13:48:30] <wikibugs>	 10SRE-SLO, 10Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071#11027868 (10herron) This morning I've done:  ` herron@prometheus1005:~/tmp/backfill/tonecheck$ time promtool tsdb create-blocks-from rules --start=2025-07-01T00:00:00Z --end=2025-07-02T00:00...
[13:50:54] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#11027873 (10jijiki) Hey folks, I ran into this issue myself, having CI failing my patches over and over again.  ` py2-pep8: skipped because could...
[13:52:05] <wikibugs>	 (CR) Kevin Bazira: ml-services: update RRLA and RRML images (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1172011 (https://phabricator.wikimedia.org/T399437) (owner: Kevin Bazira)
[13:56:41] <wikibugs>	 06SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11027886 (10Joe) >>! In T400238#11027254, @Vgutierrez wrote: >> For now, we might also want to check for a mw session token instead. > Please correct me if I’m wrong, but in this case, validation is just a matter of checki...
[13:58:31] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2030.codfw.wmnet with reason: host reimage
[13:58:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney)
[13:58:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P79745 and previous config saved to /var/cache/conftool/dbconfig/20250723-135846-fceratto.json
[13:59:08] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:59:50] <wikibugs>	 06SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11027907 (10Vgutierrez) it's not uncommon to have several keys in place at any given point in time, it should be fine in terms of performance as long as we keep it under control
[14:01:53] <XioNoX>	 swfrench-wmf, urandom, all smooth
[14:02:21] <logmsgbot>	 elukey@cumin1003 provision (PID 3087619) is awaiting input
[14:02:33] <swfrench-wmf>	 thanks, XioNoX!
[14:03:09] <urandom>	 thanks!
[14:03:09] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "runbooks need some work but this can be merged (please fix the commit message typo)" [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[14:03:15] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:03:31] <wikibugs>	 (03PS18) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[14:03:32] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2030.codfw.wmnet with reason: host reimage
[14:04:03] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:04:33] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:05:16] <wikibugs>	 (03PS19) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[14:05:39] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:05:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:06:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:08:19] <wikibugs>	 (03PS20) 10Elukey: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[14:10:17] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11027982 (10Samwalton9-WMF) Novem is a productive and capable volunteer developer and I think he can be trusted with this access.
[14:13:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T399728)', diff saved to https://phabricator.wikimedia.org/P79746 and previous config saved to /var/cache/conftool/dbconfig/20250723-141353-fceratto.json
[14:14:00] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[14:16:57] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:17:18] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:18:48] <wikibugs>	 (03PS2) 10Elukey: redfish: improve is_uefi for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948)
[14:19:02] <wikibugs>	 (CR) Elukey: redfish: improve is_uefi for Supermicro (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: Elukey)
[14:19:54] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey)
[14:21:36] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Good catch, and thank you for doing that!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff)
[14:21:54] <wikibugs>	 10SRE-SLO, 10EditCheck, 10Lift-Wing, 06Machine-Learning-Team, 10Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11028049 (10elukey)
[14:22:39] <wikibugs>	 (03PS3) 10Cathal Mooney: JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261)
[14:25:53] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:26:34] <wikibugs>	 06SRE, 10Hiddenparma, 06Traffic: Browser behaviour detection at the edge - https://phabricator.wikimedia.org/T400270 (10Joe) 03NEW
[14:26:58] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:27:11] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:30:06] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1430)
[14:31:50] <wikibugs>	 (03PS1) 10Bking: mw-content-history-reconcile-enrich: increase jobmanager.memory.off-heap.size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172053 (https://phabricator.wikimedia.org/T395984)
[14:34:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney)
[14:37:48] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2030.codfw.wmnet with OS bookworm
[14:38:02] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff)
[14:38:12] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff)
[14:38:48] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2031.codfw.wmnet with OS bookworm
[14:40:09] <wikibugs>	 (03PS11) 10Fabfur: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039)
[14:40:20] <wikibugs>	 (03CR) 10Fabfur: "ack, added some extra info to that page too" [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[14:40:24] <wikibugs>	 06SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11028148 (10Vgutierrez) @Tgr would it be possible to perform some lightweight validation of current MediaWiki session tokens? For example, checking whether the token has a specific length, or whether it's valid base64 / ba...
[14:41:52] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[14:43:26] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11028165 (10Jhancock.wm) @Marostegui lemme know when you want to do es2036
[14:43:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028166 (10dancy) Thanks for the fixes @Scott_French !
[14:46:34] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[14:46:50] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[14:48:29] <wikibugs>	 (03PS2) 10Vgutierrez: site,lvs,cumin: Stop using lvs1013 as liberica canary instance [puppet] - 10https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259)
[14:48:39] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] mw-content-history-reconcile-enrich: increase jobmanager.memory.off-heap.size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172053 (https://phabricator.wikimedia.org/T395984) (owner: 10Bking)
[14:49:02] <wikibugs>	 (03PS10) 10Tiziano Fogli: nrpe wrapper: add wrapper to be invoked a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446)
[14:49:02] <wikibugs>	 (03CR) 10Tiziano Fogli: "This patch is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli)
[14:52:56] <wikibugs>	 (03CR) 10Herron: [C:03+1] nrpe wrapper: add wrapper to be invoked a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli)
[14:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:55:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028177 (10Scott_French)
[14:58:28] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2031.codfw.wmnet with reason: host reimage
[15:02:17] <wikibugs>	 06SRE, 10Hiddenparma, 06Traffic: Browser behaviour detection at the edge - https://phabricator.wikimedia.org/T400270#11028212 (10Joe)
[15:03:13] <wikibugs>	 (03PS1) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[15:03:23] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "LGTM, I would just mention in the commit message the change in the test but it's a [nit]" [puppet] - 10https://gerrit.wikimedia.org/r/1172036 (https://phabricator.wikimedia.org/T400259) (owner: 10Vgutierrez)
[15:03:34] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2031.codfw.wmnet with reason: host reimage
[15:04:09] <wikibugs>	 (03PS2) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:09:40] <wikibugs>	 (03PS3) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[15:10:04] <wikibugs>	 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275 (10RobH) 03NEW
[15:10:28] <wikibugs>	 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11028250 (10RobH)
[15:11:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:11:42] <wikibugs>	 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11028268 (10RobH) a:03Jgreen @Jgreen,  As discussed in IRC, I'm assigning this over to you to double-check the assumed hostnames and update the racking details as you see f...
[15:12:54] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277 (10ttaylor) 03NEW
[15:12:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:13:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11028283 (10Jhancock.wm) quick update on one of the last things in this list. Cyrus One is still working on getting us a badge reader for the door. I opened a ticket with them on the 15th. Th...
[15:13:19] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2148.codfw.wmnet with reason: Maintenance
[15:13:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T399728)', diff saved to https://phabricator.wikimedia.org/P79749 and previous config saved to /var/cache/conftool/dbconfig/20250723-151325-fceratto.json
[15:13:31] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[15:14:56] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11028287 (10calbon) I approve this request.
[15:15:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11028288 (10ttaylor) I probably have some of these perms/group memberships but not all of them, and I have a new ssh key for this purpose.
[15:16:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:16:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T399728)', diff saved to https://phabricator.wikimedia.org/P79750 and previous config saved to /var/cache/conftool/dbconfig/20250723-151630-fceratto.json
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:17:19] <fabfur>	  !log restarted haproxykafka on cp3071 due to unavailability
[15:17:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:21:31] <wikibugs>	 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11028308 (10MatthewVernon) I don't think there's anything more I can do here, I'm afraid.
[15:30:00] <wikibugs>	 (03PS2) 10JHathaway: reposync: don't enforce ownership after init [puppet] - 10https://gerrit.wikimedia.org/r/993797
[15:30:40] <wikibugs>	 (03PS1) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[15:31:24] <wikibugs>	 06SRE, 06FR-donorrelations: Custom URL for survey pop-up - https://phabricator.wikimedia.org/T400278 (10EBrill-WMF) 03NEW
[15:31:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P79751 and previous config saved to /var/cache/conftool/dbconfig/20250723-153137-fceratto.json
[15:31:44] <wikibugs>	 (03PS2) 10Stevemunene: dse-k8s: deploy etcd service [puppet] - 10https://gerrit.wikimedia.org/r/1171584 (https://phabricator.wikimedia.org/T397293)
[15:31:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[15:33:13] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frpig2001 pay-lvs2001 pay-lvs2002 - https://phabricator.wikimedia.org/T397868#11028369 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[15:33:48] <wikibugs>	 (03PS4) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[15:36:20] <wikibugs>	 (03PS5) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[15:37:04] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6402/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[15:37:40] <wikibugs>	 (03PS2) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[15:38:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[15:38:50] <wikibugs>	 (03PS6) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[15:39:39] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6403/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[15:39:49] <wikibugs>	 (03PS2) 10Bking: mw-content-history-reconcile-enrich: increase jobmanager.memory.off-heap.size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172053 (https://phabricator.wikimedia.org/T397330)
[15:41:20] <wikibugs>	 (03PS7) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[15:42:05] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6404/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[15:45:37] <wikibugs>	 (03PS3) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[15:46:29] <wikibugs>	 (03CR) 10Bking: [C:03+2] mw-content-history-reconcile-enrich: increase jobmanager.memory.off-heap.size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172053 (https://phabricator.wikimedia.org/T397330) (owner: 10Bking)
[15:46:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P79752 and previous config saved to /var/cache/conftool/dbconfig/20250723-154645-fceratto.json
[15:47:26] <wikibugs>	 (03PS8) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[15:47:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[15:48:11] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6405/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[15:49:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028456 (10Ottomata) Should we perhaps use `latest` tag for Gitlab CI images?  I suppose other things could break if the base image is silently upgraded between different pipeli...
[15:50:30] <wikibugs>	 (03PS4) 10Cathal Mooney: JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261)
[15:51:24] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] eventbus: register with team-data-engineering. [alerts] - 10https://gerrit.wikimedia.org/r/1168119 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena)
[15:51:33] <wikibugs>	 (03CR) 10Elukey: [C:03+2] redfish: improve is_uefi for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey)
[15:51:34] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] eventgate: alert on traffic deviation. [alerts] - 10https://gerrit.wikimedia.org/r/1167620 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena)
[15:51:49] <wikibugs>	 (03PS4) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[15:52:07] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:52:34] <wikibugs>	 (03Merged) 10jenkins-bot: eventbus: register with team-data-engineering. [alerts] - 10https://gerrit.wikimedia.org/r/1168119 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena)
[15:53:42] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate: alert on traffic deviation. [alerts] - 10https://gerrit.wikimedia.org/r/1167620 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena)
[15:53:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[15:58:04] <wikibugs>	 (03CR) 10Vgutierrez: haproxykafka: fixed missing site in dashboard link (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[15:59:55] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: improve is_uefi for Supermicro [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172000 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey)
[15:59:55] <wikibugs>	 (03PS9) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[16:00:15] <wikibugs>	 (03PS2) 10Volans: redfish: improve iDRAC 10 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172014
[16:00:38] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6406/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[16:01:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T399728)', diff saved to https://phabricator.wikimedia.org/P79753 and previous config saved to /var/cache/conftool/dbconfig/20250723-160152-fceratto.json
[16:01:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261) (owner: 10Cathal Mooney)
[16:01:58] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[16:02:08] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2175.codfw.wmnet with reason: Maintenance
[16:02:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T399728)', diff saved to https://phabricator.wikimedia.org/P79754 and previous config saved to /var/cache/conftool/dbconfig/20250723-160215-fceratto.json
[16:03:08] <wikibugs>	 (03PS5) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[16:03:52] <wikibugs>	 (03CR) 10Subramanya Sastry: "Works for me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian)
[16:04:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[16:04:22] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] Enable the "Report Visual Bug" feature of Extension:ParserMigration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian)
[16:05:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T399728)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250723-160516-fceratto.json
[16:05:37] <wikibugs>	 (03PS10) 10CDobbins: dnsrecursor: add dynamic forward_zones to recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[16:06:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028543 (10Scott_French) @Ottomata - So, image build workflows in CI that use the `latest` tag would still have been affected by this, but they would have...
[16:06:07] <wikibugs>	 (03PS6) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[16:06:22] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6407/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[16:06:42] <wikibugs>	 (03PS1) 10Ahmon Dancy: cli.py: The mode/action argument is required [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172064
[16:07:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[16:07:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11028557 (10wiki_willy) a:05Papaul→03Jhancock.wm Hi @Jhancock.wm - since @Papaul is out on sabbatical, can you take a look at this one?  It's related...
[16:09:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cli.py: The mode/action argument is required [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172064 (owner: 10Ahmon Dancy)
[16:11:13] <wikibugs>	 (03CR) 10Volans: [C:03+2] redfish: improve iDRAC 10 support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1172014 (owner: 10Volans)
[16:11:19] <wikibugs>	 (03PS11) 10CDobbins: dnsrecursor: remove hardcoded values and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[16:12:03] <wikibugs>	 (03PS7) 10Fabfur: haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039)
[16:13:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxykafka: fixed missing site in dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1172059 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur)
[16:14:31] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6408/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[16:17:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028586 (10Scott_French)
[16:20:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P79755 and previous config saved to /var/cache/conftool/dbconfig/20250723-162028-fceratto.json
[16:24:47] <wikibugs>	 (03PS12) 10CDobbins: dnsrecursor: remove hardcoded values and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[16:26:03] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[16:29:22] <wikibugs>	 (03PS1) 10Ahmon Dancy: tox.ini: Pass --diff to black [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172071
[16:29:39] <wikibugs>	 06SRE, 06Data-Engineering: WE 5.4 FY 25/26: Improve automata detection at the edge and pass it to the refinery pipeline - https://phabricator.wikimedia.org/T396562#11028642 (10Milimetric) Data Engineering is ready to do or help with this work whenever you need.
[16:30:14] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2031.codfw.wmnet with OS bookworm
[16:31:11] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2032.codfw.wmnet with OS bookworm
[16:31:27] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:32:54] <wikibugs>	 (03PS2) 10Ahmon Dancy: cli.py: The mode/action argument is required [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172064
[16:35:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P79756 and previous config saved to /var/cache/conftool/dbconfig/20250723-163536-fceratto.json
[16:35:49] <wikibugs>	 (03PS13) 10CDobbins: dnsrecursor: remove hardcoded values and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[16:36:27] <jinxer-wm>	 RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:37:04] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[16:38:56] <wikibugs>	 (03PS1) 10Ahmon Dancy: cli.py: Improve UX when config file does not exist [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172072
[16:40:28] <wikibugs>	 (03PS5) 10Cathal Mooney: JunOS: pass ingore_warnings list to diff() and rollback() functions [software/homer] - 10https://gerrit.wikimedia.org/r/1172037 (https://phabricator.wikimedia.org/T400261)
[16:42:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:44:24] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[16:50:37] <wikibugs>	 (03PS14) 10CDobbins: dnsrecursor: remove hardcoded values and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[16:50:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T399728)', diff saved to https://phabricator.wikimedia.org/P79757 and previous config saved to /var/cache/conftool/dbconfig/20250723-165043-fceratto.json
[16:50:49] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[16:50:59] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2189.codfw.wmnet with reason: Maintenance
[16:51:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T399728)', diff saved to https://phabricator.wikimedia.org/P79758 and previous config saved to /var/cache/conftool/dbconfig/20250723-165106-fceratto.json
[16:51:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:52:01] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[16:53:31] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2032.codfw.wmnet with reason: host reimage
[16:54:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T399728)', diff saved to https://phabricator.wikimedia.org/P79759 and previous config saved to /var/cache/conftool/dbconfig/20250723-165407-fceratto.json
[16:56:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:57:05] <wikibugs>	 (03PS27) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[16:57:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:58:13] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[16:58:38] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2032.codfw.wmnet with reason: host reimage
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1700)
[17:04:42] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[17:04:50] <swfrench-wmf>	 I didn't get a chance to explicitly schedule it today, but I'll be deploying mediawiki shortly to pick up an image builder change
[17:08:28] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Deploy to remove php-ldap from debug images
[17:09:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P79761 and previous config saved to /var/cache/conftool/dbconfig/20250723-170915-fceratto.json
[17:10:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11028782 (10Ottomata) FTR, went with versioned tag for repeatability.
[17:10:36] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11028781 (10Krinkle) a:03Krinkle
[17:11:09] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Deploy to remove php-ldap from debug images (duration: 03m 29s)
[17:11:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:12:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11028786 (10elukey) @Jclark-ctr we have found a workaround for provisioning and reimage that seems to have worked for ml-serve1012, I'll have to do more tests so for th...
[17:17:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T399916#11028841 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:17:48] <wikibugs>	 (03PS7) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373
[17:22:41] <swfrench-wmf>	 !log deleted tags for docker-registry.discovery.wmnet/mediawiki-httpd-bookworm - T378128
[17:22:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:46] <stashbot>	 T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128
[17:23:59] <swfrench-wmf>	 !log deleted tags for docker-registry.discovery.wmnet/httpd-fcgi-bookworm - T378128
[17:24:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P79762 and previous config saved to /var/cache/conftool/dbconfig/20250723-172423-fceratto.json
[17:25:00] <swfrench-wmf>	 !log deleted tags for docker-registry.discovery.wmnet/httpd-bookworm - T378128
[17:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:27:39] <icinga-wm>	 PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 153346 MB (4% inode=99%): /var/lib/hadoop/data/m 156622 MB (4% inode=99%): /var/lib/hadoop/data/d 148261 MB (3% inode=99%): /var/lib/hadoop/data/b 153087 MB (4% inode=99%): /var/lib/hadoop/data/e 156701 MB (4% inode=99%): /var/lib/hadoop/data/g 157130 MB (4% inode=99%): /var/lib/hadoop/data/f 157566 MB (4% inode=99%): /var/lib/hadoop/data
[17:27:39] <icinga-wm>	 7 MB (4% inode=99%): /var/lib/hadoop/data/i 156062 MB (4% inode=99%): /var/lib/hadoop/data/j 158423 MB (4% inode=99%): /var/lib/hadoop/data/l 159446 MB (4% inode=99%): /var/lib/hadoop/data/c 153552 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops
[17:30:31] <icinga-wm>	 PROBLEM - Disk space on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 173357 MB (4% inode=99%): /var/lib/hadoop/data/d 175314 MB (4% inode=99%): /var/lib/hadoop/data/j 167486 MB (4% inode=99%): /var/lib/hadoop/data/f 177228 MB (4% inode=99%): /var/lib/hadoop/data/g 186771 MB (4% inode=99%): /var/lib/hadoop/data/i 173587 MB (4% inode=99%): /var/lib/hadoop/data/b 186891 MB (4% inode=99%): /var/lib/hadoop/data
[17:30:31] <icinga-wm>	 0 MB (4% inode=99%): /var/lib/hadoop/data/e 182198 MB (4% inode=99%): /var/lib/hadoop/data/h 149613 MB (3% inode=99%): /var/lib/hadoop/data/k 170259 MB (4% inode=99%): /var/lib/hadoop/data/m 184454 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1128&var-datasource=eqiad+prometheus/ops
[17:36:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:37:02] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 63516
[17:37:52] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 63516
[17:39:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T399728)', diff saved to https://phabricator.wikimedia.org/P79763 and previous config saved to /var/cache/conftool/dbconfig/20250723-173930-fceratto.json
[17:39:36] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[17:39:46] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2197.codfw.wmnet with reason: Maintenance
[17:40:37] <wikibugs>	 (03PS1) 10Ottomata: eventgate-*-external - bump to 1.16.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172079 (https://phabricator.wikimedia.org/T376026)
[17:40:57] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2207.codfw.wmnet with reason: Maintenance
[17:41:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T399728)', diff saved to https://phabricator.wikimedia.org/P79764 and previous config saved to /var/cache/conftool/dbconfig/20250723-174104-fceratto.json
[17:41:13] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 8309
[17:42:34] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8309
[17:43:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:44:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T399728)', diff saved to https://phabricator.wikimedia.org/P79765 and previous config saved to /var/cache/conftool/dbconfig/20250723-174405-fceratto.json
[17:48:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:49:29] <wikibugs>	 (03PS2) 10Cathal Mooney: Add ASN mapping and import policy for dse-k8s codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037)
[17:49:40] <wikibugs>	 (03CR) 10Cathal Mooney: "Ah good spot thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037) (owner: 10Cathal Mooney)
[17:51:57] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add ASN mapping and import policy for dse-k8s codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037) (owner: 10Cathal Mooney)
[17:52:28] <wikibugs>	 (03Merged) 10jenkins-bot: Add ASN mapping and import policy for dse-k8s codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037) (owner: 10Cathal Mooney)
[17:58:20] <wikibugs>	 (03PS1) 10Dzahn: Copied the global build ARGs from upstream docker file: [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172080 (https://phabricator.wikimedia.org/T268199)
[17:59:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P79766 and previous config saved to /var/cache/conftool/dbconfig/20250723-175912-fceratto.json
[17:59:44] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "this was copying global build ARGs from upstream docker file." [container/codesearch] - 10https://gerrit.wikimedia.org/r/1171715 (owner: 10Dzahn)
[18:00:04] <jouncebot>	 dduvall and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1800).
[18:00:53] <wikibugs>	 (03Abandoned) 10Dzahn: use /sbin/tini as entrypoint [container/codesearch] - 10https://gerrit.wikimedia.org/r/1171630 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[18:01:12] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Copied the global build ARGs from upstream docker file: [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172080 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[18:01:26] <wikibugs>	 (03Merged) 10jenkins-bot: Copied the global build ARGs from upstream docker file: [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172080 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn)
[18:03:52] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06SRE Observability: Logstash access - https://phabricator.wikimedia.org/T400288 (10HCoplin-WMF) 03NEW
[18:06:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:07:39] <icinga-wm>	 PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 155867 MB (4% inode=99%): /var/lib/hadoop/data/m 153377 MB (4% inode=99%): /var/lib/hadoop/data/d 151023 MB (4% inode=99%): /var/lib/hadoop/data/b 150224 MB (4% inode=99%): /var/lib/hadoop/data/e 152440 MB (4% inode=99%): /var/lib/hadoop/data/g 152904 MB (4% inode=99%): /var/lib/hadoop/data/f 153871 MB (4% inode=99%): /var/lib/hadoop/data
[18:07:39] <icinga-wm>	 4 MB (4% inode=99%): /var/lib/hadoop/data/i 152325 MB (4% inode=99%): /var/lib/hadoop/data/j 156451 MB (4% inode=99%): /var/lib/hadoop/data/l 154182 MB (4% inode=99%): /var/lib/hadoop/data/c 149200 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops
[18:11:15] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11029024 (10VRiley-WMF) So, looking at this, I believe the cable lengths would be the following. @Jclark-ctr would you agree?   | Connection Type                      | Est. Length | Quantity...
[18:11:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:11:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:14:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P79767 and previous config saved to /var/cache/conftool/dbconfig/20250723-181420-fceratto.json
[18:18:10] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2032.codfw.wmnet with OS bookworm
[18:19:02] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06SRE Observability: Logstash access for HCoplin - https://phabricator.wikimedia.org/T400288#11029050 (10Dzahn)
[18:21:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:21:46] <wikibugs>	 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11029054 (10herron) >>! In T349521#9706188, @fgiunchedi wrote: > Following up from a chat yesterday: >  > The idea of creating backfilled blocks...
[18:21:46] <wikibugs>	 (03PS2) 10Ottomata: eventgate-*-external - bump to 1.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172079 (https://phabricator.wikimedia.org/T376026)
[18:21:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:22:09] <wikibugs>	 (03PS1) 10Kosta Harlan: AuthManager: Move temp account login to continueAuthentication [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172082 (https://phabricator.wikimedia.org/T398270)
[18:26:43] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "Neat!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170616 (owner: 10Krinkle)
[18:27:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:29:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T399728)', diff saved to https://phabricator.wikimedia.org/P79768 and previous config saved to /var/cache/conftool/dbconfig/20250723-182928-fceratto.json
[18:29:33] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[18:29:44] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2225.codfw.wmnet with reason: Maintenance
[18:29:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2225 (T399728)', diff saved to https://phabricator.wikimedia.org/P79769 and previous config saved to /var/cache/conftool/dbconfig/20250723-182951-fceratto.json
[18:32:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T399728)', diff saved to https://phabricator.wikimedia.org/P79770 and previous config saved to /var/cache/conftool/dbconfig/20250723-183254-fceratto.json
[18:38:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11029120 (10Jclark-ctr) @VRiley-WMF I think that's a good start for cabling some might be short some might be a little long, but keep in mind that you lose over a meter in drop length from the...
[18:39:04] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] eventgate-*-external - bump to 1.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172079 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata)
[18:39:37] <wikibugs>	 (03CR) 10Ottomata: "Old patch, should we abandon?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert)
[18:40:52] <ottomata>	 dduvall: dancy hello again, is the train clear? :)
[18:41:16] <dduvall>	 ottomata: it is not. rolling now :)
[18:41:22] <ottomata>	 k will wait, ty!
[18:41:33] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate-*-external - bump to 1.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172079 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata)
[18:41:39] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172086 (https://phabricator.wikimedia.org/T396372)
[18:41:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172086 (https://phabricator.wikimedia.org/T396372) (owner: 10TrainBranchBot)
[18:42:32] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172086 (https://phabricator.wikimedia.org/T396372) (owner: 10TrainBranchBot)
[18:42:32] <ottomata>	 gonna go ahead and just do staging instances and some testing
[18:42:46] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[18:42:51] <dduvall>	 sounds good
[18:43:17] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[18:47:25] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[18:47:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:47:55] <logmsgbot>	 !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[18:48:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P79771 and previous config saved to /var/cache/conftool/dbconfig/20250723-184801-fceratto.json
[18:50:09] <logmsgbot>	 !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.11  refs T396372
[18:50:14] <stashbot>	 T396372: 1.45.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T396372
[18:51:52] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad
[18:52:36] <inflatador>	 !log depool eqiad in preparation for rolling restart T399162
[18:52:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:40] <stashbot>	 T399162: Regression: Cirrus exact string regexp search for insource:/"u.a."/ has stopped working - https://phabricator.wikimedia.org/T399162
[18:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:55:31] <dduvall>	 ottomata: all clear!
[18:57:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:58:23] <ottomata>	 ty!
[18:58:48] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[18:59:29] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[18:59:57] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[19:00:25] <ottomata>	 !log deploying eventgate-analytics-external and eventgate-logging-external to get meta.dt logic change - T376026
[19:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:30] <stashbot>	 T376026: Update event-producing tools to overwrite `meta.dt` - https://phabricator.wikimedia.org/T376026
[19:01:12] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[19:01:28] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227
[19:01:32] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[19:02:03] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20250723
[19:02:19] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[19:03:05] <logmsgbot>	 !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[19:03:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P79772 and previous config saved to /var/cache/conftool/dbconfig/20250723-190309-fceratto.json
[19:04:30] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[19:06:03] <logmsgbot>	 !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[19:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:11:23] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release 20250723
[19:12:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029224 (10Jdforrester-WMF) >>! In T383557#11026158, @Scott_French wrote: > I'm no longer seeing any references to bullseye-backports in puppet, so I belie...
[19:12:19] <kostajh>	 ottomata: let me know when you’re done please, as I’d like to deploy a MediaWiki patch
[19:14:08] <logmsgbot>	 !log bking@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227
[19:14:13] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[19:14:35] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227
[19:16:02] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20250723
[19:16:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029230 (10Scott_French) @Jdforrester-WMF - Basically, the rebuilds would need to start at the first image that depends on `docker-registry.discovery.wmnet...
[19:18:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T399728)', diff saved to https://phabricator.wikimedia.org/P79773 and previous config saved to /var/cache/conftool/dbconfig/20250723-191817-fceratto.json
[19:18:24] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[19:18:28] <wikibugs>	 (03PS1) 10RLazarus: deployment_server: Fix argparse double-dash handling in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1172090 (https://phabricator.wikimedia.org/T341553)
[19:18:34] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2226.codfw.wmnet with reason: Maintenance
[19:18:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2226 (T399728)', diff saved to https://phabricator.wikimedia.org/P79774 and previous config saved to /var/cache/conftool/dbconfig/20250723-191841-fceratto.json
[19:19:04] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] wmnet: Remove maintenance.eqiad.wmnet record [dns] - 10https://gerrit.wikimedia.org/r/1171983 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert)
[19:20:11] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029245 (10dancy) @Jdforrester-WMF I'll do the docker-pkg stuff and pass it by you for review.
[19:20:34] <logmsgbot>	 !log bking@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227
[19:20:39] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[19:21:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T399728)', diff saved to https://phabricator.wikimedia.org/P79775 and previous config saved to /var/cache/conftool/dbconfig/20250723-192136-fceratto.json
[19:21:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[19:24:01] <kostajh>	 jouncebot: nowandnext
[19:24:02] <jouncebot>	 For the next 0 hour(s) and 35 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T1800)
[19:24:02] <jouncebot>	 In 0 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T2000)
[19:24:58] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release 20250723
[19:25:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172082 (https://phabricator.wikimedia.org/T398270) (owner: 10Kosta Harlan)
[19:26:36] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release 20250723
[19:28:44] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227
[19:28:48] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[19:29:40] <wikibugs>	 (03Merged) 10jenkins-bot: AuthManager: Move temp account login to continueAuthentication [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172082 (https://phabricator.wikimedia.org/T398270) (owner: 10Kosta Harlan)
[19:29:48] <mutante>	 !log gitlab-runner* - apt-get upgrade - upgrading gitlab-runner, libgnutls30, ca-certificates
[19:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:05] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1172082|AuthManager: Move temp account login to continueAuthentication (T398270)]]
[19:30:10] <stashbot>	 T398270: Temp Account persists after logging in and out - https://phabricator.wikimedia.org/T398270
[19:32:18] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1172082|AuthManager: Move temp account login to continueAuthentication (T398270)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:33:47] <wikibugs>	 (03CR) 10RLazarus: "This has the side effect that" [puppet] - 10https://gerrit.wikimedia.org/r/1172090 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus)
[19:34:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029288 (10dancy) docker-registry.wikimedia.org/python3-devel:latest is another image that needs a rebuild.
[19:34:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[19:36:06] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[19:36:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P79776 and previous config saved to /var/cache/conftool/dbconfig/20250723-193644-fceratto.json
[19:41:03] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1023.eqiad.wmnet with OS bookworm
[19:41:09] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian)
[19:41:11] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian)
[19:41:13] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian)
[19:41:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian)
[19:41:44] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172082|AuthManager: Move temp account login to continueAuthentication (T398270)]] (duration: 11m 39s)
[19:41:50] <kostajh>	 Done deploying
[19:41:50] <stashbot>	 T398270: Temp Account persists after logging in and out - https://phabricator.wikimedia.org/T398270
[19:49:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[19:51:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P79777 and previous config saved to /var/cache/conftool/dbconfig/20250723-195152-fceratto.json
[19:53:28] <logmsgbot>	 !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: redfish-test
[19:53:56] <icinga-wm>	 RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[19:53:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[19:57:25] <logmsgbot>	 !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-serve1012.eqiad.wmnet with reason: redfish-test
[19:57:55] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1023.eqiad.wmnet with reason: host reimage
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T2000).
[20:00:05] <jouncebot>	 danisztls and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:16] <danisztls>	 o/
[20:02:17] <cscott>	 o/
[20:02:21] <RoanKattouw>	 I can deploy but I need a few minutes 
[20:02:39] <cscott>	 my patches can be deployed together. i can spiderpig.
[20:02:47] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1023.eqiad.wmnet with reason: host reimage
[20:07:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T399728)', diff saved to https://phabricator.wikimedia.org/P79778 and previous config saved to /var/cache/conftool/dbconfig/20250723-200659-fceratto.json
[20:07:07] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[20:07:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2238.codfw.wmnet with reason: Maintenance
[20:07:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2238 (T399728)', diff saved to https://phabricator.wikimedia.org/P79779 and previous config saved to /var/cache/conftool/dbconfig/20250723-200722-fceratto.json
[20:10:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T399728)', diff saved to https://phabricator.wikimedia.org/P79780 and previous config saved to /var/cache/conftool/dbconfig/20250723-201025-fceratto.json
[20:11:16] <cscott>	 is anyone deploying right now?  i'm going to start spiderpigging my config patches if not.
[20:12:33] <wikibugs>	 (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172106 (https://phabricator.wikimedia.org/T390007)
[20:12:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172106 (https://phabricator.wikimedia.org/T390007) (owner: 10DDesouza)
[20:12:57] <RoanKattouw>	 I'm not doing anything yet, you can go for it as far as I'm concerned
[20:13:25] <RoanKattouw>	 Also I guess I could have deployed from my phone while eating lunch now that we have Spiderpig, but maybe better that I didn't :) 
[20:16:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1075-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[20:18:26] <logmsgbot>	 !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1003.eqiad.wmnet with reason: redfish-test
[20:19:56] <cscott>	 ok, i'm going for it.
[20:20:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian)
[20:20:46] <bd808>	 RoanKattouw: that's the sort of stress test that proves value though. ;)
[20:21:03] <wikibugs>	 (03PS2) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172106 (https://phabricator.wikimedia.org/T390007)
[20:21:12] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the "Report Visual Bug" feature of Extension:ParserMigration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian)
[20:21:23] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian)
[20:21:35] <logmsgbot>	 !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1170549|Enable the "Report Visual Bug" feature of Extension:ParserMigration (T365371)]]
[20:21:40] <stashbot>	 T365371: ParserMigration: Add "report visual bug" link - https://phabricator.wikimedia.org/T365371
[20:23:43] <logmsgbot>	 !log cscott@deploy1003 cscott: Backport for [[gerrit:1170549|Enable the "Report Visual Bug" feature of Extension:ParserMigration (T365371)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:24:52] <wikibugs>	 (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172106 (https://phabricator.wikimedia.org/T390007) (owner: 10DDesouza)
[20:25:28] <wikibugs>	 (03PS1) 10C. Scott Ananian: Create "report visual bug" dialog [extensions/ParserMigration] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172108 (https://phabricator.wikimedia.org/T365371)
[20:25:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P79781 and previous config saved to /var/cache/conftool/dbconfig/20250723-202533-fceratto.json
[20:25:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ParserMigration] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172108 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian)
[20:26:53] <logmsgbot>	 !log cscott@deploy1003 cscott: Continuing with sync
[20:26:56] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172106 (https://phabricator.wikimedia.org/T390007) (owner: 10DDesouza)
[20:28:44] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[20:29:11] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[20:29:12] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[20:29:48] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[20:29:49] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[20:29:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11029439 (10bking) @Jhancock.wm following up on our IRC discussion yesterday, I've already spent hours troublesho...
[20:30:15] <cscott>	 RoanKattouw: the hard part to do from your phone is X-Wikimedia-Debug, I expect.
[20:30:35] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[20:30:59] <RoanKattouw>	 cscott: Yeah but if someone else requested the patch you can make them test it :) 
[20:32:08] <logmsgbot>	 !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170549|Enable the "Report Visual Bug" feature of Extension:ParserMigration (T365371)]] (duration: 10m 32s)
[20:32:13] <stashbot>	 T365371: ParserMigration: Add "report visual bug" link - https://phabricator.wikimedia.org/T365371
[20:32:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:32:40] <cscott>	 i've got one more, hang on
[20:33:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/ParserMigration] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172108 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian)
[20:33:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian)
[20:33:59] <wikibugs>	 (03Merged) 10jenkins-bot: Disable ParserMigration indicator and user notice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian)
[20:34:11] <wikibugs>	 (03Merged) 10jenkins-bot: Create "report visual bug" dialog [extensions/ParserMigration] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1172108 (https://phabricator.wikimedia.org/T365371) (owner: 10C. Scott Ananian)
[20:34:35] <logmsgbot>	 !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1172108|Create "report visual bug" dialog (T365371)]], [[gerrit:1165094|Disable ParserMigration indicator and user notice (T363484 T363472)]]
[20:34:44] <stashbot>	 T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484
[20:34:45] <stashbot>	 T363472: MinT MVP: Support gradual deployments - https://phabricator.wikimedia.org/T363472
[20:36:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.326s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:37:03] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:37:09] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:37:41] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1023.eqiad.wmnet with OS bookworm
[20:38:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[20:39:17] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:40:02] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] Gitlab: switchover from gitlab2002 to gitlab1004 [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto)
[20:40:09] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:40:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P79783 and previous config saved to /var/cache/conftool/dbconfig/20250723-204041-fceratto.json
[20:41:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.047s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:43:14] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172110
[20:43:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Homer: PyEz "ignore_warnings" does not work for port-block speed change warning - https://phabricator.wikimedia.org/T400261#11029477 (10cmooney)
[20:44:17] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:45:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:47:11] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:47:36] <danisztls>	 RoanKattouw: can you deploy mine?
[20:47:55] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:48:01] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.220 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:50:28] <cscott>	 sorry i forgot that one of my backports touches localization and so it will Take Forever to rebuild the container images
[20:50:48] <cscott>	 i should have let danisztls slip in ahead of me
[20:51:32] <danisztls>	 cscott: no problem, I can wait
[20:55:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T399728)', diff saved to https://phabricator.wikimedia.org/P79784 and previous config saved to /var/cache/conftool/dbconfig/20250723-205548-fceratto.json
[20:55:54] <stashbot>	 T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[20:58:49] <logmsgbot>	 !log cscott@deploy1003 cscott: Backport for [[gerrit:1172108|Create "report visual bug" dialog (T365371)]], [[gerrit:1165094|Disable ParserMigration indicator and user notice (T363484 T363472)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:58:56] <stashbot>	 T365371: ParserMigration: Add "report visual bug" link - https://phabricator.wikimedia.org/T365371
[20:58:57] <stashbot>	 T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484
[20:58:57] <stashbot>	 T363472: MinT MVP: Support gradual deployments - https://phabricator.wikimedia.org/T363472
[21:00:05] <wikibugs>	 (03PS1) 10Xcollazo: analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031)
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T2100)
[21:00:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] analytics: Remove rsync scripts that import Dumps 1 XML into HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo)
[21:01:20] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "<3" [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 (owner: 10Volans)
[21:02:44] <logmsgbot>	 !log cscott@deploy1003 cscott: Continuing with sync
[21:02:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:04:54] <wikibugs>	 (03CR) 10Xcollazo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo)
[21:06:11] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029488 (10Scott_French)
[21:06:52] <wikibugs>	 (03CR) 10Xcollazo: "Hmm.. tests are failing but the logs don't say why." [puppet] - 10https://gerrit.wikimedia.org/r/1172113 (https://phabricator.wikimedia.org/T396031) (owner: 10Xcollazo)
[21:08:17] <wikibugs>	 (03CR) 10Volans: [C:03+2] setup.py: pin prospector [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 (owner: 10Volans)
[21:11:39] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1075-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[21:11:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:11:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:15:32] <logmsgbot>	 !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172108|Create "report visual bug" dialog (T365371)]], [[gerrit:1165094|Disable ParserMigration indicator and user notice (T363484 T363472)]] (duration: 40m 57s)
[21:15:40] <stashbot>	 T365371: ParserMigration: Add "report visual bug" link - https://phabricator.wikimedia.org/T365371
[21:15:41] <stashbot>	 T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484
[21:15:41] <stashbot>	 T363472: MinT MVP: Support gradual deployments - https://phabricator.wikimedia.org/T363472
[21:21:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:22:34] <wikibugs>	 (03Merged) 10jenkins-bot: setup.py: pin prospector [software/homer] - 10https://gerrit.wikimedia.org/r/1171160 (owner: 10Volans)
[21:24:20] <wikibugs>	 (03CR) 10JHathaway: "looks good overall, just a few questions and ideas" [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey)
[21:27:35] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170760 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza)
[21:37:33] <cscott>	 I'm done, sorry didn't immediately say that here.
[21:37:49] <cscott>	 RoanKattouw were you going to do danisztls' patch?
[21:39:09] <RoanKattouw>	 cscott: I was an hour ago but I am busy now, sorry
[21:39:31] <cscott>	 danisztls: if you're available to test, i'm happy to run spiderpig for you.
[21:43:48] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Reuven! Also wow, I definitely learned something from the sleuthing you did on why `REMAINDER` isn't documented." [puppet] - 10https://gerrit.wikimedia.org/r/1172090 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus)
[21:50:30] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Ahmon!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172071 (owner: 10Ahmon Dancy)
[21:52:08] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[21:52:46] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Ahmon!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172064 (owner: 10Ahmon Dancy)
[21:53:26] <wikibugs>	 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#11029549 (10nisrael) Great thank you Jesse! Just want to confirm, am I safe to instruct our rep at DMarcian to restart our free trial?
[21:54:43] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Ahmon!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1172072 (owner: 10Ahmon Dancy)
[21:55:04] <wikibugs>	 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#11029551 (10jhathaway) >>! In T394788#11029549, @nisrael wrote: > Great thank you Jesse! Just want to confirm, am I safe to instruct our re...
[21:55:29] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1022 - vriley@cumin1002"
[21:55:33] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1022 - vriley@cumin1002"
[21:55:34] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:56:05] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host clouddb1022
[21:57:25] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host clouddb1022
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250723T2200)
[22:05:13] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:11:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:14:04] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10Mail: Access Request to DMarcDigests - https://phabricator.wikimedia.org/T399976#11029580 (10jhathaway) @nisrael I sent you an invite, let me know if you can get in.
[22:14:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11029581 (10VRiley-WMF)
[22:16:03] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:16:31] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1100 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:33] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1081 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:36] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:36] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:37] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:37] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:38] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:38] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:39] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:40] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:40] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:41] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:41] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1119 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:42] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029582 (10Scott_French)
[22:16:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:49] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:49] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:16:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:01] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:01] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:01] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[22:17:03] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:03] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:03] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:03] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:03] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:04] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1118 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:13] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1068 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:13] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:13] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:13] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:23] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:28] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:28] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:29] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:29] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:30] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:17:45] <inflatador>	 ^^ on it
[22:17:52] <inflatador>	 eqiad is depooled so no user impact
[22:18:12] <swfrench-wmf>	 thanks, inflatador!
[22:18:25] <swfrench-wmf>	 saw your depool earlier, but was just about to ask :)
[22:23:32] <wikibugs>	 (PS2) Ahmon Dancy: tox.ini: Pass --diff to black [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1172071
[22:24:09] <wikibugs>	 (CR) Ahmon Dancy: tox.ini: Pass --diff to black (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1172071 (owner: Ahmon Dancy)
[22:25:11] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1094 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 59,
[22:25:11] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2246, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:13] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1093 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 63,
[22:25:13] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 4116, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:15] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1113 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 63,
[22:25:15] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 4774, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:15] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1123 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 63,
[22:25:15] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 4934, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:15] <inflatador>	 swfrench-wmf np. we really need to figure out why quorum is an issue with this one particular cluster ;(
[22:25:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1076 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2
[22:25:19] <icinga-wm>	 _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 939, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1071 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2
[22:25:19] <icinga-wm>	 _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 941, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1085 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2
[22:25:20] <icinga-wm>	 _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 950, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1096 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2
[22:25:21] <icinga-wm>	 _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 961, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:21] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1098 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2
[22:25:22] <icinga-wm>	 _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 961, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1075 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2
[22:25:23] <icinga-wm>	 _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 966, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1087 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2
[22:25:24] <icinga-wm>	 _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 966, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:24] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1092 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2
[22:25:25] <icinga-wm>	 _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 963, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:25] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1079 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2
[22:25:26] <icinga-wm>	 _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 974, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:25:26] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1073 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2
[22:25:27] <icinga-wm>	 _of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 1002, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:39] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1078 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3944, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 417, delayed_unassigned_shards: 0, number_of_pendi
[22:26:39] <icinga-wm>	 : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 61228, active_shards_percent_as_number: 90.43797294198578 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:39] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1088 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3944, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 417, delayed_unassigned_shards: 0, number_of_pendi
[22:26:39] <icinga-wm>	 : 3, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 61237, active_shards_percent_as_number: 90.43797294198578 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:39] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1091 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3944, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 417, delayed_unassigned_shards: 0, number_of_pendi
[22:26:40] <icinga-wm>	 : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 61256, active_shards_percent_as_number: 90.43797294198578 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:40] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1102 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3944, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 417, delayed_unassigned_shards: 0, number_of_pendi
[22:26:41] <icinga-wm>	 : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 61253, active_shards_percent_as_number: 90.43797294198578 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:41] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1074 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3944, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 417, delayed_unassigned_shards: 0, number_of_pendi
[22:26:42] <icinga-wm>	 : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 61271, active_shards_percent_as_number: 90.43797294198578 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:42] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1099 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4090, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 269, delayed_unassigned_shards: 0, number_of_pendi
[22:26:42] <wikibugs>	 (CR) RLazarus: [C:+2] deployment_server: Fix argparse double-dash handling in mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1172090 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[22:26:43] <icinga-wm>	 : 59, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 64005, active_shards_percent_as_number: 93.7858289383169 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:43] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1125 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4090, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 269, delayed_unassigned_shards: 0, number_of_pendi
[22:26:44] <icinga-wm>	 : 59, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 64018, active_shards_percent_as_number: 93.7858289383169 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:45] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1072 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4314, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 42, delayed_unassigned_shards: 0, number_of_pendin
[22:26:45] <icinga-wm>	  2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68059, active_shards_percent_as_number: 98.92226553542766 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:45] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1112 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4314, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 42, delayed_unassigned_shards: 0, number_of_pendin
[22:26:46] <icinga-wm>	  2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68066, active_shards_percent_as_number: 98.92226553542766 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:46] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1077 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4314, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 42, delayed_unassigned_shards: 0, number_of_pendin
[22:26:47] <icinga-wm>	  2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68081, active_shards_percent_as_number: 98.92226553542766 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:47] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1103 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4314, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 42, delayed_unassigned_shards: 0, number_of_pendin
[22:26:48] <icinga-wm>	  2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68101, active_shards_percent_as_number: 98.92226553542766 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:53] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1086 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:26:53] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:53] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1108 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:26:53] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:55] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1095 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:26:55] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:55] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1101 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:26:55] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:55] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1124 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:26:56] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:56] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1115 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:26:57] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:57] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1084 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:26:58] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:58] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1069 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:26:59] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:26:59] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1111 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:27:00] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:27:00] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1118 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:27:01] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:27:07] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1068 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:27:07] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:27:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1100 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:27:23] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:27:25] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1114 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:27:25] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:27:27] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1081 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:27:27] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:27:27] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1120 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:27:27] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:27:27] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1090 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:27:28] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:27:28] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1110 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4361, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_
[22:38:57] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:01] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:01] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:03] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:05] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:05] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:05] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:05] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:05] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:06] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:17] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:17] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:17] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:17] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1068 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:23] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:28] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:28] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:28] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:28] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:29] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:29] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:31] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1100 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:33] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:36] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:36] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:37] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:37] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:38] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:39] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:39] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:40] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:49] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:49] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:39:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:41:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:43:55] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1086 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1404, active_shards: 4280, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 81, delayed_unassigned_shards: 81, number_of_pending_
[22:43:55] <icinga-wm>	 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1274, active_shards_percent_as_number: 98.14262783765192 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:43:55] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1122 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1404, active_shards: 4280, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 81, delayed_unassigned_shards: 81, number_of_pending_
[22:43:55] <icinga-wm>	 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1528, active_shards_percent_as_number: 98.14262783765192 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:43:55] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1108 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1404, active_shards: 4280, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 81, delayed_unassigned_shards: 81, number_of_pending_
[22:43:55] <icinga-wm>	 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2046, active_shards_percent_as_number: 98.14262783765192 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:43:57] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1084 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin
[22:43:57] <icinga-wm>	  46, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3837, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:43:57] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1115 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin
[22:43:57] <icinga-wm>	  46, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3846, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:43:57] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1124 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin
[22:43:58] <icinga-wm>	  46, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3842, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:43:59] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1095 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin
[22:43:59] <icinga-wm>	  46, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3843, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:43:59] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1101 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin
[22:44:00] <icinga-wm>	  46, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3847, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:00] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1111 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin
[22:44:01] <icinga-wm>	  48, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 5691, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:01] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1069 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1403, active_shards: 4118, relocating_shards: 8, initializing_shards: 0, unassigned_shards: 243, delayed_unassigned_shards: 162, number_of_pendin
[22:44:02] <icinga-wm>	  48, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 5717, active_shards_percent_as_number: 94.42788351295575 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:09] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1123 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pen
[22:44:09] <icinga-wm>	 ks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:09] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1093 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:09] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1068 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:09] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1094 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:15] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1113 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1087 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 262, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1092 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 4, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 274, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1075 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 4, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 274, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1076 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 4, number_of_in_flight_fetch: 330, task_max_waiting_in_queue_millis: 286, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1073 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 4, number_of_in_flight_fetch: 330, task_max_waiting_in_queue_millis: 286, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:21] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1098 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 4, number_of_in_flight_fetch: 495, task_max_waiting_in_queue_millis: 295, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1079 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 4, number_of_in_flight_fetch: 880, task_max_waiting_in_queue_millis: 307, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1096 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 4, number_of_in_flight_fetch: 880, task_max_waiting_in_queue_millis: 307, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:24] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1071 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 4, number_of_in_flight_fetch: 1045, task_max_waiting_in_queue_millis: 321, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:25] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1085 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 5, number_of_in_flight_fetch: 1265, task_max_waiting_in_queue_millis: 336, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:26] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1100 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4120, relocating_shards: 7, initializing_shards: 0, unassigned_shards: 241, delayed_unassigned_shards: 161, number_of_pending_tasks: 39, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3728, active_shards_percent_as_number: 94.47374455400137 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:44:27] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1114 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4176, relocating_shards: 7, initializing_shards: 4, unassigned_shards: 181, delayed_unassigned_shards: 121, number_of_pen
[22:53:43] <logmsgbot>	 !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227
[22:53:48] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[22:54:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:58:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11029630 (10Scott_French)
[23:01:49] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f048ae0e1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:01:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:01:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:01:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:01:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:01] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:01] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:05] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:08] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1118 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:08] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:14] <inflatador>	 ^^ we're still testing this, I think we have a root cause now
[23:02:19] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:19] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:19] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:19] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1068 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:23] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:27] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:28] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:28] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:29] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:29] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:30] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:31] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1100 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:33] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1081 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:36] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:36] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:37] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:37] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:38] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:38] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:39] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:40] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:40] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:41] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:41] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1119 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:42] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:51] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:51] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:05:13] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1094 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 56, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1811, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:05:13] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1068 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2788, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:05:15] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1093 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3232, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:05:15] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1123 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3949, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:05:17] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1113 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 53, number_of_data_nodes: 53, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 6873, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:05:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1076 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 54, task_max_waiting_in_queue_millis: 1260, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:05:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1079 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 54, task_max_waiting_in_queue_millis: 1255, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:05:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1073 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 54, task_max_waiting_in_queue_millis: 1261, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:39] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1088 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3878, relocating_shards: 0, initializing_shards: 37, unassigned_shards: 446, delayed_unassigned_shards: 0, number_of_pending_tasks: 63, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 76567, active_shards_percent_as_number: 88.92455858747994 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:39] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1091 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3878, relocating_shards: 0, initializing_shards: 37, unassigned_shards: 446, delayed_unassigned_shards: 0, number_of_pend
[23:06:39] <icinga-wm>	 s: 63, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 76575, active_shards_percent_as_number: 88.92455858747994 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:39] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1078 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3878, relocating_shards: 0, initializing_shards: 37, unassigned_shards: 446, delayed_unassigned_shards: 0, number_of_pend
[23:06:39] <icinga-wm>	 s: 63, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 76596, active_shards_percent_as_number: 88.92455858747994 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:39] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1102 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3878, relocating_shards: 0, initializing_shards: 37, unassigned_shards: 446, delayed_unassigned_shards: 0, number_of_pend
[23:06:40] <icinga-wm>	 s: 22, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 76597, active_shards_percent_as_number: 88.92455858747994 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:40] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1074 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 3899, relocating_shards: 0, initializing_shards: 16, unassigned_shards: 446, delayed_unassigned_shards: 0, number_of_pend
[23:06:41] <icinga-wm>	 s: 22, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 76612, active_shards_percent_as_number: 89.40609951845907 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:43] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1099 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4062, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 292, delayed_unassigned_shards: 0, number_of_pendi
[23:06:43] <icinga-wm>	 : 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 80124, active_shards_percent_as_number: 93.14377436367806 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:43] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1125 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4062, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 292, delayed_unassigned_shards: 0, number_of_pendi
[23:06:43] <icinga-wm>	 : 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 80136, active_shards_percent_as_number: 93.14377436367806 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:45] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1072 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4165, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pendi
[23:06:45] <icinga-wm>	 : 22, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 83002, active_shards_percent_as_number: 95.50561797752809 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:45] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1112 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4165, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pendi
[23:06:45] <icinga-wm>	 : 24, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 83009, active_shards_percent_as_number: 95.50561797752809 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:45] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1077 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4165, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pendi
[23:06:46] <icinga-wm>	 : 23, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 83006, active_shards_percent_as_number: 95.50561797752809 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:46] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1103 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4165, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pendi
[23:06:47] <icinga-wm>	 : 31, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 83023, active_shards_percent_as_number: 95.50561797752809 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:49] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1122 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4282, relocating_shards: 0, initializing_shards: 10, unassigned_shards: 69, delayed_unassigned_shards: 0, number_of_pendi
[23:06:49] <icinga-wm>	 : 12, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 86669, active_shards_percent_as_number: 98.18848887869754 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:53] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1086 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4333, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 24, delayed_unassigned_shards: 0, number_of_pendin
[23:06:53] <icinga-wm>	  1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.35794542536117 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:53] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1108 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4333, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 24, delayed_unassigned_shards: 0, number_of_pendin
[23:06:53] <icinga-wm>	  1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.35794542536117 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:59] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1069 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending
[23:06:59] <icinga-wm>	 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:59] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1101 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending
[23:06:59] <icinga-wm>	 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:59] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1095 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending
[23:06:59] <icinga-wm>	 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:06:59] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1115 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending
[23:07:00] <icinga-wm>	 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:00] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1084 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending
[23:07:01] <icinga-wm>	 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:01] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1124 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending
[23:07:02] <icinga-wm>	 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:02] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1111 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending
[23:07:03] <icinga-wm>	 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:03] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1118 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4347, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 9, delayed_unassigned_shards: 0, number_of_pending
[23:07:04] <icinga-wm>	 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.67897271268058 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1096 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:07:19] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1071 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:07:19] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1085 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:07:20] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1087 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:07:21] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:21] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1092 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:07:22] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:22] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1075 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:07:23] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1098 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:07:24] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:24] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1100 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:07:25] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:25] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1114 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:07:26] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:27] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1121 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:07:27] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.90827791790873 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:07:27] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1090 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4357, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[23:08:57] <logmsgbot>	 !log bking@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 55 hosts with reason: testing cluster quorum
[23:09:27] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:11:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:14:38] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad
[23:15:03] <inflatador>	 !log pool cirrussearch eqiad, will resume investigations tomorrow T400160
[23:15:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:08] <stashbot>	 T400160: Investigate eqiad cluster quorum failure issues - https://phabricator.wikimedia.org/T400160
[23:16:03] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:38:05] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172124
[23:38:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172124 (owner: 10TrainBranchBot)
[23:39:44] <logmsgbot>	 dzahn@cumin2002 dzahn: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[23:42:44] <logmsgbot>	 dzahn@cumin2002 upgrade (PID 1166963) is awaiting input
[23:46:09] <logmsgbot>	 !log ryankemper@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=search,name=codfw
[23:46:35] <ryankemper>	 !log [Cirrus] Depooled codfw in anticipation of rolling restart. Hopefully minimal noise on this one :)
[23:46:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:58] <logmsgbot>	 !log ryankemper@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: activate new plugins packages - ryankemper@cumin1002 - T397227
[23:49:03] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[23:51:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:53:46] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172124 (owner: 10TrainBranchBot)
[23:54:09] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: security release 20250723
[23:54:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:55:03] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:55:11] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:56:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:56:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:59:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed