[00:04:39] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:09:10] RESOLVED: [2x] SystemdUnitFailed: opensearch_2@cloudelastic-psi-eqiad.servic.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:09:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:11:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [00:14:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:19:34] FIRING: [220x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:21:53] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [00:21:59] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [00:22:02] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[00:22:08] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [00:24:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:26:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [00:26:45] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [00:26:53] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [00:26:59] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [00:29:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:32:04] (CR) RLazarus: "Yikes, the diff is definitely a mess."
[deployment-charts] - https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) (owner: Elukey) [00:39:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:44:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:49:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:59:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:59:39] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:06:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [01:06:51] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:06:53] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[01:06:59] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:09:16] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.26 [core] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1277808 (https://phabricator.wikimedia.org/T423877) [01:09:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.26 [core] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1277808 (https://phabricator.wikimedia.org/T423877) (owner: TrainBranchBot) [01:09:26] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1277809 [01:09:26] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1277809 (owner: TrainBranchBot) [01:09:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:11:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [01:11:45] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:11:53] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[01:11:59] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:18:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:19:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:20:00] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.26 [core] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1277808 (https://phabricator.wikimedia.org/T423877) (owner: TrainBranchBot) [01:20:23] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [01:20:29] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:20:32] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[01:20:38] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:21:19] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1277809 (owner: TrainBranchBot) [01:24:39] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:29:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:34:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:34:39] FIRING: [11x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:39:39] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:49:12] ops-eqiad, DC-Ops: verify if cable is connected or not - https://phabricator.wikimedia.org/T424601 (Jhancock.wm) NEW [01:53:27] ops-eqiad, DC-Ops: verify cables - https://phabricator.wikimedia.org/T424601#11863891 (Jhancock.wm) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous
deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T0200) [02:01:10] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:04:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 5d 11h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [02:07:26] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 15s) [02:09:19] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:10:15] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [02:10:21] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:14:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:15:15] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[02:15:15] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:20:15] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [02:20:21] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [02:20:23] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [02:20:24] (CR) BPirkle: [C:+1] Add wikibase.v1 module to the sandbox were it is present [mediawiki-config] - https://gerrit.wikimedia.org/r/1276814 (https://phabricator.wikimedia.org/T422403) (owner: Aaron Schulz) [02:20:29] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:29:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:31:15] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ...
[02:31:21] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [02:31:32] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [02:31:38] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:34:19] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:39:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:41:15] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[02:41:21] CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [02:41:32] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [02:41:38] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:41:50] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [02:44:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:49:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T0300) [03:01:54] (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.26 [mediawiki-config] -
https://gerrit.wikimedia.org/r/1277863 (https://phabricator.wikimedia.org/T423877) [03:01:57] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1277863 (https://phabricator.wikimedia.org/T423877) (owner: TrainBranchBot) [03:02:50] (Merged) jenkins-bot: testwikis to 1.46.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/1277863 (https://phabricator.wikimedia.org/T423877) (owner: TrainBranchBot) [03:03:16] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.46.0-wmf.26 refs T423877 [03:03:21] T423877: 1.46.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T423877 [03:07:19] SRE, Traffic, WMF-General-or-Unknown: API rate limit triggered for regular user - https://phabricator.wikimedia.org/T424588#11864010 (Bugreporter) [03:09:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:09:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [03:14:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:14:39] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:18:25] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:24:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [03:28:25] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:29:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:34:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:38:50] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.46.0-wmf.26 refs T423877 (duration: 35m 34s) [03:38:55] T423877: 1.46.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T423877 [03:39:39] FIRING: [12x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:50:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 16.46% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:54:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:55:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases 
routed via main at eqiad: 23.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:59:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T0400) [04:04:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:04:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:14:30] FIRING: [2x] PHPFPMTooBusy: Not 
enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.97% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:19:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:24:39] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:29:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:37:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:et-1/1/5 (Transport: cr2-codfw:et-0/1/4 (Lumen, 449169461) {#changeme_lumen_patch}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:39:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:42:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:44:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is 
about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:49:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:39] FIRING: [13x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:57:22] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11864039 (Papaul) [04:59:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:59:39] FIRING: [11x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:03:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s6 T424522 [05:03:26] T424522: Switchover s6 master (db1173 -> db1201) - https://phabricator.wikimedia.org/T424522 [05:03:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1201 with weight 0 T424522', diff saved to https://phabricator.wikimedia.org/P91697 and previous config saved to /var/cache/conftool/dbconfig/20260428-050328-marostegui.json [05:04:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire -
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:04:45] (CR) Marostegui: [C:+2] mariadb: Promote db1201 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1277551 (https://phabricator.wikimedia.org/T424522) (owner: Gerrit maintenance bot) [05:06:28] (PS1) Kevin Bazira: ml-services: bump up k8s resources in llm ns to enable gpt isvc deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1277934 (https://phabricator.wikimedia.org/T418350) [05:06:52] !log Starting s6 eqiad failover from db1173 to db1201 - T424522 [05:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T424522', diff saved to https://phabricator.wikimedia.org/P91698 and previous config saved to /var/cache/conftool/dbconfig/20260428-050714-marostegui.json [05:07:17] marostegui@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[05:07:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1201 to s6 primary and set section read-write T424522', diff saved to https://phabricator.wikimedia.org/P91699 and previous config saved to /var/cache/conftool/dbconfig/20260428-050742-marostegui.json [05:08:37] (CR) Marostegui: [C:+2] wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1277552 (https://phabricator.wikimedia.org/T424522) (owner: Gerrit maintenance bot) [05:08:44] !log marostegui@dns1005 START - running authdns-update [05:09:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:09:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1173 T424522', diff saved to https://phabricator.wikimedia.org/P91700 and previous config saved to /var/cache/conftool/dbconfig/20260428-050938-marostegui.json [05:09:44] T424522: Switchover s6 master (db1173 -> db1201) - https://phabricator.wikimedia.org/T424522 [05:10:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1173: Repooling after switchover [05:10:15] !log marostegui@dns1005 END - running authdns-update [05:10:28] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db1173: Repooling after switchover [05:10:38] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1173: Repooling after switchover [05:11:13] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [05:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main -
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:17:12] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [05:17:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:17:50] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 135 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:18:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:19:39] FIRING: [10x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:20:12] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [05:21:33] (PS1) Marostegui: db1173: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1278038 (https://phabricator.wikimedia.org/T422777) [05:24:01] (CR) Marostegui: [C:+2] db1173: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1278038 (https://phabricator.wikimedia.org/T422777) (owner: Marostegui) [05:25:32] (PS1) Marostegui: db1161: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1278175 (https://phabricator.wikimedia.org/T424323) [05:26:15] !log
marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 14 hosts with reason: Sanitarium: reimage to Debian Trixie [05:26:28] (03CR) 10Marostegui: [C:03+2] db1161: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1278175 (https://phabricator.wikimedia.org/T424323) (owner: 10Marostegui) [05:26:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1161.eqiad.wmnet with reason: Reimage to Trixie [05:26:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1161: Reimage to Trixie [05:29:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:29:52] (03PS1) 10Kevin Bazira: ml-services: deploy gpt isvc in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278182 (https://phabricator.wikimedia.org/T418350) [05:30:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.14% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:35:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.38% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:36:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1161: Reimage to Trixie [05:37:13] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1161.eqiad.wmnet with 
OS trixie [05:41:24] (03PS1) 10Marostegui: db2192: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1278190 (https://phabricator.wikimedia.org/T424323) [05:41:56] (03CR) 10Marostegui: [C:03+2] db2192: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1278190 (https://phabricator.wikimedia.org/T424323) (owner: 10Marostegui) [05:42:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2192.codfw.wmnet with reason: Reimage to Trixie [05:42:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2192: Reimage to Trixie [05:42:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2192: Reimage to Trixie [05:44:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:45:33] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2192.codfw.wmnet with OS trixie [05:49:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:50:02] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1161.eqiad.wmnet with reason: host reimage [05:53:31] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [05:54:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:55:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1161.eqiad.wmnet with reason: host reimage [05:56:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1173: Repooling after switchover [05:59:34] FIRING: [221x] CertAlmostExpired: 
Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T0600) [06:00:05] marostegui, Amir1, and federico3: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T0600). [06:00:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1173.eqiad.wmnet with reason: Reimage to Trixie [06:00:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1173: Reimage to Trixie [06:01:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1173: Reimage to Trixie [06:02:12] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1173.eqiad.wmnet with OS trixie [06:03:31] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [06:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 5d 7h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [06:07:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2192.codfw.wmnet with reason: host reimage [06:09:05] (03PS1) 10Marostegui: Revert "db1161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1278209 [06:10:12] (03CR) 10Marostegui: [C:03+2] Revert "db1161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1278209 (owner: 10Marostegui) [06:11:00] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir5001.eqsin.wmnet [06:11:36] !log marostegui@cumin1003 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2192.codfw.wmnet with reason: host reimage [06:11:56] (03PS1) 10Muehlenhoff: Remove ncredir5001/5002 from conf-tool [puppet] - 10https://gerrit.wikimedia.org/r/1278213 (https://phabricator.wikimedia.org/T421863) [06:17:00] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [06:17:51] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: host reimage [06:18:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1161.eqiad.wmnet with OS trixie [06:19:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:19:59] (03PS1) 10Marostegui: Revert "db1173: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1278221 [06:20:13] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:21:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1161: after reimage to trixie [06:22:43] jmm@cumin2002 decommission (PID 3343980) is awaiting input [06:23:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: host reimage [06:24:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:24:47] (03PS1) 10Marostegui: Revert "db2192: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1278224 [06:25:13] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::b6f9:5dff:fe30:e538 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:26:33] (03CR) 10Muehlenhoff: "Looks good, two comments inline to questions raised on the task." [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) (owner: 10Dwisehaupt) [06:27:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [06:28:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [06:28:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:28:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir5001.eqsin.wmnet [06:28:31] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864192 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir5001.eqsin.wmnet` - ncredir5001.eqsin.wmnet (**PASS**... 
[06:28:38] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir5002.eqsin.wmnet [06:29:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:29:45] (03CR) 10Marostegui: [C:03+2] Revert "db2192: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1278224 (owner: 10Marostegui) [06:33:48] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [06:34:39] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:35:50] (03CR) 10Filippo Giunchedi: [C:03+1] Designate: use zookeeper as the tooz backend, everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1277747 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [06:36:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:36:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2192.codfw.wmnet with OS trixie [06:38:18] (03PS2) 10Muehlenhoff: Avoid false positive alerts after Ganeti master failover [puppet] - 10https://gerrit.wikimedia.org/r/1272701 [06:38:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2192: after reimage to trixie [06:39:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [06:39:08] (03CR) 10Marostegui: [C:03+2] Revert "db1173: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1278221 
(owner: 10Marostegui) [06:39:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:41:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:41:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [06:41:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:41:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir5002.eqsin.wmnet [06:42:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir5002.eqsin.wmnet` - ncredir5002.eqsin.wmnet (**PASS**... 
[06:44:49] (03CR) 10Muehlenhoff: [C:03+2] Remove ncredir5001/5002 from conf-tool [puppet] - 10https://gerrit.wikimedia.org/r/1278213 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [06:45:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864206 (10MoritzMuehlenhoff) [06:46:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864207 (10MoritzMuehlenhoff) [06:47:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [06:47:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864208 (10ops-monitoring-bot) Draining ganeti5005.eqsin.wmnet of running VMs [06:48:25] (03CR) 10JMeybohm: [C:03+2] admin_ng: Move all clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277622 (https://phabricator.wikimedia.org/T420993) (owner: 10JMeybohm) [06:49:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet [06:49:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1173.eqiad.wmnet with OS trixie [06:49:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:50:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install5003.wikimedia.org to plain [06:51:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 13.26% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:51:17] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1173: after reimage to trixie [06:52:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864210 (10ops-monitoring-bot) VM install5003.wikimedia.org switching disk type to plain [06:52:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install5003.wikimedia.org to plain [06:54:36] 10ops-eqiad, 06DC-Ops: Alert for device ps1-d1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T424614 (10phaultfinder) 03NEW [06:54:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus5002.eqsin.wmnet to plain [06:56:35] (03Merged) 10jenkins-bot: admin_ng: Move all clusters to the pki discovery2026 intermediate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277622 (https://phabricator.wikimedia.org/T420993) (owner: 10JMeybohm) [06:56:44] (03PS1) 10Arnaudb: gerrit: add paging blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1278238 (https://phabricator.wikimedia.org/T423035) [06:56:48] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864235 (10ops-monitoring-bot) VM prometheus5002.eqsin.wmnet switching disk type to plain [06:57:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus5002.eqsin.wmnet to plain [06:59:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:04:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum5001.eqsin.wmnet to plain [07:04:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:04:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864244 (10ops-monitoring-bot) VM durum5001.eqsin.wmnet switching disk type to plain [07:05:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum5001.eqsin.wmnet to plain [07:07:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1161: after reimage to trixie [07:07:28] PROBLEM - Bird Internet Routing Daemon on durum5001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:09:28] RECOVERY - Bird Internet Routing Daemon on durum5001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:09:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:12:06] (03CR) 10DCausse: [C:03+1] profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [07:13:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for 
changing disk type of durum5002.eqsin.wmnet to plain [07:13:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864246 (10ops-monitoring-bot) VM durum5002.eqsin.wmnet switching disk type to plain [07:14:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum5002.eqsin.wmnet to plain [07:14:34] FIRING: [220x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:16:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:17:30] PROBLEM - Bird Internet Routing Daemon on durum5002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:18:30] RECOVERY - Bird Internet Routing Daemon on durum5002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:19:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:21:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:24:11] !log switching cfss-issuer instances on all clusters to use discovery2026 - T420993 [07:24:15] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:16] T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993 [07:24:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2192: after reimage to trixie [07:24:30] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [07:24:33] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [07:24:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:24:38] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [07:24:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of hcaptcha-proxy5001.wikimedia.org to plain [07:24:43] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [07:24:48] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [07:24:52] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [07:24:56] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [07:25:00] (03PS1) 10Effie Mouzeli: data.yaml: remove old keys [puppet] - 10https://gerrit.wikimedia.org/r/1278249 [07:25:03] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [07:25:08] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [07:25:11] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [07:25:16] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:25:19] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:25:23] !log jayme@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [07:25:30] !log jayme@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[07:25:33] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [07:25:37] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:25:41] !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [07:25:47] !log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [07:27:42] jmm@cumin2002 changedisk (PID 3391408) is awaiting input [07:29:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864282 (10ops-monitoring-bot) VM hcaptcha-proxy5001.wikimedia.org switching disk type to plain [07:29:06] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1278249 (owner: 10Effie Mouzeli) [07:29:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of hcaptcha-proxy5001.wikimedia.org to plain [07:29:34] FIRING: [222x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:32:28] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy5001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:33:28] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy5001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:36:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1173: after reimage to trixie [07:36:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of hcaptcha-proxy5002.wikimedia.org to plain [07:37:21] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11864287 (10ops-monitoring-bot) VM 
hcaptcha-proxy5002.wikimedia.org switching disk type to plain [07:37:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of hcaptcha-proxy5002.wikimedia.org to plain [07:39:18] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy5002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:39:49] (03PS1) 10Muehlenhoff: Switch remaining Ganeti clusters to discovery2026 intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278256 (https://phabricator.wikimedia.org/T420993) [07:40:13] FIRING: [3x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.13 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:41:18] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy5002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:45:13] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.13 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:50:21] (03PS3) 10Bartosz Dziewoński: Move privileged global and local group handling to WikimediaCustomizations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271969 (https://phabricator.wikimedia.org/T418507) [07:50:34] (03PS1) 10Urbanecm: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278260 (https://phabricator.wikimedia.org/T415312) [07:51:01] (03CR) 10Urbanecm: [C:03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278260 (https://phabricator.wikimedia.org/T415312) (owner: 10Urbanecm) [07:53:06] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278260 (https://phabricator.wikimedia.org/T415312) 
(owner: 10Urbanecm) [07:54:34] !log urbanecm@deploy1003 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [07:54:57] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [07:55:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2150 (T419961)', diff saved to https://phabricator.wikimedia.org/P91720 and previous config saved to /var/cache/conftool/dbconfig/20260428-075504-fceratto.json [07:55:50] !log started renewal of certificates on codfw kubernetes clusters - T420993 [07:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:55] T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993 [07:56:16] !log urbanecm@deploy1003 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [07:57:02] Emperor, XioNoX: I'm renewing all discovery certificates on all k8s clusters in codfw during the next 30min [07:57:06] eqiad will follow after [07:57:32] if you see something funny, feel free to ping me [07:58:09] (03PS1) 10Urbanecm: Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278263 (https://phabricator.wikimedia.org/T415312) [07:58:20] (03CR) 10Urbanecm: [C:03+2] Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278263 (https://phabricator.wikimedia.org/T415312) (owner: 10Urbanecm) [07:58:37] jayme: something funny like this https://www.reddit.com/r/funny/comments/1sigdya/15_points_to_microsoft/ ? [07:59:16] XioNoX: eheh, yes. 
Stuff like that exactly [07:59:34] FIRING: [220x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:59:39] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:00:34] (03Merged) 10jenkins-bot: Revert "linkrecommendation: Bump version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278263 (https://phabricator.wikimedia.org/T415312) (owner: 10Urbanecm) [08:01:39] !log urbanecm@deploy1003 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [08:02:00] !log urbanecm@deploy1003 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [08:03:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T419961)', diff saved to https://phabricator.wikimedia.org/P91721 and previous config saved to /var/cache/conftool/dbconfig/20260428-080337-fceratto.json [08:04:04] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-ctrl100[45] implementation tracking - https://phabricator.wikimedia.org/T418920#11864380 (10MLechvien-WMF) 05Stalled→03Open [08:04:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:05:22] (03CR) 10Brouberol: [C:03+1] deployment_server: move charlie/admin_ng to debian package [puppet] - 10https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) (owner: 10Atsuko) [08:09:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:10:18] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs 
https://wikitech.wikimedia.org/wiki/Orchestrator [08:13:12] !log installing e2fsprogs updates from trixie point release [08:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P91722 and previous config saved to /var/cache/conftool/dbconfig/20260428-081346-fceratto.json [08:14:34] FIRING: [221x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:15:35] (03PS1) 10MVernon: swift: restore 3 reimaged hosts, drain next 2 [puppet] - 10https://gerrit.wikimedia.org/r/1278274 (https://phabricator.wikimedia.org/T421719) [08:18:06] (03CR) 10Marostegui: [C:03+1] swift: restore 3 reimaged hosts, drain next 2 [puppet] - 10https://gerrit.wikimedia.org/r/1278274 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [08:18:14] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11864434 (10MoritzMuehlenhoff) [08:19:34] FIRING: [213x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:20:14] (03CR) 10MVernon: [C:03+2] swift: restore 3 reimaged hosts, drain next 2 [puppet] - 10https://gerrit.wikimedia.org/r/1278274 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [08:23:37] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2150: Repooling [08:23:46] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2150: Repooling [08:24:34] FIRING: [205x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:27:48] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db2157.codfw.wmnet with reason: Maintenance [08:27:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T419635)', diff saved to https://phabricator.wikimedia.org/P91724 and previous config saved to /var/cache/conftool/dbconfig/20260428-082756-fceratto.json [08:28:01] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:29:29] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [08:29:34] FIRING: [191x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:29:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2159 (T419961)', diff saved to https://phabricator.wikimedia.org/P91725 and previous config saved to /var/cache/conftool/dbconfig/20260428-082937-fceratto.json [08:31:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T419635)', diff saved to https://phabricator.wikimedia.org/P91726 and previous config saved to /var/cache/conftool/dbconfig/20260428-083127-fceratto.json [08:33:20] (PS1) Brouberol: kafka-jumbo: deploy kafka 3.7 to all brokers [puppet] - https://gerrit.wikimedia.org/r/1278355 (https://phabricator.wikimedia.org/T424527) [08:34:00] (PS2) Brouberol: kafka-jumbo: deploy kafka 3.7 to all brokers [puppet] - https://gerrit.wikimedia.org/r/1278355 (https://phabricator.wikimedia.org/T424527) [08:34:34] FIRING: [183x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:35:07] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1278355 (https://phabricator.wikimedia.org/T424527) 
(owner: Brouberol) [08:35:50] (PS1) Marostegui: db2238: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1278357 (https://phabricator.wikimedia.org/T424615) [08:36:24] (CR) Atsuko: [C:+2] deployment_server: move charlie/admin_ng to debian package [puppet] - https://gerrit.wikimedia.org/r/1277471 (https://phabricator.wikimedia.org/T423078) (owner: Atsuko) [08:37:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T419961)', diff saved to https://phabricator.wikimedia.org/P91727 and previous config saved to /var/cache/conftool/dbconfig/20260428-083706-fceratto.json [08:37:18] (CR) Marostegui: [C:+2] db2238: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1278357 (https://phabricator.wikimedia.org/T424615) (owner: Marostegui) [08:39:17] (CR) Muehlenhoff: [C:+2] Switch remaining Ganeti clusters to discovery2026 intermediate [puppet] - https://gerrit.wikimedia.org/r/1278256 (https://phabricator.wikimedia.org/T420993) (owner: Muehlenhoff) [08:39:34] FIRING: [176x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:40:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2238.codfw.wmnet with reason: Reimage to Trixie [08:40:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2238: Reimage to Trixie [08:41:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2238: Reimage to Trixie [08:41:29] (CR) Majavah: "in cloud vps, `CACHES` will contain the shared (project-proxy) http reverse proxiese, where all requests would be coming from, so we can m" [puppet] - https://gerrit.wikimedia.org/r/1276526 (https://phabricator.wikimedia.org/T149804) (owner: Muehlenhoff) [08:41:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff 
saved to https://phabricator.wikimedia.org/P91729 and previous config saved to /var/cache/conftool/dbconfig/20260428-084135-fceratto.json [08:42:20] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2238.codfw.wmnet with OS trixie [08:42:21] !log started renewal of certificates on eqiad kubernetes clusters - T420993 [08:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:26] T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993 [08:44:19] FIRING: JobUnavailable: Reduced availability for job gnmic in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:44:34] FIRING: [165x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:44:39] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:44:41] !log migrate Ganeti clusters to the new discovery2026 intermediate, starting for the edges T420993 [08:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P91730 and previous config saved to /var/cache/conftool/dbconfig/20260428-084714-fceratto.json [08:49:19] RESOLVED: JobUnavailable: Reduced availability for job gnmic in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:49:34] FIRING: [163x] CertAlmostExpired: 
Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:51:01] (CR) Jelto: [C:+1] "lgtm and similar settings to the Phabricator blackbox check. At some point it would be nice to not duplicate the check and generate twice " [puppet] - https://gerrit.wikimedia.org/r/1278238 (https://phabricator.wikimedia.org/T423035) (owner: Arnaudb) [08:51:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P91731 and previous config saved to /var/cache/conftool/dbconfig/20260428-085142-fceratto.json [08:53:35] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:54:34] FIRING: [163x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:55:49] (PS8) Dpogorzelski: amg-gpu: Set up explicit GPU partitioning [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) [08:56:35] (CR) Dpogorzelski: "borrowed same while loop" [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: Dpogorzelski) [08:57:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P91732 and previous config saved to /var/cache/conftool/dbconfig/20260428-085722-fceratto.json [08:58:03] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2238.codfw.wmnet with reason: host reimage [08:59:34] FIRING: [163x] CertAlmostExpired: 
Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:01:18] (CR) Arnaudb: [C:+2] gerrit: add paging blackbox check [puppet] - https://gerrit.wikimedia.org/r/1278238 (https://phabricator.wikimedia.org/T423035) (owner: Arnaudb) [09:01:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T419635)', diff saved to https://phabricator.wikimedia.org/P91733 and previous config saved to /var/cache/conftool/dbconfig/20260428-090150-fceratto.json [09:02:00] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:02:07] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [09:02:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T419635)', diff saved to https://phabricator.wikimedia.org/P91734 and previous config saved to /var/cache/conftool/dbconfig/20260428-090215-fceratto.json [09:03:22] (PS1) VadymTS1: enwikiversity: Add some user rights to the curator user group on English Wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) [09:04:34] FIRING: [160x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:04:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2238.codfw.wmnet with reason: host reimage [09:04:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T419635)', diff saved to https://phabricator.wikimedia.org/P91735 and previous config saved to /var/cache/conftool/dbconfig/20260428-090446-fceratto.json [09:07:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T419961)', diff saved to 
https://phabricator.wikimedia.org/P91736 and previous config saved to /var/cache/conftool/dbconfig/20260428-090730-fceratto.json [09:07:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [09:08:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T419961)', diff saved to https://phabricator.wikimedia.org/P91737 and previous config saved to /var/cache/conftool/dbconfig/20260428-090759-fceratto.json [09:09:34] FIRING: [145x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:12:08] (PS1) Marostegui: Revert "db2238: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1278364 [09:13:10] (PS2) Arnaudb: envoy: configure listener buffer and fast open queue length [puppet] - https://gerrit.wikimedia.org/r/1277503 (https://phabricator.wikimedia.org/T421827) [09:14:34] FIRING: [140x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:14:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P91738 and previous config saved to /var/cache/conftool/dbconfig/20260428-091454-fceratto.json [09:15:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T419961)', diff saved to https://phabricator.wikimedia.org/P91739 and previous config saved to /var/cache/conftool/dbconfig/20260428-091527-fceratto.json [09:15:50] (CR) Marostegui: [C:+2] Revert "db2238: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1278364 (owner: Marostegui) [09:17:56] (CR) Klausman: [C:+1] amg-gpu: Set up explicit GPU partitioning [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 
Dpogorzelski) [09:18:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:19:37] PROBLEM - Thanos swift https on thanos-fe1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [09:20:24] (PS9) Dpogorzelski: amg-gpu: Set up explicit GPU partitioning [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) [09:21:30] !log migrate eqiad/codfw Ganeti clusters to the new discovery2026 intermediate T420993 [09:21:33] (CR) Isabelle Hurbain-Palatin: [C:+1] Deploy PRV to 10 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) (owner: Arlolra) [09:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:34] T420993: Rotate discovery intermediate certificate (expires 2026-05-03) - https://phabricator.wikimedia.org/T420993 [09:22:27] RECOVERY - Thanos swift https on thanos-fe1007 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Thanos [09:25:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P91740 and previous config saved to /var/cache/conftool/dbconfig/20260428-092502-fceratto.json [09:25:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P91741 and previous config saved to /var/cache/conftool/dbconfig/20260428-092534-fceratto.json [09:27:28] SRE, Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11864687 (MoritzMuehlenhoff) [09:28:26] !log marostegui@cumin1003 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host db2238.codfw.wmnet with OS trixie [09:29:34] FIRING: [108x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:31:48] (PS1) Blake: k8s: Remove support for k8s versions before 1.31 [puppet] - https://gerrit.wikimedia.org/r/1278370 (https://phabricator.wikimedia.org/T423251) [09:32:20] (CR) Dpogorzelski: [C:+2] amg-gpu: Set up explicit GPU partitioning [puppet] - https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: Dpogorzelski) [09:32:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2238: after reimage to trixie [09:34:20] (CR) Btullis: [C:+1] "Nice." [puppet] - https://gerrit.wikimedia.org/r/1278355 (https://phabricator.wikimedia.org/T424527) (owner: Brouberol) [09:34:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:34:39] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:35:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T419635)', diff saved to https://phabricator.wikimedia.org/P91743 and previous config saved to /var/cache/conftool/dbconfig/20260428-093510-fceratto.json [09:35:15] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:35:28] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [09:35:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2178 (T419635)', diff saved to https://phabricator.wikimedia.org/P91744 and previous config saved to 
/var/cache/conftool/dbconfig/20260428-093535-fceratto.json [09:35:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P91745 and previous config saved to /var/cache/conftool/dbconfig/20260428-093543-fceratto.json [09:36:41] !log installing openjdk-21 security updates [09:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:39:39] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:40:41] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [09:40:48] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [09:41:00] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [09:41:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [09:41:34] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [09:41:41] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [09:41:51] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [09:41:55] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [09:42:12] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [09:42:19] !log btullis@deploy1003 
helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [09:42:29] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [09:42:33] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [09:42:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T419635)', diff saved to https://phabricator.wikimedia.org/P91746 and previous config saved to /var/cache/conftool/dbconfig/20260428-094240-fceratto.json [09:42:45] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:45:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T419961)', diff saved to https://phabricator.wikimedia.org/P91747 and previous config saved to /var/cache/conftool/dbconfig/20260428-094551-fceratto.json [09:46:14] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [09:46:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T419961)', diff saved to https://phabricator.wikimedia.org/P91748 and previous config saved to /var/cache/conftool/dbconfig/20260428-094621-fceratto.json [09:49:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:49:39] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:52:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P91750 and previous config saved to /var/cache/conftool/dbconfig/20260428-095248-fceratto.json 
[09:53:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T419961)', diff saved to https://phabricator.wikimedia.org/P91751 and previous config saved to /var/cache/conftool/dbconfig/20260428-095340-fceratto.json [09:54:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:55:24] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:55:33] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:56:08] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:57:16] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [09:59:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1000) [10:02:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P91753 and previous config saved to /var/cache/conftool/dbconfig/20260428-100256-fceratto.json [10:03:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P91754 and previous config saved to /var/cache/conftool/dbconfig/20260428-100348-fceratto.json [10:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 5d 3h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [10:08:36] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:08:43] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [10:09:19] FIRING: JobUnavailable: Reduced availability for job gnmic in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:11:18] (PS1) STran: Update action parameter for bulk blocking instrumented events [extensions/CheckUser] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1278380 (https://phabricator.wikimedia.org/T420517) [10:12:45] (PS1) STran: Update action parameter for bulk blocking instrumented events [extensions/CheckUser] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1278381 (https://phabricator.wikimedia.org/T420517) [10:13:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T419635)', diff saved to https://phabricator.wikimedia.org/P91755 and previous config saved to /var/cache/conftool/dbconfig/20260428-101304-fceratto.json [10:13:10] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:13:10] SRE, Infrastructure-Foundations, netops: Network QoS: expand support to Nokia switches - https://phabricator.wikimedia.org/T424639 (cmooney) NEW p:Triage→Medium [10:13:22] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance [10:13:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2192 (T419635)', diff saved to https://phabricator.wikimedia.org/P91756 and previous config saved to 
/var/cache/conftool/dbconfig/20260428-101330-fceratto.json [10:13:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P91757 and previous config saved to /var/cache/conftool/dbconfig/20260428-101356-fceratto.json [10:14:19] RESOLVED: JobUnavailable: Reduced availability for job gnmic in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:14:50] (CR) Daniel Kinzler: [C:+2] rest-gateway: switch to Apr2026 rate limit policy [deployment-charts] - https://gerrit.wikimedia.org/r/1277709 (https://phabricator.wikimedia.org/T417779) (owner: Daniel Kinzler) [10:16:59] (Merged) jenkins-bot: rest-gateway: switch to Apr2026 rate limit policy [deployment-charts] - https://gerrit.wikimedia.org/r/1277709 (https://phabricator.wikimedia.org/T417779) (owner: Daniel Kinzler) [10:17:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2238: after reimage to trixie [10:19:06] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:20:01] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:20:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T419635)', diff saved to https://phabricator.wikimedia.org/P91759 and previous config saved to /var/cache/conftool/dbconfig/20260428-102011-fceratto.json [10:20:16] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:24:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T419961)', diff saved to https://phabricator.wikimedia.org/P91760 and previous config saved to /var/cache/conftool/dbconfig/20260428-102404-fceratto.json [10:24:21] !log 
daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:24:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:25:07] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:26:07] (PS1) VadymTS1: mediawikiwiki: Changetags right only for bots and administrators in MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1278382 (https://phabricator.wikimedia.org/T355445) [10:27:26] (PS1) Muehlenhoff: Revert "Depool puppetserver1002" [dns] - https://gerrit.wikimedia.org/r/1278384 [10:28:01] (CR) Tchanders: [C:+1] Update action parameter for bulk blocking instrumented events [extensions/CheckUser] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1278381 (https://phabricator.wikimedia.org/T420517) (owner: STran) [10:28:23] (CR) Tchanders: [C:+1] Update action parameter for bulk blocking instrumented events [extensions/CheckUser] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1278380 (https://phabricator.wikimedia.org/T420517) (owner: STran) [10:29:27] (PS1) Urbanecm: linkrecommendation: Bump version #2 [deployment-charts] - https://gerrit.wikimedia.org/r/1278385 (https://phabricator.wikimedia.org/T415312) [10:29:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:29:52] (CR) Muehlenhoff: [C:+2] Revert "Depool puppetserver1002" [dns] - https://gerrit.wikimedia.org/r/1278384 (owner: Muehlenhoff) [10:29:57] !log jmm@dns1004 START - running authdns-update [10:30:16] (CR) Urbanecm: [C:+2] linkrecommendation: Bump version #2 [deployment-charts] - https://gerrit.wikimedia.org/r/1278385 (https://phabricator.wikimedia.org/T415312) (owner: Urbanecm) [10:30:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P91761 and previous config saved to /var/cache/conftool/dbconfig/20260428-103019-fceratto.json [10:31:30] !log jmm@dns1004 END - running authdns-update [10:32:10] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:32:13] (Merged) jenkins-bot: linkrecommendation: Bump version #2 [deployment-charts] - https://gerrit.wikimedia.org/r/1278385 (https://phabricator.wikimedia.org/T415312) (owner: Urbanecm) [10:32:29] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:34:27] SRE, Infrastructure-Foundations, netops: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640 (cmooney) NEW p:Triage→Medium [10:34:33] SRE, Infrastructure-Foundations, netops: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640#11865094 (cmooney) [10:34:35] SRE, Infrastructure-Foundations, netops: Network QoS: expand support to Nokia switches - https://phabricator.wikimedia.org/T424639#11865095 (cmooney) [10:39:21] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:39:26] (PS1) Btullis: opensearch-cluster: Add a -bulk suffix to the list of SANs [deployment-charts] - https://gerrit.wikimedia.org/r/1278389 (https://phabricator.wikimedia.org/T424007) [10:39:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:39:36] (CR) CI reject: [V:-1] opensearch-cluster: Add a -bulk suffix to the list of SANs [deployment-charts] - https://gerrit.wikimedia.org/r/1278389 (https://phabricator.wikimedia.org/T424007) (owner: Btullis) [10:39:40] (PS2) Btullis: opensearch-cluster: Add a -bulk suffix to the list of SANs [deployment-charts] 
- https://gerrit.wikimedia.org/r/1278389 (https://phabricator.wikimedia.org/T424007) [10:39:54] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [10:39:58] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:40:23] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [10:40:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P91762 and previous config saved to /var/cache/conftool/dbconfig/20260428-104027-fceratto.json [10:40:32] (PS1) Ayounsi: gNMIc: use collect mode [puppet] - https://gerrit.wikimedia.org/r/1278390 (https://phabricator.wikimedia.org/T416360) [10:42:23] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [10:42:41] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [10:44:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:44:59] (CR) Btullis: [C:+2] opensearch-cluster: Add a -bulk suffix to the list of SANs [deployment-charts] - https://gerrit.wikimedia.org/r/1278389 (https://phabricator.wikimedia.org/T424007) (owner: Btullis) [10:45:00] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CheckUser] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1278381 (https://phabricator.wikimedia.org/T420517) (owner: STran) [10:45:42] SRE, Infrastructure-Foundations, netops: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640#11865146 (cmooney) [10:47:12] (Merged) jenkins-bot: opensearch-cluster: Add a -bulk suffix to the list of SANs 
[deployment-charts] - https://gerrit.wikimedia.org/r/1278389 (https://phabricator.wikimedia.org/T424007) (owner: Btullis) [10:48:13] SRE, Infrastructure-Foundations, netops: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640#11865157 (cmooney) [10:48:41] (PS1) Muehlenhoff: debmonitor: Switch to discovery2026 intermediate for Envoy [puppet] - https://gerrit.wikimedia.org/r/1278391 (https://phabricator.wikimedia.org/T420993) [10:48:58] SRE, Infrastructure-Foundations, netops: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41 - https://phabricator.wikimedia.org/T424640#11865159 (cmooney) [10:49:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:50:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T419635)', diff saved to https://phabricator.wikimedia.org/P91763 and previous config saved to /var/cache/conftool/dbconfig/20260428-105035-fceratto.json [10:50:45] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:50:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2201.codfw.wmnet with reason: Maintenance [10:53:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:57:11] (PS2) Ayounsi: gNMIc: use collect mode [puppet] - https://gerrit.wikimedia.org/r/1278390 (https://phabricator.wikimedia.org/T416360) [10:57:19] (PS3) Ayounsi: gNMIc: use collect mode [puppet] - https://gerrit.wikimedia.org/r/1278390 (https://phabricator.wikimedia.org/T416360) [10:57:23] (CR) Ayounsi: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1278390 (https://phabricator.wikimedia.org/T416360) (owner: 10Ayounsi) [10:57:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance [10:57:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T419635)', diff saved to https://phabricator.wikimedia.org/P91764 and previous config saved to /var/cache/conftool/dbconfig/20260428-105733-fceratto.json [10:57:36] (03PS4) 10Ayounsi: gNMIc: use collect mode [puppet] - 10https://gerrit.wikimedia.org/r/1278390 (https://phabricator.wikimedia.org/T416360) [10:57:39] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:57:39] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278390 (https://phabricator.wikimedia.org/T416360) (owner: 10Ayounsi) [10:59:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:00:19] (03CR) 10Ayounsi: [C:03+1] "lgtm but will leave the last call to o11y" [alerts] - 10https://gerrit.wikimedia.org/r/1277472 (owner: 10Cathal Mooney) [11:04:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:05:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T419635)', diff saved to https://phabricator.wikimedia.org/P91765 and previous config saved to /var/cache/conftool/dbconfig/20260428-110543-fceratto.json [11:05:48] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:09:28] (03PS2) 10Jelto: miscweb: add volumeMounts for wmf-navigator secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276737 (https://phabricator.wikimedia.org/T414405) [11:11:11] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2192 with weight 0 T424521', diff saved to https://phabricator.wikimedia.org/P91766 and previous config saved to /var/cache/conftool/dbconfig/20260428-111110-marostegui.json [11:11:15] T424521: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T424521 [11:11:24] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1277550 (https://phabricator.wikimedia.org/T424521) (owner: 10Gerrit maintenance bot) [11:11:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T424521 [11:14:12] (03CR) 10Muehlenhoff: [C:03+2] debmonitor: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278391 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [11:14:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:15:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P91767 and previous config saved to /var/cache/conftool/dbconfig/20260428-111551-fceratto.json [11:16:33] !log Starting s5 codfw failover from db2213 to db2192 - T424521 [11:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:37] T424521: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T424521 [11:16:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2192 to s5 primary T424521', diff saved to https://phabricator.wikimedia.org/P91768 and previous config saved to /var/cache/conftool/dbconfig/20260428-111658-marostegui.json [11:17:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2213 T424521', diff saved to https://phabricator.wikimedia.org/P91769 and previous config 
saved to /var/cache/conftool/dbconfig/20260428-111740-marostegui.json [11:19:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:20:17] (03CR) 10Jelto: [C:03+2] miscweb: add volumeMounts for wmf-navigator secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276737 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [11:20:18] (03PS1) 10Marostegui: db2213: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1278394 (https://phabricator.wikimedia.org/T424323) [11:20:20] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [11:20:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2213.codfw.wmnet with reason: Reimage to Trixie [11:20:46] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2213: Reimage to Trixie [11:20:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2213: Reimage to Trixie [11:21:11] (03CR) 10Marostegui: [C:03+2] db2213: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1278394 (https://phabricator.wikimedia.org/T424323) (owner: 10Marostegui) [11:21:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:22:53] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2213.codfw.wmnet with OS trixie [11:22:54] (03Merged) 10jenkins-bot: miscweb: add volumeMounts for wmf-navigator secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276737 
(https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [11:24:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:25:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P91770 and previous config saved to /var/cache/conftool/dbconfig/20260428-112558-fceratto.json [11:29:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:34:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:34:36] (03PS1) 10Marostegui: Revert "db2213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1278401 [11:34:39] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:36:07] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T419635)', diff saved to https://phabricator.wikimedia.org/P91772 and previous config saved to /var/cache/conftool/dbconfig/20260428-113606-fceratto.json [11:36:12] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:36:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2223.codfw.wmnet with reason: Maintenance [11:36:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T419635)', diff saved to https://phabricator.wikimedia.org/P91773 and previous config saved to /var/cache/conftool/dbconfig/20260428-113623-fceratto.json [11:38:18] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START 
helmfile.d/dse-k8s-services/opensearch-test: apply [11:38:25] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [11:38:43] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [11:38:47] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [11:39:18] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [11:39:48] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [11:39:50] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [11:39:56] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [11:39:56] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [11:40:08] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [11:40:12] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [11:40:23] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [11:40:26] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [11:40:33] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [11:43:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2213.codfw.wmnet with reason: host reimage [11:44:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T419635)', diff saved to https://phabricator.wikimedia.org/P91774 and previous config saved to /var/cache/conftool/dbconfig/20260428-114434-fceratto.json [11:44:39] FIRING: [3x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire 
- https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:44:40] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:49:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2213.codfw.wmnet with reason: host reimage [11:53:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:54:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:54:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P91775 and previous config saved to /var/cache/conftool/dbconfig/20260428-115442-fceratto.json [11:57:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:57:15] (03PS1) 10Gkyziridis: ml-services: Remove unused models from experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278418 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1200) [12:00:39] (03CR) 10Marostegui: [C:03+2] Revert "db2213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1278401 (owner: 10Marostegui) [12:01:45] (03CR) 10Gkyziridis: [C:03+2] ml-services: Remove unused models from experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278418 (owner: 10Gkyziridis) [12:03:39] (03Merged) 10jenkins-bot: ml-services: Remove unused models from experimental staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1278418 (owner: 10Gkyziridis) [12:03:54] (03PS1) 10Dpogorzelski: ml-serve: fix config flag [puppet] - 10https://gerrit.wikimedia.org/r/1278424 [12:04:21] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: fix config flag [puppet] - 10https://gerrit.wikimedia.org/r/1278424 (owner: 10Dpogorzelski) [12:04:37] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:04:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P91776 and previous config saved to /var/cache/conftool/dbconfig/20260428-120450-fceratto.json [12:11:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2213.codfw.wmnet with OS trixie [12:12:05] (03PS1) 10Muehlenhoff: apt/staging: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278426 (https://phabricator.wikimedia.org/T420993) [12:13:42] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts doh5001.wikimedia.org [12:14:06] (03PS1) 10Dpogorzelski: amd-gpu: move unit template to the right location [puppet] - 10https://gerrit.wikimedia.org/r/1278427 [12:14:25] (03CR) 10Dpogorzelski: [C:03+2] amd-gpu: move unit template to the right location [puppet] - 10https://gerrit.wikimedia.org/r/1278427 (owner: 10Dpogorzelski) [12:14:26] (03PS1) 10Marostegui: db1259,db2226: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1278428 (https://phabricator.wikimedia.org/T424615) [12:14:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:14:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T419635)', diff saved to https://phabricator.wikimedia.org/P91777 and previous config saved to 
/var/cache/conftool/dbconfig/20260428-121458-fceratto.json [12:15:03] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:15:13] (03CR) 10Marostegui: [C:03+2] db1259,db2226: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1278428 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui) [12:15:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2213: after reimage to trixie [12:15:16] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2228.codfw.wmnet with reason: Maintenance [12:15:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2226.codfw.wmnet with reason: Reimage to Trixie [12:15:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1259.eqiad.wmnet with reason: Reimage to Trixie [12:15:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2226: Reimage to Trixie [12:15:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1259: Reimage to Trixie [12:15:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2228 (T419635)', diff saved to https://phabricator.wikimedia.org/P91779 and previous config saved to /var/cache/conftool/dbconfig/20260428-121530-fceratto.json [12:15:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1259: Reimage to Trixie [12:15:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2226: Reimage to Trixie [12:17:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [12:17:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [12:17:19] PROBLEM - librenms.wikimedia.org tls expiry on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:17:21] PROBLEM - SSH on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:17:21] PROBLEM - librenms.wikimedia.org requires authentication on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:18:03] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2226.codfw.wmnet with OS trixie [12:18:20] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1259.eqiad.wmnet with OS trixie [12:18:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:19:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:20:09] RECOVERY - librenms.wikimedia.org tls expiry on netmon2002 is OK: OK - Certificate librenms.wikimedia.org will expire on Sun 12 Jul 2026 02:51:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:20:13] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:20:13] RECOVERY - SSH on netmon2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u9 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:20:13] RECOVERY - librenms.wikimedia.org requires authentication on netmon2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 701 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:22:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T419635)', diff saved to https://phabricator.wikimedia.org/P91782 and previous config saved to /var/cache/conftool/dbconfig/20260428-122211-fceratto.json [12:22:17] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:22:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:22:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:22:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:22:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh5001.wikimedia.org [12:23:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11865543 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: 
`doh5001.wikimedia.org` - doh5001.wikimedia.org (**PASS**)... [12:24:25] FIRING: SystemdUnitFailed: amd-smi-gpu-partition.service on ml-serve1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:26:55] (03PS1) 10Btullis: opensearch-cluster: Fix the SANs of the opensearch-wmf certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278430 (https://phabricator.wikimedia.org/T424007) [12:27:05] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Fix the SANs of the opensearch-wmf certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278430 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [12:27:06] (03PS2) 10Btullis: opensearch-cluster: Fix the SANs of the opensearch-wmf certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278430 (https://phabricator.wikimedia.org/T424007) [12:27:28] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:29:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:30:40] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[12:30:50] !log installing openjdk-21 security updates [12:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:59] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1199 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 UGood : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [12:31:01] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1199 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 UGood : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T424654 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [12:31:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T424654 (10ops-monitoring-bot) 03NEW [12:32:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P91785 and previous config saved to /var/cache/conftool/dbconfig/20260428-123220-fceratto.json [12:34:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11865616 (10Jgreen) >>! In T418928#11855299, @Jclark-ctr wrote: > https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1271631 > > It looks like we might need to us... 
[12:34:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2226.codfw.wmnet with reason: host reimage [12:34:54] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1259.eqiad.wmnet with reason: host reimage [12:36:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11865618 (10Jclark-ctr) I think the option is up to you I believe they are looking at rootwmf or something along those lines [12:36:47] (03CR) 10Brouberol: [C:03+1] opensearch-cluster: Fix the SANs of the opensearch-wmf certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278430 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [12:38:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2226.codfw.wmnet with reason: host reimage [12:39:14] (03PS1) 10Atsuko: admin: enable kerberos for daniel [puppet] - 10https://gerrit.wikimedia.org/r/1278433 [12:39:25] FIRING: [2x] SystemdUnitFailed: amd-smi-gpu-partition.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:40:25] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:40:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:40:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T419961)', diff saved to https://phabricator.wikimedia.org/P91786 and previous config saved to /var/cache/conftool/dbconfig/20260428-124042-fceratto.json [12:40:50] (03PS1) 10Dpogorzelski: amd-gpu: fix partitioning script and 
hiera [puppet] - 10https://gerrit.wikimedia.org/r/1278434 [12:41:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1259.eqiad.wmnet with reason: host reimage [12:41:05] (03CR) 10Dpogorzelski: [C:03+2] amd-gpu: fix partitioning script and hiera [puppet] - 10https://gerrit.wikimedia.org/r/1278434 (owner: 10Dpogorzelski) [12:41:13] (03CR) 10Brouberol: [C:03+1] "IIRC you will still need to manually provision the kerberos user on the krb hosts with the `manage_principals.py create` command" [puppet] - 10https://gerrit.wikimedia.org/r/1278433 (owner: 10Atsuko) [12:41:40] (03PS1) 10Novem Linguae: testwiki: allow sysops to add/remove electionadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278435 (https://phabricator.wikimedia.org/T423962) [12:42:28] (03CR) 10Jcrespo: [C:03+1] mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [12:42:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P91787 and previous config saved to /var/cache/conftool/dbconfig/20260428-124228-fceratto.json [12:42:33] (03CR) 10CI reject: [V:04-1] testwiki: allow sysops to add/remove electionadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278435 (https://phabricator.wikimedia.org/T423962) (owner: 10Novem Linguae) [12:42:34] (03PS9) 10Jcrespo: mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) [12:42:57] (03CR) 10Jcrespo: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [12:43:59] (03CR) 10Btullis: [C:03+2] opensearch-cluster: Fix the SANs of the opensearch-wmf certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278430 
(https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [12:44:25] FIRING: [3x] SystemdUnitFailed: amd-smi-gpu-partition.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:57] (03CR) 10Atsuko: [C:03+2] admin: enable kerberos for daniel [puppet] - 10https://gerrit.wikimedia.org/r/1278433 (owner: 10Atsuko) [12:46:12] (03Merged) 10jenkins-bot: opensearch-cluster: Fix the SANs of the opensearch-wmf certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278430 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [12:48:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T419961)', diff saved to https://phabricator.wikimedia.org/P91789 and previous config saved to /var/cache/conftool/dbconfig/20260428-124859-fceratto.json [12:49:25] FIRING: [4x] SystemdUnitFailed: amd-smi-gpu-partition.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:49:39] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:51:11] (03CR) 10Ottomata: [C:03+1] "One naming thought but otherwise LGTM!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [12:51:19] (03CR) 10Brouberol: [V:03+1 C:03+2] kafka-jumbo: deploy kafka 3.7 to all brokers [puppet] - 10https://gerrit.wikimedia.org/r/1278355 (https://phabricator.wikimedia.org/T424527) (owner: 10Brouberol) [12:52:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T419635)', diff saved to https://phabricator.wikimedia.org/P91790 and previous config saved to /var/cache/conftool/dbconfig/20260428-125236-fceratto.json [12:52:42] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:52:59] (03PS1) 10Dreamy Jazz: Resources: Define required message for 'oojs-ui-windows' module [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278442 (https://phabricator.wikimedia.org/T424653) [12:53:28] jouncebot: nowandnext [12:53:28] For the next 0 hour(s) and 6 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1200) [12:53:28] In 0 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1300) [12:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278442 (https://phabricator.wikimedia.org/T424653) (owner: 10Dreamy Jazz) [12:54:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:54:39] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:55:01] (03CR) 10JavierMonton: alerts: mw-page-html-content-change-enrich (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [12:55:13] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:56:48] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts doh5002.wikimedia.org [12:56:50] (03PS1) 10Marostegui: Revert "db1259,db2226: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1278444 [12:57:38] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11865724 (10Ladsgroup) [12:57:40] (03CR) 10Marostegui: [C:03+2] Revert "db1259,db2226: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1278444 (owner: 10Marostegui) [12:59:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P91791 and previous config saved to /var/cache/conftool/dbconfig/20260428-125907-fceratto.json [12:59:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1300). [13:00:05] Tran and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:10] o/ [13:00:13] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:00:15] I am also deploying on behalf of Dreamy_Jazz [13:00:24] Thanks again, off to walk the puppy [13:00:28] (03PS2) 10Novem Linguae: testwiki: allow sysops to add/remove electionadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278435 (https://phabricator.wikimedia.org/T423962) [13:00:38] Lucas_WMDE: I can just deploy? [13:00:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2213: after reimage to trixie [13:00:43] Tran: go ahead :) [13:01:07] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-ctrl100[45] implementation tracking - https://phabricator.wikimedia.org/T418920#11865767 (10MLechvien-WMF) a:05Clement_Goubert→03None This is unstalled now and ready to be picked up this quarter [13:01:22] (03CR) 10STran: [C:03+1] Resources: Define required message for 'oojs-ui-windows' module [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278442 (https://phabricator.wikimedia.org/T424653) (owner: 10Dreamy Jazz) [13:01:28] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:01:30] !log urbanecm@deploy1003 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [13:01:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2226.codfw.wmnet with OS trixie [13:02:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278381 (https://phabricator.wikimedia.org/T420517) (owner: 10STran) [13:02:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" 
[core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278442 (https://phabricator.wikimedia.org/T424653) (owner: 10Dreamy Jazz) [13:02:43] !log urbanecm@deploy1003 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [13:03:13] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1259.eqiad.wmnet with OS trixie [13:04:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2226: after reimage to trixie [13:04:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:04:34] (03Merged) 10jenkins-bot: Update action parameter for bulk blocking instrumented events [extensions/CheckUser] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278381 (https://phabricator.wikimedia.org/T420517) (owner: 10STran) [13:05:05] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [13:05:13] FIRING: [8x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:05:20] (03CR) 10Jcrespo: [C:03+2] mariadb: Set db2141 as a spare for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1276407 (https://phabricator.wikimedia.org/T418979) (owner: 10Jcrespo) [13:05:22] !log urbanecm@deploy1003 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [13:06:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.dns.netbox: doh5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:06:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:06:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:06:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh5002.wikimedia.org [13:06:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11865807 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `doh5002.wikimedia.org` - doh5002.wikimedia.org (**PASS**)... [13:07:16] !log urbanecm@deploy1003 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [13:07:18] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1259: after reimage to trixie [13:07:49] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2141.codfw.wmnet with reason: decom [13:08:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11865813 (10MoritzMuehlenhoff) [13:09:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P91795 and previous config saved to /var/cache/conftool/dbconfig/20260428-130915-fceratto.json [13:09:25] FIRING: [3x] SystemdUnitFailed: amd-smi-gpu-partition.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[13:09:31] (03CR) 10Sohom Datta: [C:03+1] testwiki: allow sysops to add/remove electionadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278435 (https://phabricator.wikimedia.org/T423962) (owner: 10Novem Linguae) [13:11:37] Any chance of sneaking into the deployment window with https://gerrit.wikimedia.org/r/1278435 ? :) [13:12:38] !log remove ganeti5005 from eqsin cluster T421863 [13:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:42] T421863: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863 [13:13:23] !log jynus@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2141.codfw.wmnet [13:13:38] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts dborch1001.wikimedia.org [13:13:57] Sohom_Datta: feel free to add it [13:14:09] I’m on the fence about the “alphabetize” part but otherwise this sounds like it should be okay to deploy [13:14:18] (03PS1) 10Muehlenhoff: Remove ganeti5005 from the eqsin01 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1278451 (https://phabricator.wikimedia.org/T421863) [13:14:25] FIRING: [3x] SystemdUnitFailed: amd-smi-gpu-partition.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:54] (03Merged) 10jenkins-bot: Resources: Define required message for 'oojs-ui-windows' module [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278442 (https://phabricator.wikimedia.org/T424653) (owner: 10Dreamy Jazz) [13:14:56] PROBLEM - ganeti-confd running on ganeti5005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:14:56] PROBLEM - ganeti-noded running on ganeti5005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded 
https://wikitech.wikimedia.org/wiki/Ganeti [13:15:13] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.5 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:15:56] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [13:16:00] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [13:16:04] (03CR) 10Ayounsi: [C:03+1] Remove ganeti5005 from the eqsin01 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1278451 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [13:16:10] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:16:16] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1278381|Update action parameter for bulk blocking instrumented events (T420517)]], [[gerrit:1278442|Resources: Define required message for 'oojs-ui-windows' module (T424653)]] [13:16:22] T420517: Instrument bulk blocking of connected temporary accounts - https://phabricator.wikimedia.org/T420517 [13:16:23] T424653: "⧼ooui-dialog-process-back⧽" on hCaptcha error page - https://phabricator.wikimedia.org/T424653 [13:16:34] (03CR) 10Bking: [C:03+2] profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:16:49] jynus@cumin1003 decommission (PID 2114464) is awaiting input [13:16:56] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:17:03] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [13:17:17] (03PS7) 10Elukey: profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) [13:17:26] (03CR) 10Bking: [C:03+2] 
profile::opensearch::cirrus::server: move to a new pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277508 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:17:33] Lucas_WMDE: Added! [13:17:46] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [13:18:10] FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:45] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [13:19:23] (03PS1) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [13:19:25] RESOLVED: [3x] SystemdUnitFailed: amd-smi-gpu-partition.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T419961)', diff saved to https://phabricator.wikimedia.org/P91797 and previous config saved to /var/cache/conftool/dbconfig/20260428-131924-fceratto.json [13:19:32] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [13:19:34] (03PS2) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [13:19:46] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: 
Maintenance [13:19:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T419961)', diff saved to https://phabricator.wikimedia.org/P91798 and previous config saved to /var/cache/conftool/dbconfig/20260428-131953-fceratto.json [13:20:25] (03PS1) 10Dpogorzelski: amd-gpu: handle partitionining across dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1278454 [13:20:37] (03PS4) 10JavierMonton: alerts: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) [13:20:50] (03CR) 10Dpogorzelski: [C:03+2] amd-gpu: handle partitionining across dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1278454 (owner: 10Dpogorzelski) [13:22:05] !log stran@deploy1003 dreamyjazz, stran: Backport for [[gerrit:1278381|Update action parameter for bulk blocking instrumented events (T420517)]], [[gerrit:1278442|Resources: Define required message for 'oojs-ui-windows' module (T424653)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:22:16] T420517: Instrument bulk blocking of connected temporary accounts - https://phabricator.wikimedia.org/T420517 [13:22:16] T424653: "⧼ooui-dialog-process-back⧽" on hCaptcha error page - https://phabricator.wikimedia.org/T424653 [13:22:29] testing now [13:23:06] (03PS1) 10Jcrespo: mariadb: Remove the last references to db2141 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1278457 (https://phabricator.wikimedia.org/T424327) [13:23:43] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] testwiki: allow sysops to add/remove electionadmin (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278435 (https://phabricator.wikimedia.org/T423962) (owner: 10Novem Linguae) [13:23:55] !log jynus@cumin1003 START - Cookbook sre.dns.netbox [13:24:37] fceratto@cumin1003 decommission (PID 2115038) is awaiting input [13:25:23] tests look good, proceeding [13:25:28] !log stran@deploy1003 dreamyjazz, stran: Continuing with deployment [13:26:37] !log jynus@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:26:38] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2141.codfw.wmnet [13:26:50] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dborch1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [13:28:08] (03CR) 10Jcrespo: [C:03+2] mariadb: Remove the last references to db2141 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1278457 (https://phabricator.wikimedia.org/T424327) (owner: 10Jcrespo) [13:28:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T419961)', diff saved to https://phabricator.wikimedia.org/P91800 and previous config saved to /var/cache/conftool/dbconfig/20260428-132822-fceratto.json [13:28:33] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: dborch1001.wikimedia.org decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [13:28:33] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:28:34] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dborch1001.wikimedia.org [13:29:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:31:26] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278381|Update action parameter for bulk blocking instrumented events (T420517)]], [[gerrit:1278442|Resources: Define required message for 'oojs-ui-windows' module (T424653)]] (duration: 15m 10s) [13:31:37] T420517: Instrument bulk blocking of connected temporary accounts - https://phabricator.wikimedia.org/T420517 [13:31:38] T424653: "⧼ooui-dialog-process-back⧽" on hCaptcha error page - https://phabricator.wikimedia.org/T424653 [13:31:54] done, all you Sohom_Datta [13:32:45] I mean, I would need somebody to deploy for me :) I can't deploy on my own :( [13:33:07] I can deploy ^^ [13:33:18] thanks Tran! [13:33:28] thank you! 
👋 [13:33:31] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti5005 from the eqsin01 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1278451 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [13:33:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278435 (https://phabricator.wikimedia.org/T423962) (owner: 10Novem Linguae) [13:34:14] (03CR) 10JavierMonton: alerts: mw-page-html-content-change-enrich (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [13:34:25] (03CR) 10JavierMonton: [C:03+2] alerts: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [13:34:34] FIRING: [100x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:34:35] (03Merged) 10jenkins-bot: testwiki: allow sysops to add/remove electionadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278435 (https://phabricator.wikimedia.org/T423962) (owner: 10Novem Linguae) [13:34:39] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:34:58] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1278435|testwiki: allow sysops to add/remove electionadmin (T423962)]] [13:35:03] T423962: Admins should be able to grant election-admin role on testwiki - https://phabricator.wikimedia.org/T423962 [13:35:40] (03CR) 10Ssingh: [C:03+2] profile::hcaptcha: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277509 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:36:03] 
(03Merged) 10jenkins-bot: alerts: mw-page-html-content-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1276704 (https://phabricator.wikimedia.org/T423996) (owner: 10JavierMonton) [13:36:46] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, novemlinguae: Backport for [[gerrit:1278435|testwiki: allow sysops to add/remove electionadmin (T423962)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:36:47] (03PS1) 10Filippo Giunchedi: aptrepo: add updates for trixie-wikimedia + osbpo [puppet] - 10https://gerrit.wikimedia.org/r/1278472 (https://phabricator.wikimedia.org/T423598) [13:36:57] Sohom_Datta: please test [13:36:58] (03CR) 10Klausman: [C:03+1] ml-services: bump up k8s resources in llm ns to enable gpt isvc deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277934 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:38:24] (03PS3) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [13:38:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P91803 and previous config saved to /var/cache/conftool/dbconfig/20260428-133830-fceratto.json [13:38:32] (03PS4) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [13:38:33] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [13:38:42] Yep works! 
[13:38:49] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, novemlinguae: Continuing with deployment [13:38:53] thanks! [13:38:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11866001 (10Jgreen) @Jclark-ctr I thought I had the user/pass for this host but I'm unable to log in. Is it something other than ADMIN and the usual setup password? I get "unkno... [13:39:11] (03PS1) 10Andrew Bogott: Add openstack repos from debian.net to reprepro, take two [puppet] - 10https://gerrit.wikimedia.org/r/1278473 (https://phabricator.wikimedia.org/T423598) [13:39:31] (03PS7) 10Elukey: profile::mediabackup: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) [13:39:34] FIRING: [96x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:39:39] FIRING: [9x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:40:14] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [13:42:31] (03CR) 10Andrew Bogott: [C:03+1] aptrepo: add updates for trixie-wikimedia + osbpo [puppet] - 10https://gerrit.wikimedia.org/r/1278472 (https://phabricator.wikimedia.org/T423598) (owner: 10Filippo Giunchedi) [13:42:35] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278435|testwiki: allow sysops to add/remove electionadmin (T423962)]] (duration: 07m 37s) [13:42:40] T423962: Admins should be able to grant election-admin role on testwiki - 
https://phabricator.wikimedia.org/T423962 [13:42:50] (03Abandoned) 10Andrew Bogott: Add openstack repos from debian.net to reprepro, take two [puppet] - 10https://gerrit.wikimedia.org/r/1278473 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [13:43:10] (03CR) 10Muehlenhoff: [C:03+2] apt/staging: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278426 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [13:43:13] !log UTC afternoon backport+config window done [13:43:15] (03CR) 10Kevin Bazira: [C:03+2] ml-services: bump up k8s resources in llm ns to enable gpt isvc deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277934 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:05] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-d1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T424614#11866024 (10Jclark-ctr) a:03Jclark-ctr [13:44:08] (03CR) 10Andrew Bogott: [C:03+2] aptrepo: add updates for trixie-wikimedia + osbpo [puppet] - 10https://gerrit.wikimedia.org/r/1278472 (https://phabricator.wikimedia.org/T423598) (owner: 10Filippo Giunchedi) [13:44:52] (03PS5) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [13:45:00] (03PS6) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [13:45:02] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [13:46:10] 🎉 Thank you for doing the 
deployment Lucas_WMDE ! [13:47:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T424654#11866032 (10Jclark-ctr) [13:47:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T424654#11866034 (10Jclark-ctr) a:03Jclark-ctr [13:47:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T424654#11866044 (10Jclark-ctr) @btullis when is a good time to swap this disk [13:48:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P91804 and previous config saved to /var/cache/conftool/dbconfig/20260428-134838-fceratto.json [13:49:01] !log aokoth@cumin1003 START - Cookbook sre.hosts.reimage for host phab2003.codfw.wmnet with OS bullseye [13:49:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2226: after reimage to trixie [13:49:34] FIRING: [94x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:50:39] (03Merged) 10jenkins-bot: ml-services: bump up k8s resources in llm ns to enable gpt isvc deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277934 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:52:12] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11866061 (10Papaul) [13:52:25] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-d1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T424614#11866063 (10Jclark-ctr) Sensor: Phase, BA:L1-L2, Active Power Value: 1.705 kW (power) Thresholds: 
High: 1650 [13:52:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1259: after reimage to trixie [13:53:10] FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:53] (03CR) 10Bartosz Wójtowicz: [C:03+1] "Thank you for deploying!!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278182 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [13:54:56] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:54:57] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:57:04] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-d1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T424614#11866076 (10Jclark-ctr) Rebalanced pdu [13:58:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T419961)', diff saved to https://phabricator.wikimedia.org/P91807 and previous config saved to /var/cache/conftool/dbconfig/20260428-135847-fceratto.json [13:58:54] (03PS1) 10Ottomata: EventStreamConfig - Declare .v1 streams for html content and feature counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278476 (https://phabricator.wikimedia.org/T423920) [13:59:50] (03CR) 10CI reject: [V:04-1] EventStreamConfig - Declare .v1 streams for html content and feature counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278476 (https://phabricator.wikimedia.org/T423920) (owner: 10Ottomata) [14:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1400) [14:00:11] (03PS2) 10Ottomata: EventStreamConfig - Declare .v1 streams for html content and feature counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278476 
(https://phabricator.wikimedia.org/T423920) [14:00:41] (03PS3) 10Ottomata: EventStreamConfig - Declare .v1 streams for html content and feature counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278476 (https://phabricator.wikimedia.org/T423920) [14:00:53] (03PS1) 10Dpogorzelski: amd-gpu: handle service deadlock [puppet] - 10https://gerrit.wikimedia.org/r/1278477 [14:01:34] (03CR) 10CI reject: [V:04-1] EventStreamConfig - Declare .v1 streams for html content and feature counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278476 (https://phabricator.wikimedia.org/T423920) (owner: 10Ottomata) [14:01:38] (03CR) 10Dpogorzelski: [C:03+2] amd-gpu: handle service deadlock [puppet] - 10https://gerrit.wikimedia.org/r/1278477 (owner: 10Dpogorzelski) [14:04:34] FIRING: [94x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 4d 23h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [14:08:11] ACKNOWLEDGEMENT - dump of es6 in codfw on backupmon1001 is CRITICAL: Last dump for es6 at codfw (es2036) taken on 2026-04-28 00:00:04 is 42 GiB, but the previous one was 23 GiB, a change of +85.9 % Jcrespo expected after cluster split: T421729 - The acknowledgement expires at: 2026-05-12 14:07:46. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [14:08:12] ACKNOWLEDGEMENT - dump of es6 in eqiad on backupmon1001 is CRITICAL: Last dump for es6 at eqiad (es1036) taken on 2026-04-28 00:00:03 is 42 GiB, but the previous one was 23 GiB, a change of +85.0 % Jcrespo expected after cluster split: T421729 - The acknowledgement expires at: 2026-05-12 14:07:46. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [14:08:12] ACKNOWLEDGEMENT - dump of es7 in codfw on backupmon1001 is CRITICAL: Last dump for es7 at codfw (es2040) taken on 2026-04-28 00:00:04 is 42 GiB, but the previous one was 23 GiB, a change of +84.8 % Jcrespo expected after cluster split: T421729 - The acknowledgement expires at: 2026-05-12 14:07:46. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [14:08:12] ACKNOWLEDGEMENT - dump of es7 in eqiad on backupmon1001 is CRITICAL: Last dump for es7 at eqiad (es1040) taken on 2026-04-28 00:00:03 is 42 GiB, but the previous one was 23 GiB, a change of +85.7 % Jcrespo expected after cluster split: T421729 - The acknowledgement expires at: 2026-05-12 14:07:46. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [14:09:34] FIRING: [94x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:09:42] (03CR) 10Jcrespo: [C:03+1] profile::mediabackup: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:09:46] (03PS1) 10Btullis: Add packages.qlever.org to reprepro as thirdparty/qlever [puppet] - 10https://gerrit.wikimedia.org/r/1278479 (https://phabricator.wikimedia.org/T424340) [14:09:54] (03CR) 10Jcrespo: [C:03+1] "mediabackups are paused now" [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:11:01] (03CR) 10Jcrespo: [C:03+2] profile::mediabackup: move to the discovery2026 pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1277506 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:11:17] (03PS1) 10Elukey: Revert "envoyproxy: trigger the envoy's config re-creation if deleted" [puppet] - 10https://gerrit.wikimedia.org/r/1278480 [14:11:38] (03CR) 10Btullis: 
"check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278479 (https://phabricator.wikimedia.org/T424340) (owner: 10Btullis) [14:12:54] (03PS5) 10Andrew Bogott: Openstack: get osbpo packages from apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) [14:13:56] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [14:14:34] FIRING: [94x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:17:04] 10ops-eqiad, 06SRE, 06DC-Ops: verify cables - https://phabricator.wikimedia.org/T424601#11866143 (10VRiley-WMF) a:03VRiley-WMF [14:18:27] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [14:19:25] (03PS9) 10Arnaudb: envoyproxy: rebuild envoy.yaml when the placeholder is created [puppet] - 10https://gerrit.wikimedia.org/r/1275827 (https://phabricator.wikimedia.org/T421827) [14:19:34] FIRING: [94x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:21:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11866152 (10BTullis) Hi @VRiley-WMF - Yes, please feel free to swap this cable any time. If it's a short o... 
[14:24:44] (03CR) 10Andrew Bogott: [C:03+2] Designate: use zookeeper as the tooz backend, everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1277747 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [14:25:13] (03PS4) 10Elukey: services: Add TLS SANs to the evaluators' mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) [14:26:22] !log herron@cumin1003 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-logging-eqiad cluster: Change Confluent distribution. [14:26:29] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:26:36] PROBLEM - Kafka Broker Server #page on kafka-jumbo1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:27:01] here [14:27:07] !incidents [14:27:07] 7877 (UNACKED) kafka-jumbo1013/Kafka Broker Server (paged) [14:27:07] 7876 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Transit: Lumen (442550281) {#3867} xe-3/3/2 gnmi eqiad) [14:27:14] !ack 7877 [14:27:14] 7877 (ACKED) kafka-jumbo1013/Kafka Broker Server (paged) [14:27:38] (03CR) 10Herron: [V:03+1 C:03+2] kafka-logging: set eqiad (and all) brokers to confluent distro 77 [puppet] - 10https://gerrit.wikimedia.org/r/1277581 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [14:27:42] (03CR) 10Brouberol: opensearch-cluster: Add a new destinationrule for the bulk indexing service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [14:28:02] Here [14:28:29] RECOVERY - Kafka broker TLS certificate validity on kafka-jumbo1013 is OK: SSL 
OK - Certificate kafka-jumbo1013.eqiad.wmnet valid until 2026-08-23 08:42:00 +0000 (expires in 116 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:28:36] RECOVERY - Kafka Broker Server #page on kafka-jumbo1013 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:28:39] looks not very happy [14:28:42] but recovering [14:28:45] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kafka-jumbo1013&var-datasource=000000026&var-cluster=mysql&from=now-30m&to=now&timezone=utc [14:29:02] (03PS7) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [14:29:07] Amir1: brouberol is changing kafka distribution [14:29:11] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [14:29:15] (03PS8) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [14:29:16] elukey: so expected? [14:29:16] ah okay [14:29:30] yes that's me sorry [14:29:44] okay, thanks [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1430) [14:30:09] no worries. as long as it's not an actual issue, all good [14:30:09] :D [14:30:27] we're upgrading kafka for the first time in 8 years. It's worth it :D [14:30:49] just make sure the cpu governor is set correctly, I don't care about the rest [14:31:28] jokes aside, thank you for doing it. I'm really grateful [14:33:51] pleasure, and it's a team effort really! 
[14:35:57] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [14:36:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11866223 (10VRiley-WMF) 05Open→03Resolved This cable has been swapped out. [14:37:58] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host phab2003.codfw.wmnet with OS bullseye [14:38:17] (03CR) 10Elukey: [C:03+2] Revert "envoyproxy: trigger the envoy's config re-creation if deleted" [puppet] - 10https://gerrit.wikimedia.org/r/1278480 (owner: 10Elukey) [14:39:54] (03CR) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [14:42:51] (03PS9) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [14:43:01] (03PS10) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [14:43:02] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [14:44:34] FIRING: [95x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:47:16] !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=0) Change Confluent distribution for 
Kafka A:kafka-logging-eqiad cluster: Change Confluent distribution. [14:48:30] (03PS1) 10Gmodena: balzegraph: group alerts by instance [alerts] - 10https://gerrit.wikimedia.org/r/1278488 [14:49:34] FIRING: [94x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:50:08] (03CR) 10Andrew Bogott: "thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [14:50:18] (03PS6) 10Andrew Bogott: Openstack: get osbpo packages from apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) [14:50:31] (03CR) 10Brouberol: "LGTM but I'd rather not +1 too fast as I don't know enough about how our debian config works" [puppet] - 10https://gerrit.wikimedia.org/r/1278479 (https://phabricator.wikimedia.org/T424340) (owner: 10Btullis) [14:54:34] FIRING: [93x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:54:39] FIRING: [8x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:51] (03CR) 10Filippo Giunchedi: [C:03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [14:57:04] (03PS1) 10Muehlenhoff: puppetboard: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278491 (https://phabricator.wikimedia.org/T420993) [14:57:23] (03CR) 10Elukey: [C:03+1] puppetboard: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278491 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [14:59:28] 06SRE, 06Infrastructure-Foundations, 
06ServiceOps new, 10ServiceOps-Datastores, 13Patch-For-Review: Upgrade kafka-logging to version 3.7 - https://phabricator.wikimedia.org/T423723#11866341 (10herron) [14:59:34] FIRING: [93x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:59:58] (03PS1) 10Gmodena: blazegraph: group alerts by instance [alerts] - 10https://gerrit.wikimedia.org/r/1278493 (https://phabricator.wikimedia.org/T418708) [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: Time to do the SRE Collaboration Services office hours deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1500). [15:02:00] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy [15:02:30] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy [15:03:30] (03PS1) 10Herron: kafka-logging: set eqiad (and all) brokers to protocol 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1278489 (https://phabricator.wikimedia.org/T423723) [15:03:43] !log brennen@deploy1003 Started deploy [phabricator/deployment@ce0b865]: deploy phab2002 for T424656 [15:03:48] T424656: Deploy Phab/Phorge 2026-04-28 - https://phabricator.wikimedia.org/T424656 [15:03:50] (03CR) 10Muehlenhoff: [C:03+2] puppetboard: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278491 (https://phabricator.wikimedia.org/T420993) (owner: 10Muehlenhoff) [15:04:28] !log brennen@deploy1003 Finished deploy [phabricator/deployment@ce0b865]: deploy phab2002 for T424656 (duration: 00m 44s) [15:04:34] FIRING: [94x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:04:50] !log brennen@deploy1003 
Started deploy [phabricator/deployment@ce0b865]: deploy phab1004 for T424656 [15:05:42] !log brennen@deploy1003 Finished deploy [phabricator/deployment@ce0b865]: deploy phab1004 for T424656 (duration: 00m 52s) [15:07:06] (03CR) 10Elukey: [C:03+1] "yessssss" [puppet] - 10https://gerrit.wikimedia.org/r/1278489 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [15:09:34] FIRING: [93x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:09:48] (03CR) 10Herron: [V:03+1 C:03+2] kafka-logging: set eqiad (and all) brokers to protocol 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1278489 (https://phabricator.wikimedia.org/T423723) (owner: 10Herron) [15:11:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:14:17] (03PS1) 10Sbisson: testwiki: Enable Article Guidance extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278494 (https://phabricator.wikimedia.org/T417200) [15:14:34] FIRING: [93x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:14:36] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11866431 (10ayounsi) [15:14:44] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [15:14:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278494 
(https://phabricator.wikimedia.org/T417200) (owner: 10Sbisson) [15:15:50] (03PS4) 10Ottomata: EventStreamConfig - Declare .v1 streams for html content and feature counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278476 (https://phabricator.wikimedia.org/T423920) [15:16:18] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:16:49] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:17:47] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:18:17] (03CR) 10Mvolz: [C:03+2] "PMC vs wayback was a red herring, it just was that the wayback url I happened to test was an http one. Https works, http doesn't." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276385 (owner: 10Mvolz) [15:18:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278476 (https://phabricator.wikimedia.org/T423920) (owner: 10Ottomata) [15:18:20] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:18:55] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:19:03] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:19:13] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:19:16] (03Merged) 10jenkins-bot: EventStreamConfig - Declare .v1 streams for html content and feature counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278476 (https://phabricator.wikimedia.org/T423920) (owner: 10Ottomata) [15:19:29] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[15:19:39] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:19:44] !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1278476|EventStreamConfig - Declare .v1 streams for html content and feature counts (T423920)]] [15:19:48] T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920 [15:21:34] !log otto@deploy1003 otto: Backport for [[gerrit:1278476|EventStreamConfig - Declare .v1 streams for html content and feature counts (T423920)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:22:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:24:11] !log otto@deploy1003 otto: Continuing with deployment [15:28:05] !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278476|EventStreamConfig - Declare .v1 streams for html content and feature counts (T423920)]] (duration: 08m 21s) [15:28:09] T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920 [15:29:34] FIRING: [92x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:31:05] (03CR) 10Trueg: "I fear you have to explain this one in more detail. 
:/" [alerts] - 10https://gerrit.wikimedia.org/r/1278493 (https://phabricator.wikimedia.org/T418708) (owner: 10Gmodena) [15:32:36] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11866558 (10Ladsgroup) Since many of the containers have been failed half-way through or didn't get to run. Once this round is done, with something like: ` for i in {0..255}; do h=$(p... [15:34:13] !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [15:34:34] FIRING: [91x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:34:39] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:34:59] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade kafka-logging to version 3.7 - https://phabricator.wikimedia.org/T423723#11866570 (10herron) [15:37:10] (03PS2) 10MacFan4000: ExtensionDistributor: mark 1.46 as development [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278498 (https://phabricator.wikimedia.org/T423262) [15:39:15] (03CR) 10Elukey: "My soul is deeply sad about my mistake :(" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [15:39:34] FIRING: [91x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:42:49] (03PS1) 10Elukey: role::crm: move the pki intermediate to discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1278499 (https://phabricator.wikimedia.org/T420993) [15:42:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Tuesday, April 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277701 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [15:43:19] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278499 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:44:00] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11866634 (10MatthewVernon) [15:44:34] FIRING: [91x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:45:10] (03PS4) 10Dwisehaupt: alertmanager: add frack networks to iptables allow on 9093 [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) [15:45:40] (03PS1) 10Gkyziridis: ml-services: Deploy rr_multilingual latest version on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278500 (https://phabricator.wikimedia.org/T415892) [15:45:40] (03CR) 10CI reject: [V:04-1] alertmanager: add frack networks to iptables allow on 9093 [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) (owner: 10Dwisehaupt) [15:45:46] (03CR) 10Elukey: "no ok nevermind, the .discovery part is not added and I need probably to differenciate prod from staging, will send another patch!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [15:48:38] (03CR) 10Jforrester: [C:03+1] "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278498 (https://phabricator.wikimedia.org/T423262) (owner: 10MacFan4000) [15:48:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278498 (https://phabricator.wikimedia.org/T423262) (owner: 10MacFan4000) [15:48:55] (03CR) 10Gkyziridis: [C:03+2] "Merge and test latest rr-multilingual model on experimental. I will re-deploy the rest of the models after testing." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278500 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [15:50:26] (03PS11) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [15:50:43] 06SRE, 06Infrastructure-Foundations, 10netops: Network QoS: expand support to Nokia switches - https://phabricator.wikimedia.org/T424639#11866670 (10cmooney) [15:50:51] (03Merged) 10jenkins-bot: ml-services: Deploy rr_multilingual latest version on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278500 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [15:52:08] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [15:52:47] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[15:54:21] (03PS12) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [15:54:32] (03PS5) 10Dwisehaupt: alertmanager: add frack networks to iptables allow on 9093 [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) [15:54:34] FIRING: [92x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:55:04] (03PS3) 10Tiziano Fogli: rsyslog: forward thanos-query-frontend logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986) [15:55:05] (03PS3) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [15:55:05] (03PS1) 10Tiziano Fogli: logstash/filter: increase sockets-timeout for unit tests [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) [15:55:18] 06SRE, 06Infrastructure-Foundations, 10netops: Network QoS: expand support to Nokia switches - https://phabricator.wikimedia.org/T424639#11866721 (10cmooney) [15:55:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11866723 (10BTullis) Great! Did it fix the issue? [15:57:28] (03PS2) 10Elukey: role::crm: move the pki intermediate to discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1278499 (https://phabricator.wikimedia.org/T420993) [15:57:41] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278499 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:58:02] (03CR) 10Tiziano Fogli: "Thanks for the review and the helpful advice. 
Any comments you left should be fixed, but I didn’t mark them as resolved in case you want t" [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [15:58:53] (03PS1) 10Majavah: hieradata: cloudweb-dev: Use discovery2026 intermediary [puppet] - 10https://gerrit.wikimedia.org/r/1278502 (https://phabricator.wikimedia.org/T424675) [15:58:56] (03PS1) 10Majavah: hieradata: cloudweb: Use discovery2026 intermediary [puppet] - 10https://gerrit.wikimedia.org/r/1278503 (https://phabricator.wikimedia.org/T424675) [15:59:15] (03PS13) 10Btullis: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) [15:59:34] FIRING: [91x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:59:39] (03PS4) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [15:59:45] (03PS5) 10Elukey: services: Add TLS SANs to the evaluators' mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) [16:00:05] jhathaway and rzl: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:20] (03CR) 10Majavah: [C:03+2] hieradata: cloudweb-dev: Use discovery2026 intermediary [puppet] - 10https://gerrit.wikimedia.org/r/1278502 (https://phabricator.wikimedia.org/T424675) (owner: 10Majavah) [16:00:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11866745 (10VRiley-WMF) 05Open→03Resolved @MoritzMuehlenhoff @elukey this is completed [16:00:53] (03CR) 10Elukey: "Hopefully this time it makes more sense :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [16:03:19] (03CR) 10Dwisehaupt: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8480/co" [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) (owner: 10Dwisehaupt) [16:04:34] FIRING: [91x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:07:46] (03PS1) 10Andrew Bogott: Revert "Designate: use zookeeper as the tooz backend, everywhere" [puppet] - 10https://gerrit.wikimedia.org/r/1278505 [16:08:36] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2141.codfw.wmnet - https://phabricator.wikimedia.org/T424327#11866808 (10jcrespo) a:05jcrespo→03None [16:08:42] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2141.codfw.wmnet - https://phabricator.wikimedia.org/T424327#11866815 (10jcrespo) This is ready for dc ops. 
[16:08:45] (03PS1) 10Cwhite: logstash: use discovery2026 intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278506 (https://phabricator.wikimedia.org/T424673) [16:09:19] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:39] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:09:55] (03CR) 10CI reject: [V:04-1] Revert "Designate: use zookeeper as the tooz backend, everywhere" [puppet] - 10https://gerrit.wikimedia.org/r/1278505 (owner: 10Andrew Bogott) [16:09:58] (03PS1) 10Cwhite: graphite: use discovery2026 intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278507 (https://phabricator.wikimedia.org/T424673) [16:10:30] (03PS2) 10Andrew Bogott: Revert "Designate: use zookeeper as the tooz backend, everywhere" [puppet] - 10https://gerrit.wikimedia.org/r/1278505 [16:10:56] (03PS1) 10Cwhite: webperf: use discovery2026 intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278508 (https://phabricator.wikimedia.org/T424673) [16:12:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11866841 (10MoritzMuehlenhoff) Thanks! I'll create a followup task for the cluster integration. 
[16:13:00] (03CR) 10Dwisehaupt: [V:03+1] alertmanager: add frack networks to iptables allow on 9093 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) (owner: 10Dwisehaupt) [16:13:15] (03CR) 10Andrew Bogott: [C:03+2] Revert "Designate: use zookeeper as the tooz backend, everywhere" [puppet] - 10https://gerrit.wikimedia.org/r/1278505 (owner: 10Andrew Bogott) [16:13:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[56] - https://phabricator.wikimedia.org/T419892#11866850 (10Jclark-ctr) a:03Jclark-ctr [16:14:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti105[5678] - https://phabricator.wikimedia.org/T424680 (10MoritzMuehlenhoff) 03NEW [16:14:21] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394#11866869 (10Jclark-ctr) a:03Jclark-ctr [16:14:34] FIRING: [92x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:14:57] (03CR) 10Hnowlan: [C:03+1] graphite: use discovery2026 intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278507 (https://phabricator.wikimedia.org/T424673) (owner: 10Cwhite) [16:15:11] (03CR) 10Hnowlan: [C:03+1] webperf: use discovery2026 intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278508 (https://phabricator.wikimedia.org/T424673) (owner: 10Cwhite) [16:15:26] (03CR) 10Hnowlan: [C:03+1] logstash: use discovery2026 intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278506 (https://phabricator.wikimedia.org/T424673) (owner: 10Cwhite) [16:15:29] (03CR) 10Btullis: [C:03+2] opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [16:16:00] (03CR) 10Cwhite: [C:03+2] webperf: use discovery2026 intermediate 
[puppet] - 10https://gerrit.wikimedia.org/r/1278508 (https://phabricator.wikimedia.org/T424673) (owner: 10Cwhite) [16:16:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:23] PROBLEM - librenms.wikimedia.org tls expiry on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:17:31] PROBLEM - SSH on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:17:33] PROBLEM - librenms.wikimedia.org requires authentication on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:17:39] (03Merged) 10jenkins-bot: opensearch-cluster: Add a new destinationrule for the bulk indexing service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278453 (https://phabricator.wikimedia.org/T424007) (owner: 10Btullis) [16:18:13] RECOVERY - librenms.wikimedia.org tls expiry on netmon2002 is OK: OK - Certificate librenms.wikimedia.org will expire on Sun 12 Jul 2026 02:51:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:18:21] RECOVERY - SSH on netmon2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u9 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:18:23] RECOVERY - librenms.wikimedia.org requires authentication on netmon2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 701 bytes in 0.168 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:18:34] (03PS1) 10DCausse: Completion: fix near match field name [extensions/WikibaseCirrusSearch] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278509 (https://phabricator.wikimedia.org/T417648) [16:19:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikibaseCirrusSearch] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278509 (https://phabricator.wikimedia.org/T417648) (owner: 10DCausse) [16:19:34] FIRING: [91x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:20:49] ottomata: Mind if I use change 1278476 (already deployed) for some spiderpig testing? 
[16:20:59] (03Abandoned) 10Jdlrobson: Don't set href for a link that has been unset [extensions/GrowthExperiments] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1275543 (https://phabricator.wikimedia.org/T422907) (owner: 10Jdlrobson) [16:21:23] PROBLEM - librenms.wikimedia.org tls expiry on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:21:31] PROBLEM - SSH on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:21:33] PROBLEM - librenms.wikimedia.org requires authentication on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:21:44] (03CR) 10Jgreen: [C:03+1] role::crm: move the pki intermediate to discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1278499 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [16:21:50] (03PS2) 10Cwhite: graphite: use discovery2026 intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278507 (https://phabricator.wikimedia.org/T424673) [16:21:58] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:22:06] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:22:08] m n [16:22:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:22:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:22:30] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [16:22:37] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [16:22:44] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [16:22:48] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search-test: apply [16:22:56] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [16:22:59] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11866924 (10MatthewVernon) [16:23:03] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [16:23:10] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [16:23:14] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [16:23:22] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:23:29] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:23:35] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:23:39] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:23:43] (03CR) 10Kevin Bazira: [C:03+2] ml-services: deploy gpt isvc in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278182 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [16:23:48] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [16:23:55] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [16:24:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [16:24:05] !log 
btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [16:24:23] (03CR) 10Cwhite: [C:03+2] graphite: use discovery2026 intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278507 (https://phabricator.wikimedia.org/T424673) (owner: 10Cwhite) [16:24:23] (03CR) 10Muehlenhoff: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) (owner: 10Dwisehaupt) [16:25:46] (03Merged) 10jenkins-bot: ml-services: deploy gpt isvc in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278182 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [16:29:08] !log dancy@deploy1003 Installing scap version "4.254.0" for 2 host(s) [16:29:19] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:29:35] FIRING: [91x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:29:40] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:29:43] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 37617024 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:30:43] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:30:59] !log dancy@deploy1003 Installation of scap version "4.254.0" completed for 2 hosts [16:33:34] (03CR) 10Elukey: 
[C:03+2] role::crm: move the pki intermediate to discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1278499 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [16:34:19] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:34] FIRING: [91x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:35:46] (03CR) 10Cwhite: [C:03+2] logstash: use discovery2026 intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278506 (https://phabricator.wikimedia.org/T424673) (owner: 10Cwhite) [16:37:32] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11867023 (10Papaul) [16:39:34] FIRING: [91x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:41:58] (03PS1) 10Jdlrobson: Provide support for upright in thumbnails for older browsers [core] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278513 (https://phabricator.wikimedia.org/T424596) [16:44:34] FIRING: [91x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:44:39] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:48:02] 06SRE, 06Infrastructure-Foundations, 06ServiceOps new, 10ServiceOps-Datastores: Upgrade kafka-logging to version 3.7 - https://phabricator.wikimedia.org/T423723#11867057 (10herron) 05Open→03Resolved All 
kafka-logging brokers have been upgraded to 3.7 [16:49:24] (03PS1) 10Jasmine: parsoid/testreduce: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278515 (https://phabricator.wikimedia.org/T424671) [16:49:34] FIRING: [89x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:52:06] (03PS2) 10Arlolra: Deploy PRV to 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) [16:54:34] FIRING: [90x] CertAlmostExpired: Certificate for service logstash1023:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:54:50] (03PS1) 10Jasmine: deployment_server: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278516 (https://phabricator.wikimedia.org/T424671) [16:55:16] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11867079 (10ssingh) >>! In T408892#11749076, @ayounsi wrote: > As a side note we will need to manually change the IPs of the routed ganeti nodes in rack 23 to... [16:58:54] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-04-24 - 2026-05-15), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11867088 (10BTullis) I believe that these SLIs and SLOs are now defined at: https://wik... [16:59:34] FIRING: [86x] CertAlmostExpired: Certificate for service logstash1024:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:00:05] jasmine_: May I have your attention please! MediaWiki infrastructure (UTC late). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1700) [17:03:37] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394#11867123 (10Jclark-ctr) @Andrew This server can’t go in C8 that switch only supports 1G/10G. @ayounsi Can you confirm I’m right? I believe the only WMCS racks that support 25G are E4 and F4, and... [17:03:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Inbound errors on interface lsw1-d4-eqiad:ethernet-1/19 (an-worker1230 {#5330}) - https://phabricator.wikimedia.org/T423757#11867124 (10VRiley-WMF) It'll take time in order to see if it fixes it. We'll see if it doesn't throw... [17:04:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[56] - https://phabricator.wikimedia.org/T419892#11867127 (10Jclark-ctr) @Andrew This server can’t go in D5 that switch only supports 1G/10G. @ayounsi Can you confirm I’m right? I believe the only WMCS ra... 
[17:04:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:09:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:09:39] 06SRE, 06Infrastructure-Foundations, 10netops: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683 (10cmooney) 03NEW p:05Triage→03Medium [17:11:09] (03PS1) 10Btullis: Configure dse-k8s-worker nodes for ipip encapsulation [puppet] - 10https://gerrit.wikimedia.org/r/1278519 (https://phabricator.wikimedia.org/T420437) [17:11:18] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278519 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [17:12:01] 06SRE, 06Infrastructure-Foundations, 10netops: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11867159 (10cmooney) [17:13:32] RECOVERY - librenms.wikimedia.org requires authentication on netmon2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 701 bytes in 7.423 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:14:13] RECOVERY - librenms.wikimedia.org tls expiry on netmon2002 is OK: OK - Certificate librenms.wikimedia.org will expire on Sun 12 Jul 2026 02:51:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:14:23] RECOVERY - SSH on netmon2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u9 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:14:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:15:10] !log jasmine@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1014.eqiad.wmnet [17:15:24] !log jasmine@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host rdb1014.eqiad.wmnet [17:16:04] !log jasmine@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1014.eqiad.wmnet [17:18:31] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:19:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:22:57] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1014.eqiad.wmnet [17:23:50] !log jasmine@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2008.codfw.wmnet [17:24:19] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:24:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:24:39] FIRING: [7x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:26:09] (03PS1) 10AOkoth: phabricator: replace phab2002 with phab2003 [puppet] - 
10https://gerrit.wikimedia.org/r/1278521 (https://phabricator.wikimedia.org/T423727) [17:28:45] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.79 ms [17:29:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:30:33] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2008.codfw.wmnet [17:31:23] !log jasmine@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1013.eqiad.wmnet [17:34:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:35:09] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:37:31] FIRING: RedisReplicaDown: Redis replica down rdb1014:16379 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-site=eqiad&var-job=redis_misc&var-instance=rdb1014:16379 - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [17:37:54] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1013.eqiad.wmnet [17:38:06] !log jasmine@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2007.codfw.wmnet [17:38:43] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 184483312 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:39:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:39:43] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3599624 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring 
[17:40:17] (03CR) 10Andrew Bogott: [C:03+2] hieradata: cloudweb: Use discovery2026 intermediary [puppet] - 10https://gerrit.wikimedia.org/r/1278503 (https://phabricator.wikimedia.org/T424675) (owner: 10Majavah) [17:42:31] RESOLVED: RedisReplicaDown: Redis replica down rdb1014:16379 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-site=eqiad&var-job=redis_misc&var-instance=rdb1014:16379 - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [17:44:01] FIRING: [2x] RedisReplicaDown: Redis replica down rdb1014:16379 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [17:44:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:45:53] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2007.codfw.wmnet [17:46:51] !log jasmine@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet [17:48:25] (03CR) 10Dzahn: "we should first test this role on a bookworm instance" [puppet] - 10https://gerrit.wikimedia.org/r/1278521 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [17:49:01] RESOLVED: [2x] RedisReplicaDown: Redis replica down rdb1014:16379 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [17:49:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:49:39] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1015:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:52:48] (03CR) 10Dzahn: "let's:" 
[puppet] - 10https://gerrit.wikimedia.org/r/1278521 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [17:53:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:54:39] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:55:21] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet [17:55:39] !log jasmine@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet [17:56:13] !log aokoth@cumin1003 START - Cookbook sre.hosts.reimage for host phab2003.codfw.wmnet with OS bookworm [17:56:31] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:56:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:57:43] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 338178712 and 127 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:59:36] (03PS1) 10Andrew Bogott: zookeeper: allow overriding the zookeeper host ID [puppet] - 10https://gerrit.wikimedia.org/r/1278524 [17:59:39] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - 
https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:00:05] thcipriani and thcipriani: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T1800). [18:00:05] (03CR) 10CI reject: [V:04-1] zookeeper: allow overriding the zookeeper host ID [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (owner: 10Andrew Bogott) [18:00:12] thcipriani: apologies, infra window reboots going a lil over the scheduled window - wrapping up momentarily [18:02:08] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2010.codfw.wmnet [18:02:33] !log jasmine@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1011.eqiad.wmnet [18:02:55] (03PS2) 10Andrew Bogott: zookeeper: allow overriding the zookeeper host ID [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) [18:04:34] FIRING: [62x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:04:50] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 4d 19h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [18:07:21] (03CR) 10Jdlrobson: [C:03+1] Enable the reading list beta feature survey on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277667 (https://phabricator.wikimedia.org/T421776) (owner: 10Stoyofuku-wmf) [18:08:28] (03PS3) 10Dzahn: gerrit: allow zuul machines to port 22 ssh (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1271042 
(https://phabricator.wikimedia.org/T395938) [18:08:32] (03PS3) 10Andrew Bogott: zookeeper: allow overriding the zookeeper host ID [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) [18:09:02] (03CR) 10CI reject: [V:04-1] zookeeper: allow overriding the zookeeper host ID [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:09:06] (03CR) 10Dzahn: "Do we need this one? I had it in my "WIP" section." [puppet] - 10https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:09:10] (03PS4) 10Dzahn: gerrit: allow zuul machines to port 22 ssh [puppet] - 10https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) [18:09:11] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1011.eqiad.wmnet [18:09:34] FIRING: [61x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:09:49] thcipriani: all clear, feel free to proceed [18:09:57] (03CR) 10Dzahn: "Tyler, we talked about needing this in our last meeting, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:10:30] (03PS4) 10Andrew Bogott: zookeeper: allow overriding the zookeeper host ID [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) [18:10:38] (03CR) 10Dzahn: "afair this was the only thing that kept it from picking up production jobs for real" [puppet] - 10https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:10:56] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:11:24] !log aokoth@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on phab2003.codfw.wmnet with reason: host reimage [18:14:32] (03CR) 10Dzahn: "sorry, confused. parts of these comments were about https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275537/1/modules/profile/manifes" [puppet] - 10https://gerrit.wikimedia.org/r/1271042 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:15:11] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab2003.codfw.wmnet with reason: host reimage [18:15:33] !log jasmine@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb2009.codfw.wmnet [18:15:44] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 110024 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:16:34] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 0%, RTA = 650.90 ms [18:16:37] (03PS2) 10Dzahn: integration: switch integration-agent-docker VMs to Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) [18:16:54] (03CR) 10Dzahn: "I am not sure yet when / if this is ready to be merged. Can you let me know?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1271032 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [18:18:07] (03PS1) 10Andrew Bogott: Designate: use zookeeper as the tooz backend, everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T422646) [18:18:20] (03CR) 10Dzahn: [V:03+1 C:04-1] "This is for the day we switch to new jenkins hosts. We talked about how we have to disable jenkins on the legacy host then." [puppet] - 10https://gerrit.wikimedia.org/r/1273919 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [18:18:47] (03CR) 10CI reject: [V:04-1] Designate: use zookeeper as the tooz backend, everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:19:43] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Lumen 10G transport 442550293 disconnection - https://phabricator.wikimedia.org/T424758 (10RobH) 03NEW [18:20:00] (03PS2) 10Andrew Bogott: Designate: use zookeeper as the tooz backend, everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T422646) [18:21:48] (03PS1) 10Dzahn: planet: use discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1278530 (https://phabricator.wikimedia.org/T424669) [18:21:57] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2009.codfw.wmnet [18:22:31] FIRING: RedisReplicaDown: Redis replica down rdb2010:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-site=codfw&var-job=redis_misc&var-instance=rdb2010:16378 - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [18:22:58] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:16] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278528 
(https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [18:24:12] (03PS1) 10Dzahn: site: remove planet[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/1278531 [18:26:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 598747416 and 50 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:26:47] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Lumen 10G transport 442550293 disconnection - https://phabricator.wikimedia.org/T424758#11868220 (10RobH) @Papaul: Please advise what the exact patch panel port https://netbox.wikimedia.org/circuits/circuits/103/ lands on before I request a cross con... [18:27:31] RESOLVED: RedisReplicaDown: Redis replica down rdb2010:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-site=codfw&var-job=redis_misc&var-instance=rdb2010:16378 - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [18:27:50] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts planet1004.eqiad.wmnet [18:28:00] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 224.62 ms [18:28:44] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3069656 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:29:34] FIRING: [60x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:30:22] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278533 (https://phabricator.wikimedia.org/T423877) [18:30:22] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1278530/8486/planet1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1278530 
(https://phabricator.wikimedia.org/T424669) (owner: 10Dzahn) [18:30:25] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278533 (https://phabricator.wikimedia.org/T423877) (owner: 10TrainBranchBot) [18:31:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 978647696 and 77 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:32:39] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:33:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277667 (https://phabricator.wikimedia.org/T421776) (owner: 10Stoyofuku-wmf) [18:33:32] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host phab2003.codfw.wmnet with OS bookworm [18:33:44] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 319680 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:35:43] (03CR) 10Andrew Bogott: [C:03+2] Openstack: get osbpo packages from apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1276011 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [18:36:29] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: planet1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [18:37:28] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: planet1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [18:37:28] !log dzahn@cumin2002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [18:37:30] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts planet1004.eqiad.wmnet [18:38:26] (03CR) 10Dzahn: [C:03+2] site: remove planet[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/1278531 (owner: 10Dzahn) [18:39:44] (03CR) 10Dzahn: [C:03+2] "https://phabricator.wikimedia.org/T424763" [puppet] - 10https://gerrit.wikimedia.org/r/1278531 (owner: 10Dzahn) [18:44:34] FIRING: [60x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:44:59] (03PS1) 10Dzahn: zuul: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1278534 (https://phabricator.wikimedia.org/T424669) [18:46:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [18:47:32] (03CR) 10Xcollazo: [V:03+1 C:03+1] "Confirmed via contact email that this is indeed the volunteer and that they want this change." [puppet] - 10https://gerrit.wikimedia.org/r/1277254 (owner: 10Harej) [18:47:50] (03CR) 10Xcollazo: [V:03+1 C:03+1] "@btullis@wikimedia.org can you please merge?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1277254 (owner: 10Harej) [18:49:34] FIRING: [60x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:51:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [18:51:46] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 570528992 and 37 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:54:34] FIRING: [60x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:56:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:59:34] FIRING: [60x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:01:45] (03PS1) 10DLynch: ContentBranchNodeCheck: cope with null actions [extensions/VisualEditor] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278538 (https://phabricator.wikimedia.org/T424416) [19:02:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/VisualEditor] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278538 (https://phabricator.wikimedia.org/T424416) (owner: 10DLynch) [19:05:44] RECOVERY - Postgres 
Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 23096 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:06:54] (03PS1) 10Herron: prometheus: switch to discovery2026 for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278527 (https://phabricator.wikimedia.org/T420993) [19:07:47] (03PS2) 10Herron: prometheus: switch to discovery2026 for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278527 (https://phabricator.wikimedia.org/T420993) [19:08:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 67340032 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:09:44] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:11:53] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11868360 (10Eevans) [19:11:56] (03PS1) 10Andrew Bogott: Fix keyfile for wikimedia openstack apt components [puppet] - 10https://gerrit.wikimedia.org/r/1278543 (https://phabricator.wikimedia.org/T423598) [19:13:50] (03PS2) 10Andrew Bogott: Fix keyfile for wikimedia openstack apt components [puppet] - 10https://gerrit.wikimedia.org/r/1278543 (https://phabricator.wikimedia.org/T423598) [19:14:54] (03CR) 10Andrew Bogott: [C:03+2] Fix keyfile for wikimedia openstack apt components [puppet] - 10https://gerrit.wikimedia.org/r/1278543 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [19:16:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 65169072 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:18:44] RECOVERY - Postgres 
Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2974568 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:20:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:22:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1000903304 and 112 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:23:43] (03CR) 10Dzahn: [C:03+2] zuul: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1278534 (https://phabricator.wikimedia.org/T424669) (owner: 10Dzahn) [19:26:44] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 24088 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:26:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:28:12] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd1054 - https://phabricator.wikimedia.org/T416395#11868388 (10Jclark-ctr) @Andrew This server can’t go in C8 that switch only supports 1G/10G. @ayounsi Can you confirm I’m right? I believe the only WMCS racks that support 25G are E4 and F4, and t... 
[19:29:34] FIRING: [60x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:34:34] FIRING: [60x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:34:41] (03PS1) 10Eevans: restbase: migrate envoy TLS proxy services to new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278554 (https://phabricator.wikimedia.org/T424674) [19:37:13] (03PS2) 10Eevans: restbase: migrate envoy TLS proxy services to new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278554 (https://phabricator.wikimedia.org/T424674) [19:37:27] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278554 (https://phabricator.wikimedia.org/T424674) (owner: 10Eevans) [19:41:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:46:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 63505728 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:47:44] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 8285024 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:50:56] (03PS2) 10Herron: titan: switch to discovery2026 for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278546 (https://phabricator.wikimedia.org/T420993) [19:52:56] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278533 (https://phabricator.wikimedia.org/T423877) (owner: 10TrainBranchBot) 
[20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T2000). [20:00:05] stephanebisson, ebernhardson, James_F, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] o/ [20:00:11] \o [20:00:13] o/ [20:00:17] o/ [20:00:37] I can’t deploy, sorry. But can verify. [20:00:54] I propose to get started, as the first one on the list, unless someone has something super urgent. [20:01:12] +1 [20:01:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278494 (https://phabricator.wikimedia.org/T417200) (owner: 10Sbisson) [20:03:27] (03Merged) 10jenkins-bot: testwiki: Enable Article Guidance extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278494 (https://phabricator.wikimedia.org/T417200) (owner: 10Sbisson) [20:04:17] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1278494|testwiki: Enable Article Guidance extension (T417200)]] [20:04:23] T417200: Deploy Article Guidance extension to production (testwiki) - https://phabricator.wikimedia.org/T417200 [20:05:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 197825824 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:06:09] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1278494|testwiki: Enable Article Guidance extension (T417200)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:06:44] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3433016 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:07:37] (03PS1) 10AKhatun: alerts: mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) [20:08:18] (03PS1) 10Bking: wdqs: remove references to defunct role wdqs::internal [puppet] - 10https://gerrit.wikimedia.org/r/1278561 (https://phabricator.wikimedia.org/T420993) [20:08:32] (03PS1) 10Ryan Kemper: wdqs: drop dangling query-legacy-full helm refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278562 (https://phabricator.wikimedia.org/T415073) [20:09:34] FIRING: [60x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:09:35] (03CR) 10CI reject: [V:04-1] ContentBranchNodeCheck: cope with null actions [extensions/VisualEditor] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278538 (https://phabricator.wikimedia.org/T424416) (owner: 10DLynch) [20:09:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 551661976 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:10:56] !log sbisson@deploy1003 sbisson: Continuing with deployment [20:11:11] I am available to help deploy backports if needed [20:12:44] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 71600 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:13:14] (03CR) 10DLynch: "recheck" [extensions/VisualEditor] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278538 (https://phabricator.wikimedia.org/T424416) (owner: 10DLynch) [20:14:34] FIRING: [60x] CertAlmostExpired: Certificate for service 
prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:14:44] (CR) AKhatun: alerts: mw-page-html-feature-counts-change-enrich (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) (owner: AKhatun) [20:14:51] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278494|testwiki: Enable Article Guidance extension (T417200)]] (duration: 10m 34s) [20:14:55] T417200: Deploy Article Guidance extension to production (testwiki) - https://phabricator.wikimedia.org/T417200 [20:15:32] I'm done. Over to you ebernhardson [20:15:36] James_F: i can probably ship your config patch with mine? [20:19:34] FIRING: [60x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:20:04] ebernhardson: Sure. [20:20:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:20:19] (CR) TrainBranchBot: [C:+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1277701 (https://phabricator.wikimedia.org/T407432) (owner: Ebernhardson) [20:20:19] (CR) TrainBranchBot: [C:+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1278498 (https://phabricator.wikimedia.org/T423262) (owner: MacFan4000) [20:20:52] (CR) Cwhite: [C:+1] prometheus: switch to discovery2026 for envoy [puppet] - https://gerrit.wikimedia.org/r/1278527 (https://phabricator.wikimedia.org/T420993) (owner: Herron) [20:21:16]
(CR) Cwhite: [C:+1] titan: switch to discovery2026 for envoy [puppet] - https://gerrit.wikimedia.org/r/1278546 (https://phabricator.wikimedia.org/T420993) (owner: Herron) [20:21:59] (Merged) jenkins-bot: cirrus: AB test query suggester variants [mediawiki-config] - https://gerrit.wikimedia.org/r/1277701 (https://phabricator.wikimedia.org/T407432) (owner: Ebernhardson) [20:22:09] (Merged) jenkins-bot: ExtensionDistributor: mark 1.46 as development [mediawiki-config] - https://gerrit.wikimedia.org/r/1278498 (https://phabricator.wikimedia.org/T423262) (owner: MacFan4000) [20:22:26] ops-eqiad, SRE, DC-Ops: Alert for device ps1-d1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T424614#11868579 (Jclark-ctr) Open→Resolved [20:22:33] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1277701|cirrus: AB test query suggester variants (T407432)]], [[gerrit:1278498|ExtensionDistributor: mark 1.46 as development (T423262)]] [20:22:39] T407432: Follow-up AB test of dym language model variants - https://phabricator.wikimedia.org/T407432 [20:22:39] T423262: Add REL1_46 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T423262 [20:22:53] (PS1) Mstyles: miscweb: updated image for security landing page [deployment-charts] - https://gerrit.wikimedia.org/r/1278570 (https://phabricator.wikimedia.org/T423940) [20:24:25] !log ebernhardson@deploy1003 ebernhardson, macfan4000: Backport for [[gerrit:1277701|cirrus: AB test query suggester variants (T407432)]], [[gerrit:1278498|ExtensionDistributor: mark 1.46 as development (T423262)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:24:39] FIRING: [4x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:24:55] James_F: should be available to verify [20:24:56] ebernhardson: Confirmed good from my end. [20:24:59] kk [20:25:08] Thanks! [20:25:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:25:21] !log ebernhardson@deploy1003 ebernhardson, macfan4000: Continuing with deployment [20:25:39] (03CR) 10Herron: [V:03+1 C:03+2] titan: switch to discovery2026 for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278546 (https://phabricator.wikimedia.org/T420993) (owner: 10Herron) [20:29:08] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277701|cirrus: AB test query suggester variants (T407432)]], [[gerrit:1278498|ExtensionDistributor: mark 1.46 as development (T423262)]] (duration: 06m 35s) [20:29:19] T407432: Follow-up AB test of dym language model variants - https://phabricator.wikimedia.org/T407432 [20:29:21] T423262: Add REL1_46 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T423262 [20:29:34] FIRING: [60x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:29:37] Kemayo: all done, you're up [20:29:39] FIRING: [5x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired 
[20:29:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 450424656 and 50 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:29:45] do you have deploy rights, or need assistance? [20:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:34:34] FIRING: [59x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:35:00] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [20:35:20] Kemayo: i can ship your patch if you like, but need to start in the next 5 min or so [20:35:28] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:35:28] (03CR) 10Herron: [C:03+2] prometheus: switch to discovery2026 for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278527 (https://phabricator.wikimedia.org/T420993) (owner: 10Herron) [20:36:06] ebernhardson: Sorry, I was talking to someone. I can get it myself if you'd rather. [20:36:30] Kemayo: alright, just wanted to make sure you had rights. Sounds like it's all set. 
[20:36:40] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1278538 (https://phabricator.wikimedia.org/T424416) (owner: DLynch) [20:37:24] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:37:28] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:38:47] (CR) CI reject: [V:-1] ContentBranchNodeCheck: cope with null actions [extensions/VisualEditor] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1278538 (https://phabricator.wikimedia.org/T424416) (owner: DLynch) [20:39:34] FIRING: [57x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:39:34] *sigh* [20:39:45] (CR) DLynch: [C:+2] "recheck" [extensions/VisualEditor] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1278538 (https://phabricator.wikimedia.org/T424416) (owner: DLynch) [20:40:02] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1278538 (https://phabricator.wikimedia.org/T424416) (owner: DLynch) [20:40:02] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [20:40:10] "GnuTLS recv error (-54): Error in the pull function." what even [20:40:35] That's T421827 which has been causing issues for a while now.
[20:40:36] T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs - https://phabricator.wikimedia.org/T421827 [20:41:09] ahh ty [20:41:21] (Merged) jenkins-bot: ContentBranchNodeCheck: cope with null actions [extensions/VisualEditor] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1278538 (https://phabricator.wikimedia.org/T424416) (owner: DLynch) [20:41:49] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1278538|ContentBranchNodeCheck: cope with null actions (T424416)]] [20:42:04] T424416: Create generic paragraph checking functionality, usable by any edit check that works on a per-paragraph basis - https://phabricator.wikimedia.org/T424416 [20:42:26] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.67 ms [20:43:38] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1278538|ContentBranchNodeCheck: cope with null actions (T424416)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:44:35] !log kemayo@deploy1003 kemayo: Continuing with deployment [20:45:44] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:47:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 18.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:48:25] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278538|ContentBranchNodeCheck: cope with null actions (T424416)]] (duration: 06m 36s) [20:48:30] T424416: Create generic paragraph checking functionality, usable by any edit check that works on a per-paragraph basis - https://phabricator.wikimedia.org/T424416 [20:48:44] Okay, I'm all done. 
[20:49:34] FIRING: [54x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:49:36] (03PS1) 10Sbisson: testwiki: Article Guidance experiment config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278584 (https://phabricator.wikimedia.org/T417200) [20:50:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278584 (https://phabricator.wikimedia.org/T417200) (owner: 10Sbisson) [20:50:47] (03PS2) 10Sbisson: testwiki: Article Guidance experiment config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278584 (https://phabricator.wikimedia.org/T417200) [20:53:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 180316968 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:54:34] FIRING: [52x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:55:46] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4722808 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:56:51] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS trixie [20:58:54] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:59:28] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:59:34] FIRING: [50x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260428T2100) [21:00:28] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:02:58] o/ will do some deployments when I'm free. [21:03:56] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [21:04:34] FIRING: [48x] CertAlmostExpired: Certificate for service prometheus1005:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:04:39] FIRING: [6x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:04:40] (03PS1) 10Herron: prometheus::pop: switch to discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1278595 (https://phabricator.wikimedia.org/T420993) [21:04:56] (03CR) 10Herron: [C:03+2] prometheus::pop: switch to discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1278595 (https://phabricator.wikimedia.org/T420993) (owner: 10Herron) [21:07:29] !log dancy@deploy1003 Installing scap version "4.255.0" for 2 host(s) [21:08:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [21:08:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.26) - 
10https://gerrit.wikimedia.org/r/1278513 (https://phabricator.wikimedia.org/T424596) (owner: 10Jdlrobson) [21:09:20] !log dancy@deploy1003 Installation of scap version "4.255.0" completed for 2 hosts [21:11:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 43927056 and 38 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:13:46] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7680752 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:14:34] FIRING: [38x] CertAlmostExpired: Certificate for service prometheus2007:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:15:16] (03PS1) 10Ryan Kemper: wdqs: nuke dead config from legacy-full decom [puppet] - 10https://gerrit.wikimedia.org/r/1278602 (https://phabricator.wikimedia.org/T415073) [21:15:18] (03PS1) 10Ryan Kemper: cumin: repurpose wdqs-public, add wdqs-internal [puppet] - 10https://gerrit.wikimedia.org/r/1278603 (https://phabricator.wikimedia.org/T415073) [21:16:45] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage [21:19:34] FIRING: [34x] CertAlmostExpired: Certificate for service prometheus2007:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:19:39] RESOLVED: [6x] CertAlmostExpired: Certificate for service wdqs1012:443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:20:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage [21:20:48] (03CR) 10CI reject: [V:04-1] Provide support for upright in thumbnails for older browsers [core] (wmf/1.46.0-wmf.26) - 
10https://gerrit.wikimedia.org/r/1278513 (https://phabricator.wikimedia.org/T424596) (owner: 10Jdlrobson) [21:22:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 582854480 and 32 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:23:27] (03PS1) 10Bking: wcqs: Migrate to new discovery intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278610 (https://phabricator.wikimedia.org/T420993) [21:23:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278610 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [21:24:34] FIRING: [32x] CertAlmostExpired: Certificate for service prometheus3004:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:24:46] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:25:27] (03CR) 10Bking: [C:03+1] wdqs: nuke dead config from legacy-full decom [puppet] - 10https://gerrit.wikimedia.org/r/1278602 (https://phabricator.wikimedia.org/T415073) (owner: 10Ryan Kemper) [21:26:08] (03CR) 10Bking: [C:03+1] cumin: repurpose wdqs-public, add wdqs-internal [puppet] - 10https://gerrit.wikimedia.org/r/1278603 (https://phabricator.wikimedia.org/T415073) (owner: 10Ryan Kemper) [21:26:28] (03CR) 10Bking: [C:03+1] wdqs: drop dangling query-legacy-full helm refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278562 (https://phabricator.wikimedia.org/T415073) (owner: 10Ryan Kemper) [21:27:57] (03CR) 10SBassett: [C:03+1] miscweb: updated image for security landing page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278570 (https://phabricator.wikimedia.org/T423940) (owner: 10Mstyles) [21:28:02] (03PS1) 10Dduvall: zuul: Change name of main WMCS nodepool pool [puppet] - 10https://gerrit.wikimedia.org/r/1278623 
[21:28:41] (PS2) Ryan Kemper: wdqs: remove refs to defunct role wdqs::internal [puppet] - https://gerrit.wikimedia.org/r/1278561 (https://phabricator.wikimedia.org/T335067) (owner: Bking) [21:31:59] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1278513 (https://phabricator.wikimedia.org/T424596) (owner: Jdlrobson) [21:34:34] FIRING: [28x] CertAlmostExpired: Certificate for service prometheus4003:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:35:44] (PS2) Scott French: parsoid/testreduce: Switch to discovery2026 intermediate for Envoy [puppet] - https://gerrit.wikimedia.org/r/1278515 (https://phabricator.wikimedia.org/T424671) (owner: Jasmine) [21:35:59] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1278515 (https://phabricator.wikimedia.org/T424671) (owner: Jasmine) [21:38:41] (CR) Bking: [C:+2] wdqs: remove refs to defunct role wdqs::internal [puppet] - https://gerrit.wikimedia.org/r/1278561 (https://phabricator.wikimedia.org/T335067) (owner: Bking) [21:40:24] (Merged) jenkins-bot: Provide support for upright in thumbnails for older browsers [core] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1278513 (https://phabricator.wikimedia.org/T424596) (owner: Jdlrobson) [21:40:49] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1278513|Provide support for upright in thumbnails for older browsers (T424596)]] [21:40:54] T424596: Firefox 115esr doesn't support thumbnail sizes or upright parameter - https://phabricator.wikimedia.org/T424596 [21:42:39] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1278513|Provide support for upright in thumbnails for older browsers (T424596)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:45:14] (PS2) Bking: wdqs: nuke dead config from legacy-full decom [puppet] - https://gerrit.wikimedia.org/r/1278602 (https://phabricator.wikimedia.org/T415073) (owner: Ryan Kemper)
[21:47:40] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment
[21:47:52] (CR) Scott French: [C:+1] parsoid/testreduce: Switch to discovery2026 intermediate for Envoy [puppet] - https://gerrit.wikimedia.org/r/1278515 (https://phabricator.wikimedia.org/T424671) (owner: Jasmine)
[21:48:06] (CR) Bking: [C:+2] wdqs: nuke dead config from legacy-full decom [puppet] - https://gerrit.wikimedia.org/r/1278602 (https://phabricator.wikimedia.org/T415073) (owner: Ryan Kemper)
[21:48:30] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[21:51:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1009.eqiad.wmnet with OS trixie
[21:51:32] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278513|Provide support for upright in thumbnails for older browsers (T424596)]] (duration: 10m 43s)
[21:51:37] T424596: Firefox 115esr doesn't support thumbnail sizes or upright parameter - https://phabricator.wikimedia.org/T424596
[21:52:11] (CR) AKhatun: [C:+2] stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1277605 (https://phabricator.wikimedia.org/T424223) (owner: AKhatun)
[21:53:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:53:30] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[21:54:15] (Merged) jenkins-bot: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1277605 (https://phabricator.wikimedia.org/T424223) (owner: AKhatun)
[22:01:01] (CR) Dzahn: [C:+2] zuul: Change name of main WMCS nodepool pool [puppet] - https://gerrit.wikimedia.org/r/1278623 (owner: Dduvall)
[22:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 4d 15h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[22:06:46] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 685565104 and 43 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:08:46] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3100344 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:14:36] (CR) Jasmine: [C:+2] parsoid/testreduce: Switch to discovery2026 intermediate for Envoy [puppet] - https://gerrit.wikimedia.org/r/1278515 (https://phabricator.wikimedia.org/T424671) (owner: Jasmine)
[22:23:50] (CR) RLazarus: "> Yikes, the diff is definitely a mess." [deployment-charts] - https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) (owner: Elukey)
[22:40:48] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 139217720 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:41:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 75560 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:42:50] (PS1) Cathal Mooney: gnmic: add subscriptions to openconfig subinterface path [puppet] - https://gerrit.wikimedia.org/r/1278682
[22:43:26] (CR) CI reject: [V:-1] gnmic: add subscriptions to openconfig subinterface path [puppet] - https://gerrit.wikimedia.org/r/1278682 (owner: Cathal Mooney)
[22:44:13] (PS2) Cathal Mooney: gnmic: add subscriptions to openconfig subinterface path [puppet] - https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683)
[22:45:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1270882 (https://phabricator.wikimedia.org/T417690) (owner: D3r1ck01)
[22:45:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1271969 (https://phabricator.wikimedia.org/T418507) (owner: Bartosz Dziewoński)
[22:46:37] (CR) Cwhite: [C:-1] logstash/filter: increase sockets-timeout for unit tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli)
[22:49:04] (CR) Cathal Mooney: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) (owner: Cathal Mooney)
[22:49:10] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Network telemetry - collect device sub-interface statistics with gnmic - https://phabricator.wikimedia.org/T424683#11869006 (cmooney) I had a stab at this in the above patch. Some notes on the event processors added: |Name|Event Process...
[23:02:06] (PS3) Cathal Mooney: gnmic: add subscriptions to openconfig subinterface path [puppet] - https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683)
[23:12:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 60494384 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[23:13:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3986368 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[23:16:48] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 837500280 and 73 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[23:19:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[23:21:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:21:38] (PS1) Santiago Faci: WikiLambdaApi: update stream configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254)
[23:21:40] (PS1) Santiago Faci: WikiLambdaAPI: update stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1278705 (https://phabricator.wikimedia.org/T415254)
[23:23:11] (Abandoned) Santiago Faci: WikiLambdaAPI: update stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1278705 (https://phabricator.wikimedia.org/T415254) (owner: Santiago Faci)
[23:23:45] (CR) CI reject: [V:-1] WikiLambdaApi: update stream configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254) (owner: Santiago Faci)
[23:24:01] (PS2) Santiago Faci: WikiLambdaApi: update stream configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254)
[23:28:48] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 118723032 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[23:29:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 5080 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[23:38:47] (PS2) Scott French: deployment_server: Switch to discovery2026 intermediate for Envoy [puppet] - https://gerrit.wikimedia.org/r/1278516 (https://phabricator.wikimedia.org/T424671) (owner: Jasmine)
[23:38:57] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1278516 (https://phabricator.wikimedia.org/T424671) (owner: Jasmine)
[23:39:58] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1278717
[23:39:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1278717 (owner: TrainBranchBot)
[23:41:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:45:17] (CR) Scott French: [C:+1] deployment_server: Switch to discovery2026 intermediate for Envoy [puppet] - https://gerrit.wikimedia.org/r/1278516 (https://phabricator.wikimedia.org/T424671) (owner: Jasmine)
[23:45:24] (CR) Novem Linguae: [C:-1] "Sounds like this other related patch will be merged instead: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1265959" [mediawiki-config] - https://gerrit.wikimedia.org/r/1248954 (https://phabricator.wikimedia.org/T419309) (owner: ZhaoFJx)
[23:47:53] (CR) Novem Linguae: arbcom_zhwiki: Enable SecurePoll without PII rights (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 1F616EMO)
[23:55:16] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1278717 (owner: TrainBranchBot)