[00:27:07] <wikibugs>	 (PS3) Neriah: upload: Return 400 instead of 429 for non-standard thumbnail sizes [puppet] - https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805)
[00:58:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 183207792 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:50] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 50776 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:16] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc5 on pc2015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:08:36] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc1 on pc2021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:08:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[01:08:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, ...
[01:08:51] <jinxer-wm>	 IC-313592 51ms 10Gbps wave) {#11372}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqord:9804&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:08:58] <rzl>	 !ack
[01:08:58] <sirenbot>	 7879 (ACKED)  TransitPeeringTransportOutSaturation network sre (cr2-eqord:9804 Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372} xe-0/1/3 gnmi eqiad)
[01:09:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 131868000 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:10:50] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4600 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:13:23] <andrewbogott>	 rzl, are you able to talk me through what you're looking at? Or screenshare?
[01:13:51] <jinxer-wm>	 FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) #page - https://w.wiki/Gbyf  - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:17:36] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: pc1 on pc2021 is OK: OK slave_sql_lag Replication lag: 0.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:18:16] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: pc5 on pc2015 is OK: OK slave_sql_lag Replication lag: 0.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:23:51] <jinxer-wm>	 RESOLVED: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) #page - https://w.wiki/Gbyf  - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:28:57] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.dns.admin DNS admin: depool ulsfo for service: upload-addrs [reason: no reason specified, no task ID specified]
[01:29:10] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo for service: upload-addrs [reason: no reason specified, no task ID specified]
[01:34:49] <jinxer-wm>	 FIRING: [17x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:53:25] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:53:31] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[02:00:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:02:27] <wikibugs>	 (PS1) RLazarus: interfaces: Update playbook link [alerts] - https://gerrit.wikimedia.org/r/1278792
[02:03:06] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 1F616EMO)
[02:05:20] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 4d 11h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[02:08:28] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 271 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 765, active_shards: 1262, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 
[02:08:28] <icinga-wm>	 ayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 113, active_shards_percent_as_number: 82.32224396607958 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:09:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:28] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1332, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 196, delayed_unassign
[02:09:28] <icinga-wm>	 s: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.88845401174167 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:12:14] <wikibugs>	 ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11869336 (Jclark-ctr) @jhancock.wm eqiad servers failed install also.  @jijiki when you make change can you fix eqiad and codfw?
[02:16:26] <icinga-wm>	 PROBLEM - Host wikikube-worker1039 is DOWN: PING CRITICAL - Packet loss = 100%
[02:21:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:28:14] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS trixie
[02:29:16] <icinga-wm>	 PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:29:16] <icinga-wm>	 PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:29:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[02:34:19] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[02:36:14] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 277 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1256, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 
[02:36:14] <icinga-wm>	 ayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.93085453359426 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:36:28] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 273 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 765, active_shards: 1260, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 267, delayed_unassigned_shards: 0,
[02:36:28] <icinga-wm>	 of_pending_tasks: 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1036, active_shards_percent_as_number: 82.1917808219178 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:36:30] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 765, active_shards: 1263, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 
[02:36:30] <icinga-wm>	 ayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 38, active_shards_percent_as_number: 82.38747553816047 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:36:36] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 265 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1268, relocating_shards: 0, initializing_shards: 5, unassigned_shard
[02:36:36] <icinga-wm>	 delayed_unassigned_shards: 0, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 321, active_shards_percent_as_number: 82.7136333985649 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:36:38] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 265 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1268, relocating_shards: 0, initializing_shards: 5, unassigned_shard
[02:36:38] <icinga-wm>	 delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 128, active_shards_percent_as_number: 82.7136333985649 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:37:14] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1314, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 215, delayed_unassign
[02:37:14] <icinga-wm>	 s: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.71428571428571 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:37:28] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1329, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 199, delayed_unassigned_shards: 0, number_of_pending_ta
[02:37:28] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.69275929549902 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:37:30] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1331, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 199, delayed_unassign
[02:37:30] <icinga-wm>	 s: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.8232224396608 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:37:36] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1337, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 191, delayed_unassign
[02:37:36] <icinga-wm>	 s: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.21461187214612 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:37:38] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1337, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 191, delayed_unassign
[02:37:38] <icinga-wm>	 s: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.21461187214612 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:50:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 36572304 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:51:24] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1008.eqiad.wmnet with reason: host reimage
[02:51:50] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 116536 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:55:28] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1008.eqiad.wmnet with reason: host reimage
[03:16:21] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1008.eqiad.wmnet with OS trixie
[03:23:31] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[03:41:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:50:39] <wikibugs>	 (PS1) Jasmine: role::kafka::main: move to Confluent Kafka 3.7 [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216)
[03:51:57] <wikibugs>	 (CR) Jasmine: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: Jasmine)
[04:09:51] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 129657464 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:10:51] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 17704 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:16:38] <wikibugs>	 (Abandoned) Ryan Kemper: growthbook: Bump vendored job templ 1.0.1 → 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691) (owner: Ryan Kemper)
[04:33:44] <wikibugs>	 (PS2) Jasmine: kafka-main: set main-eqiad cluster brokers to Confluent distro 77 (3.7) [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216)
[04:35:00] <wikibugs>	 (PS3) Jasmine: kafka-main: set main-eqiad cluster brokers to Confluent distro 77 (3.7) [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216)
[04:35:50] <wikibugs>	 (PS4) Jasmine: kafka-main: set main-eqiad cluster brokers to Confluent distro 77 (3.7) [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216)
[04:38:43] <wikibugs>	 (Abandoned) Ryan Kemper: growthbook: Add automation API key placeholders [deployment-charts] - https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) (owner: Ryan Kemper)
[04:41:51] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 269825928 and 33 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:42:51] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2641264 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:00:27] <wikibugs>	 (CR) WAN233: [C:+1] change logo at zh-classical wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[05:02:18] <wikibugs>	 (CR) WAN233: [C:+1] change logo at zh-classical wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[05:07:03] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T424550
[05:07:07] <stashbot>	 T424550: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T424550
[05:07:09] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 5469.73 ms
[05:07:18] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1210 with weight 0 T424550', diff saved to https://phabricator.wikimedia.org/P91814 and previous config saved to /var/cache/conftool/dbconfig/20260429-050718-marostegui.json
[05:07:34] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1210 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1277598 (https://phabricator.wikimedia.org/T424550) (owner: Gerrit maintenance bot)
[05:08:10] <marostegui>	 !log Starting s5 eqiad failover from db1230 to db1210 - T424550
[05:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:00] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[05:09:31] <wikibugs>	 SRE, SRE-Access-Requests: Update SSH key for production access – Surbhi Gupta - https://phabricator.wikimedia.org/T422363#11869411 (SGupta-WMF) Resolved→Open Hi, I’ve configured my SSH setup with the new key and can reach the bastion (bast1004.wikimedia.org).  I can see my key being offered durin...
[05:10:33] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T424550', diff saved to https://phabricator.wikimedia.org/P91815 and previous config saved to /var/cache/conftool/dbconfig/20260429-051032-marostegui.json
[05:10:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1210 to s5 primary and set section read-write T424550', diff saved to https://phabricator.wikimedia.org/P91816 and previous config saved to /var/cache/conftool/dbconfig/20260429-051054-marostegui.json
[05:11:38] <logmsgbot>	 !log marostegui@dns1004 START - running authdns-update
[05:12:11] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 269.82 ms
[05:12:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1230 T424550', diff saved to https://phabricator.wikimedia.org/P91817 and previous config saved to /var/cache/conftool/dbconfig/20260429-051244-marostegui.json
[05:12:49] <stashbot>	 T424550: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T424550
[05:13:06] <logmsgbot>	 !log marostegui@dns1004 END - running authdns-update
[05:14:31] <icinga-wm>	 PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[05:15:31] <icinga-wm>	 RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[05:16:33] <wikibugs>	 (PS1) Marostegui: db1230: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1278851
[05:17:13] <wikibugs>	 (PS1) Marostegui: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1278874 (https://phabricator.wikimedia.org/T424550)
[05:17:23] <wikibugs>	 (Abandoned) Marostegui: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1277599 (https://phabricator.wikimedia.org/T424550) (owner: Gerrit maintenance bot)
[05:18:05] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1278874 (https://phabricator.wikimedia.org/T424550) (owner: Marostegui)
[05:18:31] <icinga-wm>	 PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[05:18:54] <logmsgbot>	 !log marostegui@dns1004 START - running authdns-update
[05:19:06] <wikibugs>	 (CR) Marostegui: [C:+2] db1230: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1278851 (owner: Marostegui)
[05:20:30] <logmsgbot>	 !log marostegui@dns1004 END - running authdns-update
[05:20:48] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1230.eqiad.wmnet with reason: Reimage to Trixie
[05:20:52] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1230: Reimage to Trixie
[05:21:00] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1230: Reimage to Trixie
[05:21:29] <wikibugs>	 SRE, SRE-Access-Requests: Update SSH key for production access – Surbhi Gupta - https://phabricator.wikimedia.org/T422363#11869437 (SGupta-WMF) Open→Resolved
[05:21:31] <icinga-wm>	 RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[05:22:27] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1230.eqiad.wmnet with OS trixie
[05:24:19] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:31] <icinga-wm>	 PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[05:29:19] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:30:06] <wikibugs>	 (PS1) Marostegui: db1254,db2225: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1279075 (https://phabricator.wikimedia.org/T424615)
[05:30:57] <wikibugs>	 (CR) Marostegui: [C:+2] db1254,db2225: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1279075 (https://phabricator.wikimedia.org/T424615) (owner: Marostegui)
[05:31:05] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2225.codfw.wmnet with reason: Reimage to Trixie
[05:31:11] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2225: Reimage to Trixie
[05:31:30] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2225: Reimage to Trixie
[05:31:35] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1254.eqiad.wmnet with reason: Reimage to Trixie
[05:31:39] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1254: Reimage to Trixie
[05:32:08] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1254: Reimage to Trixie
[05:32:56] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2225.codfw.wmnet with OS trixie
[05:34:49] <jinxer-wm>	 FIRING: [17x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:35:02] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1254.eqiad.wmnet with OS trixie
[05:37:49] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1230.eqiad.wmnet with reason: host reimage
[05:42:47] <wikibugs>	 (PS1) Abijeet Patro: Don't load general modules  as style modules [extensions/Translate] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1279078 (https://phabricator.wikimedia.org/T424618)
[05:43:15] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/Translate] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1279078 (https://phabricator.wikimedia.org/T424618) (owner: Abijeet Patro)
[05:43:47] <wikibugs>	 (Abandoned) Abijeet Patro: Don't load general modules  as style modules [extensions/Translate] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1279078 (https://phabricator.wikimedia.org/T424618) (owner: Abijeet Patro)
[05:44:16] <wikibugs>	 (PS1) Abijeet Patro: Don't load general modules  as style modules [extensions/Translate] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1279079 (https://phabricator.wikimedia.org/T424618)
[05:44:38] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1230.eqiad.wmnet with reason: host reimage
[05:44:40] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/Translate] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1279079 (https://phabricator.wikimedia.org/T424618) (owner: Abijeet Patro)
[05:44:55] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage db2251 [puppet] - https://gerrit.wikimedia.org/r/1279080 (https://phabricator.wikimedia.org/T418979)
[05:45:46] <wikibugs>	 (PS1) Marostegui: Revert "db1254,db2225: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1279081
[05:45:59] <wikibugs>	 (PS1) Marostegui: Revert "db1230: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1279082
[05:47:05] <wikibugs>	 (CR) Ayounsi: [C:-1] "I've updated the Wikipage instead: https://wikitech.wikimedia.org/w/index.php?title=Network_monitoring&diff=2407118&oldid=2377392" [alerts] - https://gerrit.wikimedia.org/r/1278792 (owner: RLazarus)
[05:47:54] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage db2251 [puppet] - https://gerrit.wikimedia.org/r/1279080 (https://phabricator.wikimedia.org/T418979) (owner: Marostegui)
[05:49:16] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2225.codfw.wmnet with reason: host reimage
[05:50:13] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1254.eqiad.wmnet with reason: host reimage
[05:53:25] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:55:05] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2225.codfw.wmnet with reason: host reimage
[05:57:42] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1230: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279082 (owner: 10Marostegui)
[05:59:44] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1254.eqiad.wmnet with reason: host reimage
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T0600)
[06:05:20] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 4d 7h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[06:06:21] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1230.eqiad.wmnet with OS trixie
[06:08:29] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1230: after reimage to trixie
[06:12:10] <wikibugs>	 (03PS4) 10Ryan Kemper: growthbook: Drop dead SSO_CONFIG placeholder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270559 (https://phabricator.wikimedia.org/T420696)
[06:13:32] <wikibugs>	 (03PS1) 10Marostegui: db1198,db2227: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279083 (https://phabricator.wikimedia.org/T424792)
[06:15:28] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1254,db2225: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279081 (owner: 10Marostegui)
[06:15:42] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1198,db2227: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279083 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui)
[06:16:58] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1198.eqiad.wmnet with reason: Reimage to Trixie
[06:17:03] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1198: Reimage to Trixie
[06:17:12] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2227.codfw.wmnet with reason: Reimage to Trixie
[06:17:18] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2227: Reimage to Trixie
[06:17:36] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2227: Reimage to Trixie
[06:18:00] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1198: Reimage to Trixie
[06:18:16] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2225.codfw.wmnet with OS trixie
[06:19:34] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1198.eqiad.wmnet with OS trixie
[06:19:46] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS trixie
[06:20:56] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2225: after reimage to trixie
[06:21:48] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:22:12] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1254.eqiad.wmnet with OS trixie
[06:25:29] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1254: after reimage to trixie
[06:31:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:31:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] growthbook: Drop dead SSO_CONFIG placeholder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270559 (https://phabricator.wikimedia.org/T420696) (owner: 10Ryan Kemper)
[06:32:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1278603 (https://phabricator.wikimedia.org/T415073) (owner: 10Ryan Kemper)
[06:33:21] <wikibugs>	 (03PS6) 10Elukey: services: Add TLS SANs to the evaluators' mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193)
[06:33:51] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage
[06:34:13] <wikibugs>	 (03CR) 10Elukey: "oh noeeessss! Sorry :( It turns out that my attention is not good if I do 10 things at a time (like renewing TLS certs). Hopefully final" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey)
[06:35:52] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Change looks good to me! I think that at this point the rollout is safe enough to proceed with eqiad first, but we could also tackle codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine)
[06:36:31] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove pc2011 [puppet] - 10https://gerrit.wikimedia.org/r/1279084 (https://phabricator.wikimedia.org/T424012)
[06:36:44] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1011 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:36:44] <wikibugs>	 (03CR) 10Elukey: [C:03+1] restbase: migrate envoy TLS proxy services to new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278554 (https://phabricator.wikimedia.org/T424674) (owner: 10Eevans)
[06:37:10] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2227.codfw.wmnet with reason: host reimage
[06:37:36] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts pc2011.codfw.wmnet
[06:38:00] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Remove pc2011 [puppet] - 10https://gerrit.wikimedia.org/r/1279084 (https://phabricator.wikimedia.org/T424012) (owner: 10Marostegui)
[06:38:35] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1198,db2227: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279085
[06:39:04] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage
[06:39:54] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] deployment_server: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278516 (https://phabricator.wikimedia.org/T424671) (owner: 10Jasmine)
[06:40:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for dtorsani [puppet] - 10https://gerrit.wikimedia.org/r/1279086
[06:41:05] <icinga-wm>	 RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:41:05] <icinga-wm>	 RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:41:05] <icinga-wm>	 RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:41:05] <icinga-wm>	 RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:42:35] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269464 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse)
[06:42:47] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
[06:43:20] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2227.codfw.wmnet with reason: host reimage
[06:44:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for dtorsani [puppet] - 10https://gerrit.wikimedia.org/r/1279086 (owner: 10Muehlenhoff)
[06:44:16] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:44:16] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:44:16] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:44:16] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:46:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:46:44] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1011 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:48:35] <logmsgbot>	 marostegui@cumin1003 decommission (PID 2302352) is awaiting input
[06:53:38] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc2011.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[06:53:56] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1230: after reimage to trixie
[06:54:09] <wikibugs>	 (03PS2) 10Tiziano Fogli: logstash/filter: increase sockets-timeout for unit tests [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986)
[06:54:09] <wikibugs>	 (03PS5) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986)
[06:54:49] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc2011.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
[06:54:49] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:54:50] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2011.codfw.wmnet
[06:55:59] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2011.codfw.wmnet - https://phabricator.wikimedia.org/T424012#11869572 (10Marostegui) a:05Marostegui→03Jhancock.wm
[06:56:05] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2011.codfw.wmnet - https://phabricator.wikimedia.org/T424012#11869576 (10Marostegui) Ready for dc-ops
[06:56:34] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1198,db2227: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279085 (owner: 10Marostegui)
[06:58:00] <wikibugs>	 (03CR) 10Tiziano Fogli: logstash/filter: increase sockets-timeout for unit tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli)
[06:58:52] <wikibugs>	 (03Abandoned) 10Ryan Kemper: Revert wdqs deadlock remediation threshold to 600 [puppet] - 10https://gerrit.wikimedia.org/r/1263176 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper)
[06:59:50] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] dse-k8s: Also write set-rbd-readahead logs to journal [puppet] - 10https://gerrit.wikimedia.org/r/1255887 (https://phabricator.wikimedia.org/T419041) (owner: 10Ryan Kemper)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T0700).
[07:00:04] <jouncebot>	 dcausse and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:29] <dcausse>	 o/
[07:01:18] <dcausse>	 I can deploy
[07:01:40] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1198.eqiad.wmnet with OS trixie
[07:03:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [extensions/WikibaseCirrusSearch] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278509 (https://phabricator.wikimedia.org/T417648) (owner: 10DCausse)
[07:03:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269464 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse)
[07:04:31] <wikibugs>	 (03Merged) 10jenkins-bot: search: add alt. completion indices to test keyword tokenizer (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269464 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse)
[07:04:50] <wikibugs>	 (03PS7) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049)
[07:05:11] <wikibugs>	 (03PS8) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049)
[07:06:04] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2227.codfw.wmnet with OS trixie
[07:06:23] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2225: after reimage to trixie
[07:06:53] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1198: after reimage to trixie
[07:10:41] <wikibugs>	 (03PS1) 10Muehlenhoff: idp_clouddev: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279091 (https://phabricator.wikimedia.org/T424676)
[07:11:00] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1254: after reimage to trixie
[07:11:19] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2227: after reimage to trixie
[07:14:05] <wikibugs>	 (03PS1) 10Arnaudb: phabricator: add -ignore_readdir_race to clean_tmp_files service [puppet] - 10https://gerrit.wikimedia.org/r/1279092 (https://phabricator.wikimedia.org/T424796)
[07:15:11] <wikibugs>	 (03Merged) 10jenkins-bot: Completion: fix near match field name [extensions/WikibaseCirrusSearch] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278509 (https://phabricator.wikimedia.org/T417648) (owner: 10DCausse)
[07:15:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] idp_clouddev: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279091 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff)
[07:16:08] <wikibugs>	 (03PS1) 10Arnaudb: vrts: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1279090 (https://phabricator.wikimedia.org/T424669)
[07:16:14] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] vrts: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1279090 (https://phabricator.wikimedia.org/T424669) (owner: 10Arnaudb)
[07:17:19] <logmsgbot>	 !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1278509|Completion: fix near match field name (T417648)]], [[gerrit:1269464|search: add alt. completion indices to test keyword tokenizer (1/2) (T420427)]]
[07:17:24] <stashbot>	 T417648: [MEX] M4 - improve findability of properties on lookups - https://phabricator.wikimedia.org/T417648
[07:17:25] <stashbot>	 T420427: Search shouldn't trim trailing space when suggesting suggestions - https://phabricator.wikimedia.org/T420427
[07:19:18] <logmsgbot>	 !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1278509|Completion: fix near match field name (T417648)]], [[gerrit:1269464|search: add alt. completion indices to test keyword tokenizer (1/2) (T420427)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:20:30] <wikibugs>	 (03PS1) 10Arnaudb: lists: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1279089 (https://phabricator.wikimedia.org/T424669)
[07:20:32] <logmsgbot>	 !log dcausse@deploy1003 dcausse: Continuing with deployment
[07:21:26] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797 (10JMeybohm) 03NEW
[07:21:35] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11869639 (10JMeybohm)
[07:22:59] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1039.eqiad.wmnet
[07:23:12] <wikibugs>	 (03CR) 10A smart kitten: "(in case you have any interest in reviewing logo patches, apologies if not `:)`)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: 10WAN233)
[07:23:45] <wikibugs>	 (03CR) 10Phuedx: [C:03+1] WikiLambdaApi: update stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254) (owner: 10Santiago Faci)
[07:23:45] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[07:24:26] <logmsgbot>	 !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278509|Completion: fix near match field name (T417648)]], [[gerrit:1269464|search: add alt. completion indices to test keyword tokenizer (1/2) (T420427)]] (duration: 07m 07s)
[07:24:32] <stashbot>	 T417648: [MEX] M4 - improve findability of properties on lookups - https://phabricator.wikimedia.org/T417648
[07:24:32] <stashbot>	 T420427: Search shouldn't trim trailing space when suggesting suggestions - https://phabricator.wikimedia.org/T420427
[07:24:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394#11869641 (10ayounsi) That's correct. Those switches are also EOL and will be refreshed next FY. New switches will be 25G compatible.
[07:25:36] <logmsgbot>	 !log jayme@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host wikikube-worker1039.eqiad.wmnet
[07:26:08] <dcausse>	 I'm done deploying
[07:26:44] <wikibugs>	 (03PS1) 10Arnaudb: ci: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1279088 (https://phabricator.wikimedia.org/T424669)
[07:29:38] <wikibugs>	 (03PS1) 10Muehlenhoff: idm: Unconditionally use Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279095
[07:30:13] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1039.eqiad.wmnet
[07:30:15] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1039.eqiad.wmnet
[07:30:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11869648 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1003 depool for host wikikube-worker1039.eqi...
[07:31:07] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[07:31:37] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11869655 (10JMeybohm)
[07:31:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] idm: Unconditionally use Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279095 (owner: 10Muehlenhoff)
[07:31:44] <wikibugs>	 (03PS1) 10Marostegui: db1233,db2189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279096 (https://phabricator.wikimedia.org/T424615)
[07:32:33] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2189.codfw.wmnet with reason: Reimage to Trixie
[07:32:37] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1233,db2189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279096 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui)
[07:32:39] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2189: Reimage to Trixie
[07:32:41] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1233.eqiad.wmnet with reason: Reimage to Trixie
[07:32:46] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1233: Reimage to Trixie
[07:32:53] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1279087 (https://phabricator.wikimedia.org/T424669)
[07:33:04] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1233: Reimage to Trixie
[07:33:07] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2189: Reimage to Trixie
[07:34:16] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2189.codfw.wmnet with OS trixie
[07:34:31] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1233,db2189: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279097
[07:34:38] <wikibugs>	 (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1279097 (owner: 10Marostegui)
[07:34:51] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1233.eqiad.wmnet with OS trixie
[07:36:09] <wikibugs>	 (03PS2) 10Muehlenhoff: idm: Unconditionally use Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279095
[07:37:52] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1279092 (https://phabricator.wikimedia.org/T424796) (owner: 10Arnaudb)
[07:38:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] idm: Unconditionally use Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279095 (owner: 10Muehlenhoff)
[07:38:17] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] phabricator: add -ignore_readdir_race to clean_tmp_files service [puppet] - 10https://gerrit.wikimedia.org/r/1279092 (https://phabricator.wikimedia.org/T424796) (owner: 10Arnaudb)
[07:38:27] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS trixie
[07:39:28] <ryankemper>	 !log T422860 [cloudelastic] Restarted opensearch services on `cloudelastic1011` and `cloudelastic1012` (needed to pick up missing opensearch plugins, which have already been fixed in puppet) (note: this was done ~2h ago; logged in wrong channel)
[07:39:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:32] <stashbot>	 T422860: Migrate Cloudelastic to OpenSearch 2.x - https://phabricator.wikimedia.org/T422860
[07:43:29] <wikibugs>	 (03PS1) 10Arnaudb: envoyproxy: update verify-envoy-config logic [puppet] - 10https://gerrit.wikimedia.org/r/1278482 (https://phabricator.wikimedia.org/T421827)
[07:43:29] <wikibugs>	 (03CR) 10Arnaudb: "the initial change has been split into a relation chain, sorry for the spam!" [puppet] - 10https://gerrit.wikimedia.org/r/1278482 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb)
[07:44:44] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 308 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1225, relocating_shards: 0, initializing_shards: 23, unassigned_shar
[07:44:44] <icinga-wm>	  delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.90867579908677 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:44:46] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 308 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1225, relocating_shards: 0, initializing_shards: 23, unassigned_shar
[07:44:46] <icinga-wm>	  delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.90867579908677 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:44:46] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 308 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1225, relocating_shards: 0, initializing_shards: 23, unassigned_shar
[07:44:46] <icinga-wm>	  delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.90867579908677 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:45:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 302 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1231, relocating_shards: 0, initializing_shards: 21, unassigned_shar
[07:45:20] <icinga-wm>	  delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.30006523157208 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:45:32] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 297 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1236, relocating_shards: 0, initializing_shards: 21, unassigned_shar
[07:45:32] <icinga-wm>	  delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.62622309197651 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:47:48] <ryankemper>	 ^ cluster was green before reimage of a single host, this shouldn't have happened. investigating. note this is cloudelastic not prod-cirrus, so not a huge blast radius
[07:49:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1304, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 208, delayed_unassig
[07:49:20] <icinga-wm>	 ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.06196999347684 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:49:32] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1306, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 206, delayed_unassig
[07:49:32] <icinga-wm>	 ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.19243313763862 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:49:44] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1311, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 201, delayed_unassig
[07:49:44] <icinga-wm>	 ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.51859099804305 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:49:46] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1312, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 200, delayed_unassig
[07:49:46] <icinga-wm>	 ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.58382257012394 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:49:46] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1312, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 200, delayed_unassig
[07:49:46] <icinga-wm>	 ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.58382257012394 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:51:33] <ryankemper>	 ah, I misread the original output; it went green->yellow not green->red. sorry for the noise, should quiet down now though
[07:51:59] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1277503 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb)
[07:51:59] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: Add TLS SANs to the evaluators' mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey)
[07:52:06] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1233.eqiad.wmnet with reason: host reimage
[07:52:09] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[07:52:17] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1198: after reimage to trixie
[07:52:44] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[07:53:21] <logmsgbot>	 !log a-pizzata@deploy1003 Started deploy [analytics/refinery@d6a17a0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d6a17a0a]
[07:53:38] <elukey>	 jouncebot: nowandnext
[07:53:38] <jouncebot>	 For the next 0 hour(s) and 6 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T0700)
[07:53:38] <jouncebot>	 In 2 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1000)
[07:53:49] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2189.codfw.wmnet with reason: host reimage
[07:54:24] <wikibugs>	 (03PS1) 10Marostegui: db1175,db2194: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279158 (https://phabricator.wikimedia.org/T424792)
[07:55:19] <logmsgbot>	 !log a-pizzata@deploy1003 Finished deploy [analytics/refinery@d6a17a0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d6a17a0a] (duration: 01m 57s)
[07:55:28] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1175,db2194: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279158 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui)
[07:55:32] <logmsgbot>	 !log a-pizzata@deploy1003 Started deploy [analytics/refinery@d6a17a0]: Regular analytics weekly train [analytics/refinery@d6a17a0a]
[07:55:56] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1175.eqiad.wmnet with reason: Reimage to Trixie
[07:56:01] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1175: Reimage to Trixie
[07:56:08] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1175: Reimage to Trixie
[07:56:38] <wikibugs>	 (03CR) 10Elukey: [C:03+2] restbase: migrate envoy TLS proxy services to new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278554 (https://phabricator.wikimedia.org/T424674) (owner: 10Eevans)
[07:56:44] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2227: after reimage to trixie
[07:57:22] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2194.codfw.wmnet with reason: Reimage to Trixie
[07:57:28] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2194: Reimage to Trixie
[07:57:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11869732 (10ayounsi) Good job! The last step needed was to run the ImportPuppetDB Netbox script: https://netbox.wikimedia.org/extras/scrip...
[07:57:46] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2194: Reimage to Trixie
[07:58:29] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1175.eqiad.wmnet with OS trixie
[07:59:15] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1233.eqiad.wmnet with reason: host reimage
[07:59:23] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2194.codfw.wmnet with OS trixie
[07:59:40] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] kafka-main: set main-eqiad cluster brokers to Confluent distro 77 (3.7) [puppet] - 10https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine)
[07:59:44] <logmsgbot>	 !log a-pizzata@deploy1003 Finished deploy [analytics/refinery@d6a17a0]: Regular analytics weekly train [analytics/refinery@d6a17a0a] (duration: 04m 12s)
[08:01:13] <wikibugs>	 (03PS1) 10Jelto: gitlab: rename backup-restore process [puppet] - 10https://gerrit.wikimedia.org/r/1279229 (https://phabricator.wikimedia.org/T424239)
[08:01:35] <wikibugs>	 (03PS1) 10MVernon: role::cephadm::rgw: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279230 (https://phabricator.wikimedia.org/T424674)
[08:02:23] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2189.codfw.wmnet with reason: host reimage
[08:03:32] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1175,db2194: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279231
[08:03:45] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8489/co" [puppet] - 10https://gerrit.wikimedia.org/r/1279229 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto)
[08:06:50] <logmsgbot>	 !log a-pizzata@deploy1003 Started deploy [analytics/refinery@d6a17a0] (thin): Regular analytics weekly train THIN [analytics/refinery@d6a17a0a]
[08:07:49] <wikibugs>	 (03PS1) 10Dpogorzelski: ml-serve: fix gpu partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/1279232
[08:08:45] <logmsgbot>	 !log a-pizzata@deploy1003 Finished deploy [analytics/refinery@d6a17a0] (thin): Regular analytics weekly train THIN [analytics/refinery@d6a17a0a] (duration: 01m 54s)
[08:08:51] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11869774 (10ayounsi) >>! In T327300#11843281, @FCeratto-WMF wrote: > In zarcillo we have the relation `host <-> role <-> rack` and we can label replicas and candidates as depool...
[08:09:05] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[08:09:30] <icinga-wm>	 RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 750 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:09:30] <icinga-wm>	 RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:09:34] <jinxer-wm>	 FIRING: [17x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:12:39] <icinga-wm>	 RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[08:13:42] <wikibugs>	 (03CR) 10Elukey: [C:03+1] role::cephadm::rgw: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279230 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon)
[08:14:11] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage
[08:14:27] <wikibugs>	 (03PS1) 10Muehlenhoff: puppetserver: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279234 (https://phabricator.wikimedia.org/T424676)
[08:14:34] <jinxer-wm>	 FIRING: [17x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:14:42] <wikibugs>	 (03PS2) 10Muehlenhoff: puppetserver: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279234 (https://phabricator.wikimedia.org/T424676)
[08:15:25] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:15:29] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 4.752 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:15:35] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:15:35] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:15:41] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:15:41] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:15:43] <wikibugs>	 (03CR) 10Marostegui: Revert "db1233,db2189: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279097 (owner: 10Marostegui)
[08:15:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:15:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:15:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:16:04] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1233,db2189: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279097 (owner: 10Marostegui)
[08:16:25] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:16:25] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:16:35] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:16:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:16:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:16:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:16:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:16:50] <Emperor>	 !log disable puppet in apus/codfw for TLS key rollover T424674
[08:16:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:54] <stashbot>	 T424674: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674
[08:16:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:17:25] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 1.404 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:17:25] <jynus>	 Emperor: expected bump?
[08:17:39] <jinxer-wm>	 FIRING: DiskSpace: Disk space cloudelastic1010:9100:/srv 13.17% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudelastic1010 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:17:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:17:45] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 6.433 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:17:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:17:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:17:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:17:52] <Emperor>	 no, I was working on apus, I just want to put that back, then I'll get to the page
[08:17:53] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 499.56 ms
[08:18:10] <jinxer-wm>	 FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:18:25] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:18:26] <Emperor>	 !log re-enable puppet in apus/codfw for TLS key rollover T424674 (no change, incident took over)
[08:18:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:32] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 1.906 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:18:40] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:18:40] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:18:40] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:18:40] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:18:40] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:18:50] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:18:53] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage
[08:19:50] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 9.205 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:19:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ...
[08:19:51] <jinxer-wm>	 IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[08:19:58] <wikibugs>	 (03CR) 10JavierMonton: [C:03+1] alerts: mw-page-html-feature-counts-change-enrich (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) (owner: 10AKhatun)
[08:20:42] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:21:27] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: host reimage
[08:21:28] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1233.eqiad.wmnet with OS trixie
[08:21:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:22:10] <wikibugs>	 (03CR) 10Elukey: [C:03+1] puppetserver: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279234 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff)
[08:22:15] <wikibugs>	 (03CR) 10Elukey: [C:03+2] puppetserver: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279234 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff)
[08:24:34] <jinxer-wm>	 FIRING: [16x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:24:42] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2189.codfw.wmnet with OS trixie
[08:24:44] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: host reimage
[08:24:48] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1233: after reimage to trixie
[08:26:08] <wikibugs>	 (03PS1) 10Elukey: role::config_master: move to pki intermediate discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279235 (https://phabricator.wikimedia.org/T424676)
[08:27:09] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "+1 to do codfw first" [puppet] - 10https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine)
[08:28:14] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2189: after reimage to trixie
[08:29:17] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:29:56] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: sync
[08:29:59] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync
[08:31:39] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-serve: fix gpu partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/1279232 (owner: 10Dpogorzelski)
[08:34:19] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.94 ms
[08:36:07] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1175,db2194: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279231 (owner: 10Marostegui)
[08:36:43] <wikibugs>	 (03PS1) 10Muehlenhoff: configmaster: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279237 (https://phabricator.wikimedia.org/T424676)
[08:37:04] <wikibugs>	 (03CR) 10Elukey: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1279235 :P" [puppet] - 10https://gerrit.wikimedia.org/r/1279237 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff)
[08:37:32] <logmsgbot>	 !log urbanecm@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki Wikimedia_Apps/Team/Android/TriviaGame 'Wikimedia Apps/Team/Android/Which' came 'first? Game' 'Martin Urbanec (WMF)' '--reason=per [[:phab:T423845]]'  # T423845
[08:37:37] <stashbot>	 T423845: Request to move translatable page: Wikimedia Apps/Team/Android/TriviaGame - https://phabricator.wikimedia.org/T423845
[08:37:46] <wikibugs>	 (03PS1) 10Kevin Bazira: inference-services: allow LLM isvcs to work on ml-serve1014 and ml-serve1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279238 (https://phabricator.wikimedia.org/T418350)
[08:38:03] <logmsgbot>	 !log urbanecm@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki Wikimedia_Apps/Team/Android/TriviaGame 'Wikimedia Apps/Team/Android/"Which came first?" Game' 'Martin Urbanec (WMF)' '--reason=per [[:phab:T423845]]'  # T423845
[08:38:53] <logmsgbot>	 !log urbanecm@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki Wikimedia_Apps/Team/Android/TriviaGame 'Wikimedia Apps/Team/Android/"Which came first?" Game' 'Martin Urbanec (WMF)' '--reason=per [[:phab:T423845]]'  # T423845
[08:39:34] <jinxer-wm>	 FIRING: [16x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:39:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CheckUser] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1278380 (https://phabricator.wikimedia.org/T420517) (owner: 10STran)
[08:40:43] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[08:40:50] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1175.eqiad.wmnet with OS trixie
[08:41:17] <wikibugs>	 (03PS6) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986)
[08:41:33] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: rename backup-restore process [puppet] - 10https://gerrit.wikimedia.org/r/1279229 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto)
[08:42:16] <logmsgbot>	 !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS trixie
[08:42:39] <jinxer-wm>	 FIRING: DiskSpace: Disk space cloudelastic1010:9100:/srv 8.062% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudelastic1010 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:42:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279235 (https://phabricator.wikimedia.org/T424676) (owner: 10Elukey)
[08:43:27] <wikibugs>	 (03CR) 10Muehlenhoff: "All great minds think alike :) +1d yours, gonna abandon mine" [puppet] - 10https://gerrit.wikimedia.org/r/1279237 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff)
[08:43:34] <wikibugs>	 (03Abandoned) 10Muehlenhoff: configmaster: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279237 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff)
[08:45:13] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1175: after reimage to trixie
[08:45:15] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] inference-services: allow LLM isvcs to work on ml-serve1014 and ml-serve1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279238 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[08:46:04] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] inference-services: allow LLM isvcs to work on ml-serve1014 and ml-serve1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279238 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[08:48:25] <wikibugs>	 (03Merged) 10jenkins-bot: inference-services: allow LLM isvcs to work on ml-serve1014 and ml-serve1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279238 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[08:48:59] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2194.codfw.wmnet with OS trixie
[08:51:17] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[08:51:50] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[08:53:31] <jinxer-wm>	 RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[08:54:15] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2194: after reimage to trixie
[08:56:09] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 229.04 ms
[08:56:47] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[08:56:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T419961)', diff saved to https://phabricator.wikimedia.org/P91854 and previous config saved to /var/cache/conftool/dbconfig/20260429-085654-fceratto.json
[08:59:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ...
[08:59:51] <jinxer-wm>	 IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[09:00:13] <logmsgbot>	 jmm@cumin2002 reimage (PID 197991) is awaiting input
[09:01:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5005.eqsin.wmnet with OS bookworm
[09:01:22] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11869976 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5005.eqsin.wmnet with OS bookworm
[09:02:31] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[09:04:15] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 13Patch-For-Review: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11869984 (10MoritzMuehlenhoff)
[09:05:32] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11869990 (10MoritzMuehlenhoff)
[09:05:35] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T419961)', diff saved to https://phabricator.wikimedia.org/P91857 and previous config saved to /var/cache/conftool/dbconfig/20260429-090534-fceratto.json
[09:06:07] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] envoy: configure listener buffer and fast open queue length [puppet] - 10https://gerrit.wikimedia.org/r/1277503 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb)
[09:07:40] <wikibugs>	 (03PS2) 10Jelto: sre.gitlab.upgrade: add downtime for failing gitlab-backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239)
[09:07:40] <wikibugs>	 (03CR) 10Jelto: "I used some of the code from I3a0cc2c0ce747af5b31cdccdb6ad60d290bb2305" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto)
[09:07:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1278610 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking)
[09:09:00] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[09:10:12] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1233: after reimage to trixie
[09:10:49] <wikibugs>	 (03PS1) 10Gkyziridis: ml-services: Deploy rr-multilingual latest model version on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279244 (https://phabricator.wikimedia.org/T415892)
[09:11:55] <logmsgbot>	 jmm@cumin2002 reimage (PID 197991) is awaiting input
[09:13:39] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2189: after reimage to trixie
[09:15:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P91862 and previous config saved to /var/cache/conftool/dbconfig/20260429-091542-fceratto.json
[09:15:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1279230 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon)
[09:16:42] <wikibugs>	 06SRE, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11870035 (10tappof)
[09:17:39] <jinxer-wm>	 FIRING: [2x] DiskSpace: Disk space cloudelastic1010:9100:/srv 9.07% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudelastic1010 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:17:51] <wikibugs>	 (03CR) 10Elukey: [C:03+2] role::config_master: move to pki intermediate discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279235 (https://phabricator.wikimedia.org/T424676) (owner: 10Elukey)
[09:17:52] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 229.29 ms
[09:19:06] <wikibugs>	 (03PS1) 10Marostegui: db1229,db2175: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279245 (https://phabricator.wikimedia.org/T424615)
[09:19:46] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2175.codfw.wmnet with reason: Reimage to Trixie
[09:19:51] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2175: Reimage to Trixie
[09:20:03] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1229,db2175: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279245 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui)
[09:20:11] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1229.eqiad.wmnet with reason: Reimage to Trixie
[09:20:16] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1229: Reimage to Trixie
[09:20:19] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2175: Reimage to Trixie
[09:21:04] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1229: Reimage to Trixie
[09:21:54] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2175.codfw.wmnet with OS trixie
[09:22:10] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS trixie
[09:22:23] <wikibugs>	 (03CR) 10Arnaudb: "this will be an improvement for the upgrade process, thanks! I think I spotted a small issue, let me know if that does not make sense" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto)
[09:23:33] <wikibugs>	 (03PS3) 10Jelto: sre.gitlab.upgrade: add downtime for failing gitlab-backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239)
[09:24:34] <jinxer-wm>	 FIRING: [15x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:25:41] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: primary network link stable, no task ID specified]
[09:25:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P91866 and previous config saved to /var/cache/conftool/dbconfig/20260429-092551-fceratto.json
[09:25:59] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool ulsfo [reason: primary network link stable, no task ID specified]
[09:27:48] <wikibugs>	 (03CR) 10Jelto: sre.gitlab.upgrade: add downtime for failing gitlab-backup-restore.service (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto)
[09:28:25] <wikibugs>	 (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279247 (https://phabricator.wikimedia.org/T424624)
[09:28:26] <wikibugs>	 (03PS1) 10Elukey: Update Yarn, Analytics Webserver, Eventschemas and Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1279246 (https://phabricator.wikimedia.org/T424672)
[09:28:32] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "thanks for the change and the quick fix, lgtm!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto)
[09:28:44] <wikibugs>	 06SRE, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11870080 (10tappof)
[09:30:14] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1229,db2175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279249
[09:30:38] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1175: after reimage to trixie
[09:30:57] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1279246 (https://phabricator.wikimedia.org/T424672) (owner: 10Elukey)
[09:31:13] <wikibugs>	 (03CR) 10Jelto: [C:03+2] sre.gitlab.upgrade: add downtime for failing gitlab-backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto)
[09:31:37] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Update Yarn, Analytics Webserver, Eventschemas and Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1279246 (https://phabricator.wikimedia.org/T424672) (owner: 10Elukey)
[09:32:39] <jinxer-wm>	 RESOLVED: [2x] DiskSpace: Disk space cloudelastic1010:9100:/srv 9.095% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudelastic1010 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:33:47] <wikibugs>	 (03PS1) 10Marostegui: db1166,db2190: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279250 (https://phabricator.wikimedia.org/T424792)
[09:34:01] <wikibugs>	 (03Merged) 10jenkins-bot: sre.gitlab.upgrade: add downtime for failing gitlab-backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto)
[09:34:11] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1166.eqiad.wmnet with reason: Reimage to Trixie
[09:34:16] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1166: Reimage to Trixie
[09:34:20] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:26] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1229,db2175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279249 (owner: 10Marostegui)
[09:34:34] <jinxer-wm>	 FIRING: [15x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:34:35] <wikibugs>	 (03CR) 10Marostegui: Revert "db1229,db2175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279249 (owner: 10Marostegui)
[09:34:43] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1166,db2190: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279250 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui)
[09:34:44] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1166: Reimage to Trixie
[09:35:56] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1166.eqiad.wmnet with OS trixie
[09:35:58] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T419961)', diff saved to https://phabricator.wikimedia.org/P91869 and previous config saved to /var/cache/conftool/dbconfig/20260429-093557-fceratto.json
[09:36:17] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[09:36:25] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T419961)', diff saved to https://phabricator.wikimedia.org/P91870 and previous config saved to /var/cache/conftool/dbconfig/20260429-093624-fceratto.json
[09:37:04] <wikibugs>	 (03PS1) 10Tiziano Fogli: prom5003: assign prometheus::pop role [puppet] - 10https://gerrit.wikimedia.org/r/1279251 (https://phabricator.wikimedia.org/T424024)
[09:37:06] <wikibugs>	 (03PS1) 10Tiziano Fogli: prom5003: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1279252 (https://phabricator.wikimedia.org/T424024)
[09:37:08] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11870110 (10SLyngshede-WMF) Depooling command:  ` $ ssh cumin1003.eqiad.wmnet $ sudo cookbook sre.dns.admin depool ulsfo `
[09:37:08] <wikibugs>	 (03PS1) 10Tiziano Fogli: prometheus::pop: enable rsyncd on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1279253 (https://phabricator.wikimedia.org/T424024)
[09:37:10] <wikibugs>	 (03PS1) 10Tiziano Fogli: prometheus/eqsin: remove 5002, add 5003 [puppet] - 10https://gerrit.wikimedia.org/r/1279254 (https://phabricator.wikimedia.org/T424024)
[09:37:34] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1229.eqiad.wmnet with reason: host reimage
[09:39:08] <wikibugs>	 (03PS1) 10Tiziano Fogli: prom5003: clean up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1279255 (https://phabricator.wikimedia.org/T424024)
[09:39:23] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 0%, RTA = 562.65 ms
[09:39:40] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2194: after reimage to trixie
[09:40:31] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Test noop upgrade on the replica
[09:40:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:41:16] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2190.codfw.wmnet with reason: Reimage to Trixie
[09:41:21] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2190: Reimage to Trixie
[09:41:40] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2190: Reimage to Trixie
[09:42:09] <wikibugs>	 (03PS1) 10Tiziano Fogli: prometheus/eqsin: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1279256 (https://phabricator.wikimedia.org/T424024)
[09:42:15] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1229.eqiad.wmnet with reason: host reimage
[09:43:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T419961)', diff saved to https://phabricator.wikimedia.org/P91873 and previous config saved to /var/cache/conftool/dbconfig/20260429-094333-fceratto.json
[09:44:06] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Test noop upgrade on the replica
[09:44:28] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1166,db2190: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279257
[09:44:34] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2190.codfw.wmnet with reason: Reimage to Trixie
[09:44:34] <jinxer-wm>	 FIRING: [14x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:44:39] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2190: Reimage to Trixie
[09:44:46] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db2190: Reimage to Trixie
[09:45:52] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2190.codfw.wmnet with OS trixie
[09:46:18] <wikibugs>	 (03PS1) 10Arnaudb: jenkins: add log monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1278362 (https://phabricator.wikimedia.org/T421827)
[09:46:18] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] "self merging that change, I've tested the monitoring script in my homedir on contint1002 with no issue" [puppet] - 10https://gerrit.wikimedia.org/r/1278362 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb)
[09:51:22] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1166.eqiad.wmnet with reason: host reimage
[09:51:51] <wikibugs>	 (03PS1) 10Volans: cloud management: add RO Netbox for Spicerack [puppet] - 10https://gerrit.wikimedia.org/r/1279258
[09:52:03] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans)
[09:52:51] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:53:42] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P91874 and previous config saved to /var/cache/conftool/dbconfig/20260429-095341-fceratto.json
[09:53:54] <wikibugs>	 (03PS1) 10Btullis: Update the PKI intermediate for the cephosd clusters [puppet] - 10https://gerrit.wikimedia.org/r/1279260 (https://phabricator.wikimedia.org/T424672)
[09:53:56] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:54:30] <icinga-wm>	 PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:54:32] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2175.codfw.wmnet with OS trixie
[09:55:26] <wikibugs>	 (03PS2) 10Volans: cloud management: add RO Netbox for Spicerack [puppet] - 10https://gerrit.wikimedia.org/r/1279258
[09:55:30] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans)
[09:55:37] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:55:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:56:00] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279260 (https://phabricator.wikimedia.org/T424672) (owner: 10Btullis)
[09:56:57] <jinxer-wm>	 FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:57:15] <icinga-wm>	 RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 37.17 ms
[09:57:32] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: host reimage
[09:57:55] <wikibugs>	 (03CR) 10Volans: "PCC seems happy:" [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans)
[09:58:10] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:58:18] <tappof>	 Emperor: XioNoX ^^ My bad, I refreshed a dashboard for a test and launched heavyweight queries.
[09:58:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279260 (https://phabricator.wikimedia.org/T424672) (owner: 10Btullis)
[09:58:59] <wikibugs>	 (03Abandoned) 10Arnaudb: gerrit: disable connection reuse on the httpd → jetty layer [puppet] - 10https://gerrit.wikimedia.org/r/1269479 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb)
[09:59:14] <tappof>	 !incidents
[09:59:14] <sirenbot>	 7882 (UNACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[09:59:15] <sirenbot>	 7881 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw)
[09:59:15] <sirenbot>	 7880 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[09:59:15] <sirenbot>	 7879 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr2-eqord:9804 Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372} xe-0/1/3 gnmi eqiad)
[09:59:15] <sirenbot>	 7877 (RESOLVED)  kafka-jumbo1013/Kafka Broker Server (paged)
[09:59:46] <tappof>	 !ack 7882
[09:59:46] <sirenbot>	 7882 (ACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1000)
[10:00:15] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2175.codfw.wmnet with OS trixie
[10:00:16] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Update the PKI intermediate for the cephosd clusters [puppet] - 10https://gerrit.wikimedia.org/r/1279260 (https://phabricator.wikimedia.org/T424672) (owner: 10Btullis)
[10:00:57] <Emperor>	 tappof: thanks for letting us know. You expect it to self-resolve, or will something need kicking?
[10:01:08] <tappof>	 Emperor: XioNoX It should recover soon.
[10:01:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:02:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] cloud management: add RO Netbox for Spicerack [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans)
[10:03:10] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:03:50] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P91875 and previous config saved to /var/cache/conftool/dbconfig/20260429-100349-fceratto.json
[10:04:23] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2190.codfw.wmnet with reason: host reimage
[10:04:27] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1229.eqiad.wmnet with OS trixie
[10:05:20] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 4d 3h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[10:05:37] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:07:14] <wikibugs>	 (03PS1) 10Marostegui: db1229: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279262
[10:07:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1229: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279262 (owner: 10Marostegui)
[10:07:55] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1229: after reimage to trixie
[10:08:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279251 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[10:08:36] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:08:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279252 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[10:08:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279253 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[10:09:14] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:09:18] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2190.codfw.wmnet with reason: host reimage
[10:12:03] <Emperor>	 !log disable puppet in apus/codfw rgws for TLS key rollover T424674
[10:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:08] <stashbot>	 T424674: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674
[10:12:31] <wikibugs>	 (03CR) 10MVernon: [C:03+2] role::cephadm::rgw: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279230 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon)
[10:12:56] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti5005.eqsin.wmnet with OS bookworm
[10:13:05] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11870286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5005.eqsin.wmnet with OS bookworm executed with errors...
[10:13:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5005.eqsin.wmnet with OS bookworm
[10:13:44] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11870287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5005.eqsin.wmnet with OS bookworm
[10:13:58] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T419961)', diff saved to https://phabricator.wikimedia.org/P91877 and previous config saved to /var/cache/conftool/dbconfig/20260429-101358-fceratto.json
[10:14:10] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[10:14:18] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[10:14:26] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T419961)', diff saved to https://phabricator.wikimedia.org/P91878 and previous config saved to /var/cache/conftool/dbconfig/20260429-101426-fceratto.json
[10:15:33] <Emperor>	 !log disable puppet in apus/eqiad rgws for TLS key rollover T424674
[10:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:11] <wikibugs>	 (03CR) 10Jforrester: "Do we want to name these following MSB (so wikifunctions-evaluator-python/etc.)?" [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey)
[10:17:26] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2175.codfw.wmnet with reason: host reimage
[10:19:30] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1166,db2190: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279257 (owner: 10Marostegui)
[10:20:00] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1166.eqiad.wmnet with OS trixie
[10:20:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279255 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[10:20:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279254 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[10:21:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T419961)', diff saved to https://phabricator.wikimedia.org/P91879 and previous config saved to /var/cache/conftool/dbconfig/20260429-102142-fceratto.json
[10:21:48] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:22:32] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[10:23:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11870307 (10BTullis) Thanks all. I have now marked those two devices as active in netbox and I have told the Wikidata Platform team that t...
[10:23:36] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1039.eqiad.wmnet
[10:23:37] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1039.eqiad.wmnet
[10:24:34] <jinxer-wm>	 FIRING: [11x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:24:43] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1166: after reimage to trixie
[10:25:04] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2175.codfw.wmnet with reason: host reimage
[10:27:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11870321 (10MoritzMuehlenhoff)
[10:29:59] <moritzm>	 !log installing Envoy upgrades on chartmuseum* T410975 T419637
[10:30:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:04] <stashbot>	 T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
[10:30:05] <stashbot>	 T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637
[10:31:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P91882 and previous config saved to /var/cache/conftool/dbconfig/20260429-103150-fceratto.json
[10:31:57] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11870345 (10Blake) 05In progress→03Resolved The service has been excluded from the switchover, and...
[10:32:06] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2190.codfw.wmnet with OS trixie
[10:32:38] <moritzm>	 !log installing Envoy upgrades on webperf* T410975 T419637
[10:32:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:42] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11870353 (10MatthewVernon)
[10:34:34] <jinxer-wm>	 FIRING: [11x] CertAlmostExpired: Certificate for service apus:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:37:23] <wikibugs>	 (03PS1) 10MVernon: role::thanos::frontend: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279265 (https://phabricator.wikimedia.org/T424674)
[10:39:10] <wikibugs>	 (03PS1) 10Marostegui: db2175: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279266
[10:41:59] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P91885 and previous config saved to /var/cache/conftool/dbconfig/20260429-104158-fceratto.json
[10:42:47] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prom5003: assign prometheus::pop role [puppet] - 10https://gerrit.wikimedia.org/r/1279251 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[10:42:58] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prom5003: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1279252 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[10:43:19] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus::pop: enable rsyncd on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1279253 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[10:43:53] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2175: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279266 (owner: 10Marostegui)
[10:45:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5005.eqsin.wmnet with reason: host reimage
[10:45:07] <wikibugs>	 (03PS1) 10Jelto: etherpad: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279268 (https://phabricator.wikimedia.org/T420993)
[10:46:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:47:43] <wikibugs>	 (03PS1) 10STran: Enable staggered rollout for IRS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269
[10:48:03] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2175.codfw.wmnet with OS trixie
[10:48:28] <wikibugs>	 (03PS2) 10STran: Enable staggered rollout for IRS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 (https://phabricator.wikimedia.org/T424075)
[10:49:08] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2175: After reimage
[10:49:12] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2175: After reimage
[10:49:23] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2175: After reimage
[10:50:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5005.eqsin.wmnet with reason: host reimage
[10:50:38] <wikibugs>	 (03CR) 10Mszwarc: [C:03+1] Enable staggered rollout for IRS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[10:50:42] <wikibugs>	 06SRE, 10Observability-Metrics, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11870386 (10tappof)
[10:52:06] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T419961)', diff saved to https://phabricator.wikimedia.org/P91887 and previous config saved to /var/cache/conftool/dbconfig/20260429-105206-fceratto.json
[10:52:27] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[10:52:28] <wikibugs>	 (03PS2) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874)
[10:52:28] <wikibugs>	 (03CR) 10Federico Ceratto: "Flagging CR as ready for an initial review, but we still want to test it as discussed." [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto)
[10:52:35] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T419961)', diff saved to https://phabricator.wikimedia.org/P91888 and previous config saved to /var/cache/conftool/dbconfig/20260429-105234-fceratto.json
[10:53:20] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1229: after reimage to trixie
[10:54:02] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1039.eqiad.wmnet
[10:54:18] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1039.eqiad.wmnet
[10:54:54] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1039.eqiad.wmnet
[10:55:00] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1039.eqiad.wmnet
[10:55:56] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1279268 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto)
[10:57:25] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:57:55] <Emperor>	 !incidents
[10:57:56] <sirenbot>	 7883 (UNACKED)  [2x] CertAlmostExpired sre (phab1004:443 probes/custom eqiad)
[10:57:56] <sirenbot>	 7882 (RESOLVED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[10:57:56] <sirenbot>	 7881 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw)
[10:57:56] <sirenbot>	 7880 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[10:57:56] <sirenbot>	 7879 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr2-eqord:9804 Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372} xe-0/1/3 gnmi eqiad)
[10:57:57] <sirenbot>	 7877 (RESOLVED)  kafka-jumbo1013/Kafka Broker Server (paged)
[10:58:00] <Emperor>	 !ack
[10:58:00] <sirenbot>	 7883 (ACKED)  [2x] CertAlmostExpired sre (phab1004:443 probes/custom eqiad)
[10:58:11] <wikibugs>	 (03CR) 10Jelto: [C:03+2] etherpad: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279268 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto)
[10:58:16] <Emperor>	 elukey: are on-call about to get p.aged about a lot of certs?
[10:59:34] <Emperor>	 though phab1004 isn't in the link I get from the alert
[11:00:04] <jouncebot>	 mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1100).
[11:00:06] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T419961)', diff saved to https://phabricator.wikimedia.org/P91891 and previous config saved to /var/cache/conftool/dbconfig/20260429-110005-fceratto.json
[11:00:36] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2190: After reimage
[11:06:11] <wikibugs>	 (03PS1) 10Hnowlan: grafana: use discovery2026 intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1279271 (https://phabricator.wikimedia.org/T424673)
[11:07:11] <wikibugs>	 (03Abandoned) 10Marostegui: Revert "db1229,db2175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279249 (owner: 10Marostegui)
[11:08:19] <wikibugs>	 (03PS1) 10MVernon: role::swift::proxy: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279272 (https://phabricator.wikimedia.org/T424674)
[11:10:08] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1166: after reimage to trixie
[11:10:14] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P91895 and previous config saved to /var/cache/conftool/dbconfig/20260429-111013-fceratto.json
[11:11:17] <moritzm>	 !log installing libpng1.6 security updates
[11:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:31] <wikibugs>	 (03PS1) 10Jelto: aphlict: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279273 (https://phabricator.wikimedia.org/T420993)
[11:11:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5005.eqsin.wmnet with OS bookworm
[11:11:57] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11870507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5005.eqsin.wmnet with OS bookworm completed: - ganeti5...
[11:12:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279273 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto)
[11:12:32] <wikibugs>	 (03PS9) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049)
[11:13:10] <wikibugs>	 (03CR) 10Jelto: [C:03+2] aphlict: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279273 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto)
[11:16:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279271 (https://phabricator.wikimedia.org/T424673) (owner: 10Hnowlan)
[11:17:49] <wikibugs>	 (03CR) 10Marostegui: sre.mysql.global-read-only Set all sections as RO/RW (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto)
[11:18:11] <wikibugs>	 (03CR) 10Marostegui: "@Ladsgroup@gmail.com can you also check this please, to make sure nothing MW side would explode." [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto)
[11:20:22] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P91898 and previous config saved to /var/cache/conftool/dbconfig/20260429-112021-fceratto.json
[11:23:04] <wikibugs>	 (03PS1) 10Jelto: phabricator: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279274 (https://phabricator.wikimedia.org/T420993)
[11:23:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Add ganeti5005 to the routed Ganeti cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1279275 (https://phabricator.wikimedia.org/T421863)
[11:27:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1279274 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto)
[11:28:08] <wikibugs>	 (03CR) 10Jelto: [C:03+2] phabricator: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279274 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto)
[11:28:53] <wikibugs>	 (03PS1) 10Brouberol: Restore kerberos API authentication by explicitly setting an empty public role [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279277 (https://phabricator.wikimedia.org/T424761)
[11:30:24] <wikibugs>	 (03PS2) 10Brouberol: Restore kerberos API authentication by explicitly setting an empty public role [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279277 (https://phabricator.wikimedia.org/T424761)
[11:30:30] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T419961)', diff saved to https://phabricator.wikimedia.org/P91899 and previous config saved to /var/cache/conftool/dbconfig/20260429-113029-fceratto.json
[11:30:51] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[11:30:57] <wikibugs>	 (03PS3) 10Brouberol: Restore kerberos API authentication by explicitly setting an empty public role [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279277 (https://phabricator.wikimedia.org/T424761)
[11:31:06] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T419961)', diff saved to https://phabricator.wikimedia.org/P91901 and previous config saved to /var/cache/conftool/dbconfig/20260429-113105-fceratto.json
[11:31:25] <wikibugs>	 (03PS1) 10STran: Support staggered rollout via Test Kitchen [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279279 (https://phabricator.wikimedia.org/T424220)
[11:31:39] <wikibugs>	 (03PS1) 10STran: Update IRS instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279280 (https://phabricator.wikimedia.org/T424075)
[11:32:36] <wikibugs>	 (03PS1) 10Novem Linguae: purge_securepoll: don't exclude private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309)
[11:32:51] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Fantastic! Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279277 (https://phabricator.wikimedia.org/T424761) (owner: 10Brouberol)
[11:34:51] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2175: After reimage
[11:35:12] <wikibugs>	 (03CR) 10Mszwarc: [C:03+1] Support staggered rollout via Test Kitchen [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279279 (https://phabricator.wikimedia.org/T424220) (owner: 10STran)
[11:35:20] <wikibugs>	 (03CR) 10Mszwarc: [C:03+1] Update IRS instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279280 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[11:35:26] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Restore kerberos API authentication by explicitly setting an empty public role [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279277 (https://phabricator.wikimedia.org/T424761) (owner: 10Brouberol)
[11:35:34] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279279 (https://phabricator.wikimedia.org/T424220) (owner: 10STran)
[11:35:42] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279280 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[11:35:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[11:37:25] <jinxer-wm>	 RESOLVED: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:38:14] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T419961)', diff saved to https://phabricator.wikimedia.org/P91903 and previous config saved to /var/cache/conftool/dbconfig/20260429-113813-fceratto.json
[11:38:23] <wikibugs>	 (03PS1) 10Jelto: peopleweb: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279282 (https://phabricator.wikimedia.org/T420993)
[11:39:21] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[11:39:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279282 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto)
[11:39:54] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[11:40:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279272 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon)
[11:40:55] <wikibugs>	 (03CR) 10Jelto: [C:03+2] peopleweb: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279282 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto)
[11:41:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279265 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon)
[11:41:55] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] purge_securepoll: don't exclude private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) (owner: 10Novem Linguae)
[11:42:36] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[11:42:42] <wikibugs>	 (03CR) 10Dpogorzelski: "We don't need to change custom_deploy.d/istio/ml-serve/config.yaml, this config is no longer used" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[11:42:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:43:06] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[11:43:45] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[11:44:17] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply
[11:45:37] <wikibugs>	 (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279285
[11:46:00] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2190: After reimage
[11:46:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1279256 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[11:46:21] <wikibugs>	 (03PS10) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049)
[11:46:43] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Configure dse-k8s-worker nodes for ipip encapsulation [puppet] - 10https://gerrit.wikimedia.org/r/1278519 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis)
[11:46:46] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Add ganeti5005 to the routed Ganeti cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1279275 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[11:46:50] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[11:47:15] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[11:47:28] <wikibugs>	 (03PS2) 10Novem Linguae: purge_securepoll: don't exclude private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309)
[11:47:45] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[11:47:51] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[11:47:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:48:22] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P91905 and previous config saved to /var/cache/conftool/dbconfig/20260429-114821-fceratto.json
[11:48:29] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[11:51:19] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-fr-tech: apply
[11:51:48] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-fr-tech: apply
[11:51:56] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[11:52:26] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[11:52:33] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:53:11] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:53:16] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+2] Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[11:53:41] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[11:54:21] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[11:54:30] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply
[11:54:34] <Emperor>	 !log TLS key rollover for thanos-fe T424674
[11:54:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:39] <stashbot>	 T424674: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674
[11:54:49] <wikibugs>	 (03CR) 10MVernon: [C:03+2] role::thanos::frontend: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279265 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon)
[11:55:00] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply
[11:55:32] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) (owner: 10Cathal Mooney)
[11:55:56] <wikibugs>	 (03PS1) 10Jelto: doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279287 (https://phabricator.wikimedia.org/T420993)
[11:56:35] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply
[11:57:00] <wikibugs>	 (03PS2) 10Jelto: doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279287 (https://phabricator.wikimedia.org/T424669)
[11:57:07] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply
[11:57:14] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[11:57:46] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[11:57:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279287 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto)
[11:58:10] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1279287 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto)
[11:58:30] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P91906 and previous config saved to /var/cache/conftool/dbconfig/20260429-115829-fceratto.json
[12:00:23] <wikibugs>	 (03PS1) 10Marostegui: db1223,db2177: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279289 (https://phabricator.wikimedia.org/T424792)
[12:00:33] <wikibugs>	 (03Merged) 10jenkins-bot: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[12:00:47] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1223.eqiad.wmnet with reason: Reimage to Trixie
[12:00:52] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1223: Reimage to Trixie
[12:00:56] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM but I'm not really sure I get why this is beneficial?  Seems fine but I think I'm missing that bit, maybe in future we start setting " [puppet] - 10https://gerrit.wikimedia.org/r/1278390 (https://phabricator.wikimedia.org/T416360) (owner: 10Ayounsi)
[12:00:56] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2177.codfw.wmnet with reason: Reimage to Trixie
[12:00:58] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1223,db2177: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279289 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui)
[12:01:02] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2177: Reimage to Trixie
[12:01:19] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1223: Reimage to Trixie
[12:01:20] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2177: Reimage to Trixie
[12:01:31] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add pint ignore rules for CoreRouterInterfaceDropPercent [alerts] - 10https://gerrit.wikimedia.org/r/1277472 (owner: 10Cathal Mooney)
[12:01:34] <wikibugs>	 (03CR) 10Jelto: [C:03+2] doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279287 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto)
[12:02:57] <wikibugs>	 (03PS1) 10Marostegui: db1197: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279290 (https://phabricator.wikimedia.org/T424615)
[12:03:04] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1223.eqiad.wmnet with OS trixie
[12:03:12] <wikibugs>	 (03Merged) 10jenkins-bot: Add pint ignore rules for CoreRouterInterfaceDropPercent [alerts] - 10https://gerrit.wikimedia.org/r/1277472 (owner: 10Cathal Mooney)
[12:03:16] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2177.codfw.wmnet with OS trixie
[12:03:50] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1197.eqiad.wmnet with reason: Reimage to Trixie
[12:03:52] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1197: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279290 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui)
[12:03:55] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1197: Reimage to Trixie
[12:04:03] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2204.codfw.wmnet with reason: Reimage to Trixie
[12:04:09] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2204: Reimage to Trixie
[12:04:28] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2204: Reimage to Trixie
[12:04:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add ganeti5005 to the routed Ganeti cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1279275 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[12:04:33] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1197: Reimage to Trixie
[12:04:34] <jinxer-wm>	 FIRING: [9x] CertAlmostExpired: Certificate for service grafana:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[12:05:18] <hnowlan>	 jouncebot: nowandnext
[12:05:18] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 54 minute(s)
[12:05:18] <jouncebot>	 In 0 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1300)
[12:05:43] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1197.eqiad.wmnet with OS trixie
[12:05:58] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2204.codfw.wmnet with OS trixie
[12:06:23] <wikibugs>	 (03PS1) 10Marostegui: db2204: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279292 (https://phabricator.wikimedia.org/T424615)
[12:08:38] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T419961)', diff saved to https://phabricator.wikimedia.org/P91911 and previous config saved to /var/cache/conftool/dbconfig/20260429-120837-fceratto.json
[12:09:00] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[12:09:08] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T419961)', diff saved to https://phabricator.wikimedia.org/P91912 and previous config saved to /var/cache/conftool/dbconfig/20260429-120907-fceratto.json
[12:09:55] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1223,db2177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279298
[12:11:39] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:12:53] <wikibugs>	 06SRE, 10Observability-Metrics, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11870872 (10tappof)
[12:13:02] <wikibugs>	 (03CR) 10Elukey: [C:03+1] cloud management: add RO Netbox for Spicerack [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans)
[12:14:14] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:14:20] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:14:35] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:14:51] <wikibugs>	 (03CR) 10Elukey: "Hey James, fine for me, I have already added the configs in k8s for the current naming scheme, but I can change them. Lemme know!" [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey)
[12:16:00] <elukey>	 Emperor: sorry just seen your ping now, I think there are a few remaining systems with almost expired certs, they shouldn't be paging in theory
[12:16:05] <elukey>	 did you see otherwise?
[12:16:28] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:16:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T419961)', diff saved to https://phabricator.wikimedia.org/P91913 and previous config saved to /var/cache/conftool/dbconfig/20260429-121633-fceratto.json
[12:17:54] <Emperor>	 elukey: yeah, we got paged about phab1004 earlier (hence my question)
[12:18:34] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1223.eqiad.wmnet with reason: host reimage
[12:18:38] <wikibugs>	 (03CR) 10Volans: [C:03+2] cloud management: add RO Netbox for Spicerack [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans)
[12:19:20] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:19:51] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1197.eqiad.wmnet with reason: host reimage
[12:20:53] <wikibugs>	 (03PS1) 10Elukey: admin_ng: add extra TLS SANs for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279315 (https://phabricator.wikimedia.org/T424193)
[12:21:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet
[12:21:16] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[12:21:23] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2177.codfw.wmnet with reason: host reimage
[12:21:38] <Emperor>	 !log TLS key rollover for ms-fe T424674
[12:21:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:42] <stashbot>	 T424674: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674
[12:22:02] <wikibugs>	 (03CR) 10MVernon: [C:03+2] role::swift::proxy: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279272 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon)
[12:22:45] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2204.codfw.wmnet with reason: host reimage
[12:22:53] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[12:23:37] <elukey>	 Emperor: not sure why it happened, the CertAlmostExpired definition in the alerts repo doesn't have a page severity option afaics
[12:23:51] <wikibugs>	 (03PS1) 10Jelto: doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279317 (https://phabricator.wikimedia.org/T424669)
[12:24:15] <wikibugs>	 (03CR) 10Elukey: [C:03+1] doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279317 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto)
[12:24:16] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1223.eqiad.wmnet with reason: host reimage
[12:24:56] <wikibugs>	 (03CR) 10Jelto: "The old patch was in the wrong file I89a48749795b414dc51d3e6ff16b3c9d51b488a8. This should be the correct file." [puppet] - 10https://gerrit.wikimedia.org/r/1279317 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto)
[12:25:30] <wikibugs>	 (03CR) 10Jelto: [C:03+2] doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279317 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto)
[12:25:34] <Emperor>	 elukey: IHNI either, maybe something the service owners set up for that service?
[12:26:10] <elukey>	 I am wondering if it is just for services in service.yaml that can page
[12:26:42] <jelto>	 we have a dedicated blackbox check for Phab with a pag.ing severity. Maybe this triggered the pag.ing alert 
[12:26:42] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P91914 and previous config saved to /var/cache/conftool/dbconfig/20260429-122641-fceratto.json
[12:28:31] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: host reimage
[12:28:49] <Emperor>	 elukey: https://portal.victorops.com/ui/wikimedia/incident/7883/details has the details
[12:29:11] <elukey>	 yeah see what jelto wrote above --^
[12:29:27] <wikibugs>	 (03PS1) 10Elukey: role::chartmuseum: move to pki discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279320 (https://phabricator.wikimedia.org/T424671)
[12:29:53] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[12:30:33] <wikibugs>	 06SRE, 10Observability-Metrics, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11870959 (10tappof)
[12:31:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet
[12:31:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279320 (https://phabricator.wikimedia.org/T424671) (owner: 10Elukey)
[12:32:21] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1197.eqiad.wmnet with reason: host reimage
[12:33:10] <wikibugs>	 (03CR) 10Elukey: [C:03+2] role::chartmuseum: move to pki discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279320 (https://phabricator.wikimedia.org/T424671) (owner: 10Elukey)
[12:34:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5005.eqsin.wmnet to cluster eqsin02 and group 01
[12:34:44] <wikibugs>	 (03PS1) 10Jelto: releases: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279322 (https://phabricator.wikimedia.org/T424669)
[12:35:47] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5005.eqsin.wmnet to cluster eqsin02 and group 01
[12:36:07] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2204.codfw.wmnet with reason: host reimage
[12:36:48] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P91915 and previous config saved to /var/cache/conftool/dbconfig/20260429-123648-fceratto.json
[12:36:58] <wikibugs>	 (03PS2) 10Elukey: admin_ng: add extra TLS SANs for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279315 (https://phabricator.wikimedia.org/T424193)
[12:37:40] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus/eqsin: remove 5002, add 5003 [puppet] - 10https://gerrit.wikimedia.org/r/1279254 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[12:38:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719#11871050 (10Jclark-ctr) @jmeybohm can you update site.pp. it only has servers upto  wikik...
[12:38:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11871061 (10Jclark-ctr) 05Open→03Resolved
[12:39:20] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:39:34] <jinxer-wm>	 FIRING: [7x] CertAlmostExpired: Certificate for service grafana:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[12:40:25] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11871065 (10Jclark-ctr) a:03Jclark-ctr @JMeybohm  this server is out of warranty. i  could swap with a spare from decom server bu...
[12:40:55] <tappof>	 !log migrate prometheus5002 to prometheus5003 T424024
[12:40:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:59] <stashbot>	 T424024: Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024
[12:41:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279322 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto)
[12:42:30] <wikibugs>	 (03CR) 10AikoChou: "Thanks for working on this!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279244 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[12:42:43] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] ml-services: Deploy rr-multilingual latest model version on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279244 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[12:44:20] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:45:14] <wikibugs>	 (03PS1) 10Elukey: role::grafana: migrate to new pki intermediate discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279323 (https://phabricator.wikimedia.org/T424673)
[12:46:39] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1223.eqiad.wmnet with OS trixie
[12:46:49] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy rr-multilingual latest model version on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279244 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[12:46:56] <wikibugs>	 (03CR) 10Muehlenhoff: "There's already https://gerrit.wikimedia.org/r/c/operations/puppet/+/1279271" [puppet] - 10https://gerrit.wikimedia.org/r/1279323 (https://phabricator.wikimedia.org/T424673) (owner: 10Elukey)
[12:46:57] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T419961)', diff saved to https://phabricator.wikimedia.org/P91916 and previous config saved to /var/cache/conftool/dbconfig/20260429-124656-fceratto.json
[12:47:17] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[12:47:25] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1236 (T419961)', diff saved to https://phabricator.wikimedia.org/P91917 and previous config saved to /var/cache/conftool/dbconfig/20260429-124725-fceratto.json
[12:47:53] <wikibugs>	 (03CR) 10Jelto: [C:03+2] releases: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279322 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto)
[12:47:54] <wikibugs>	 (03Abandoned) 10Elukey: role::grafana: migrate to new pki intermediate discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279323 (https://phabricator.wikimedia.org/T424673) (owner: 10Elukey)
[12:48:11] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11871082 (10MoritzMuehlenhoff)
[12:48:24] <wikibugs>	 (03CR) 10Elukey: [C:03+2] grafana: use discovery2026 intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1279271 (https://phabricator.wikimedia.org/T424673) (owner: 10Hnowlan)
[12:49:00] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Deploy rr-multilingual latest model version on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279244 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[12:49:02] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11871084 (10MoritzMuehlenhoff)
[12:49:32] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11871087 (10MoritzMuehlenhoff)
[12:50:10] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11871088 (10MoritzMuehlenhoff)
[12:50:20] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11871089 (10JMeybohm) >>! In T424797#11871065, @Jclark-ctr wrote: > @JMeybohm  this server is out of warranty. i  could swap with a...
[12:50:39] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2177.codfw.wmnet with OS trixie
[12:51:08] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1223: after reimage to trixie
[12:53:29] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus/eqsin: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1279256 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[12:53:32] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1223,db2177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279298 (owner: 10Marostegui)
[12:53:55] <logmsgbot>	 !log tappof@dns1004 START - running authdns-update
[12:54:02] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2204: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279292 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui)
[12:54:30] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1197.eqiad.wmnet with OS trixie
[12:54:34] <jinxer-wm>	 FIRING: [5x] CertAlmostExpired: Certificate for service grafana:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[12:54:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279285 (owner: 10Muehlenhoff)
[12:54:45] <wikibugs>	 (03PS1) 10Bartosz Wójtowicz: Add 50051 to istio ingressgateway ports for ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279327 (https://phabricator.wikimedia.org/T424049)
[12:55:21] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] Add 50051 to istio ingressgateway ports for ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279327 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[12:55:32] <logmsgbot>	 !log tappof@dns1004 END - running authdns-update
[12:55:37] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[12:56:02] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply
[12:56:07] <wikibugs>	 (03PS1) 10Jelto: jenkins: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279329 (https://phabricator.wikimedia.org/T424669)
[12:56:46] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1279329 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto)
[12:56:47] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs [extensions/CentralAuth] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1279330
[12:56:57] <wikibugs>	 (03CR) 10Jelto: [C:03+2] jenkins: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279329 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto)
[12:57:03] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs [extensions/CentralAuth] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279331
[12:57:14] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1279330 (owner: 10Bartosz Dziewoński)
[12:57:14] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11871111 (10MoritzMuehlenhoff)
[12:57:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279331 (owner: 10Bartosz Dziewoński)
[12:57:27] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2204.codfw.wmnet with OS trixie
[12:57:36] <MatmaRex>	 jouncebot: next
[12:57:36] <jouncebot>	 In 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1300)
[12:57:41] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1197: after reimage to trixie
[12:57:45] <logmsgbot>	 !log urbanecm@deploy1003 mwscript-k8s job started: GrowthExperiments:reassignMentees --wiki=enwiki --mentor=GrayStorm --performer=GrayStorm --as-job  # T418194
[12:57:49] <stashbot>	 T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194
[12:57:56] <wikibugs>	 (03PS3) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874)
[12:58:11] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply
[12:58:27] <MatmaRex>	 hi folks, i added some small patches to the window, i hope you can fit them in (i don't have deployment access). they are safe to deploy together with other changes.
[12:58:32] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2177: after reimage to trixie
[12:58:46] <wikibugs>	 (03CR) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto)
[12:59:11] <stephanebisson>	 MatmaRex the window is quite busy but I can try. Are you able to test them?
[12:59:18] <MatmaRex>	 yeah
[12:59:19] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:59:37] <wikibugs>	 (03PS1) 10Cathal Mooney: QoS: Map packets marked with DSCP CS1 into low-prirority class [homer/public] - 10https://gerrit.wikimedia.org/r/1279334 (https://phabricator.wikimedia.org/T424640)
[12:59:53] <wikibugs>	 (03CR) 10Federico Ceratto: "I added ask_confirmation and more detailed log messages and phabricator update. Can I add x1 and x3 as the s* sections or with a differe" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto)
[12:59:59] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2204: after reimage to trixie
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1300).
[13:00:05] <jouncebot>	 codenamenoreste, stephanebisson, Tran, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:09] <stephanebisson>	 o/
[13:00:12] <Tran>	 o/
[13:00:22] <codenamenoreste>	 i'm here
[13:00:33] <stephanebisson>	 codenamenoreste can you do your patch?
[13:00:55] <wikibugs>	 (03PS4) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874)
[13:00:58] <stephanebisson>	 Or I can help
[13:01:51] <wikibugs>	 (03PS5) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874)
[13:02:32] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply
[13:02:32] <stephanebisson>	 codenamenoreste are you able/willing to deploy your own patch or do you want someone else to do it?
[13:02:53] <wikibugs>	 (03PS1) 10STran: Instrument link clicks on success pages per spec [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279335 (https://phabricator.wikimedia.org/T424075)
[13:02:54] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+2] Add 50051 to istio ingressgateway ports for ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279327 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[13:03:35] <wikibugs>	 (03CR) 10Mszwarc: [C:03+1] Instrument link clicks on success pages per spec [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279335 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[13:03:50] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply
[13:03:57] <wikibugs>	 (03PS1) 10JMeybohm: Add wikikube-worker13[73-82] to site.pp and preseed [puppet] - 10https://gerrit.wikimedia.org/r/1279336 (https://phabricator.wikimedia.org/T423719)
[13:04:20] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:04:34] <jinxer-wm>	 FIRING: [4x] CertAlmostExpired: Certificate for service grafana:443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:04:50] <stephanebisson>	 I'll start with my patch in the meantime
[13:06:10] <wikibugs>	 06SRE, 10DNS, 06Traffic: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Anthropic - https://phabricator.wikimedia.org/T424785#11871180 (10ssingh) a:03CDobbins
[13:06:52] <stephanebisson>	 I can't reach deploy1003.eqiad.wmnet. Is there another server I should use?
[13:08:43] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] Add wikikube-worker13[73-82] to site.pp and preseed [puppet] - 10https://gerrit.wikimedia.org/r/1279336 (https://phabricator.wikimedia.org/T423719) (owner: 10JMeybohm)
[13:09:00] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[13:09:18] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Add wikikube-worker13[73-82] to site.pp and preseed [puppet] - 10https://gerrit.wikimedia.org/r/1279336 (https://phabricator.wikimedia.org/T423719) (owner: 10JMeybohm)
[13:10:14] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply
[13:10:25] <wikibugs>	 (03PS1) 10Cathal Mooney: Network QoS: adjust configuration to mark low-priority traffic as CS1 [puppet] - 10https://gerrit.wikimedia.org/r/1279339 (https://phabricator.wikimedia.org/T424640)
[13:10:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719#11871190 (10JMeybohm) >>! In T423719#11871050, @Jclark-ctr wrote: > @jmeybohm can you update site.pp. it only...
[13:11:01] <wikibugs>	 (03Merged) 10jenkins-bot: Add 50051 to istio ingressgateway ports for ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279327 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[13:11:46] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11871199 (10Jclark-ctr) Dimm has been swapped Thank you   old dimm ` BankLabelA CacheSizeInformation Not Available CPUAffinity1 Cur...
[13:12:01] <Tran>	 stephanebisson: Are you still having trouble?
[13:12:44] <stephanebisson>	 OK, my problem is resolved.
[13:12:59] <stephanebisson>	 codenamenoreste are you able to deploy your change or do you need help?
[13:13:23] <MatmaRex>	 i don't think they have deployment access
[13:13:37] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:13:44] <codenamenoreste>	 I was about to say that ^^
[13:13:59] <stephanebisson>	 codenamenoreste OK I'm starting with your patch
[13:14:02] <icinga-wm>	 RECOVERY - Host wikikube-worker1039 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[13:14:17] <stephanebisson>	 Sorry for the delay
[13:14:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) (owner: 10Codename Noreste)
[13:14:41] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11871223 (10SLyngshede-WMF)
[13:15:18] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11871225 (10Eevans)
[13:15:28] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:15:30] <wikibugs>	 (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) (owner: 10Codename Noreste)
[13:15:36] <wikibugs>	 (03Merged) 10jenkins-bot: lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) (owner: 10Codename Noreste)
[13:15:40] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11871228 (10Jclark-ctr) 05Open→03Resolved
[13:16:04] <logmsgbot>	 !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1271215|lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users (T423100)]]
[13:16:08] <stashbot>	 T423100: [lbwiki] Limit ContentTranslation to autoconfirmed and confirmed users - https://phabricator.wikimedia.org/T423100
[13:16:12] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[13:16:48] <stephanebisson>	 codenamenoreste will you be able to test your change against the test servers using the WikimediaDebug browser extension?
[13:17:57] <logmsgbot>	 !log sbisson@deploy1003 sbisson, codenamenoreste: Backport for [[gerrit:1271215|lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users (T423100)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:18:05] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279335 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[13:18:13] <codenamenoreste>	 I'm going to log in to my alt test account to verify the changes on lbwiki
[13:18:23] <Tran>	 ^ that patch is just going to go with my current stack
[13:18:40] <stephanebisson>	 codenamenoreste ready for you to test
[13:19:26] <wikibugs>	 (03PS1) 10Muehlenhoff: tlsproxy::envoy: Bump default now that services have moved [puppet] - 10https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993)
[13:19:37] <wikibugs>	 (03PS2) 10Cathal Mooney: Network QoS: adjust configuration to mark low-priority traffic as CS1 [puppet] - 10https://gerrit.wikimedia.org/r/1279339 (https://phabricator.wikimedia.org/T424640)
[13:19:53] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279339 (https://phabricator.wikimedia.org/T424640) (owner: 10Cathal Mooney)
[13:20:55] <codenamenoreste>	 using incognito and my alternate account, without the change the content translation extension lists article suggestions, but with the patch activated, it doesn't display anything
[13:21:10] <codenamenoreste>	 ^ such suggestions, I meant
[13:21:37] <moritzm>	 !log installing tiff security updates
[13:21:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:59] <stephanebisson>	 codenamenoreste there appears to be a problem with the suggestions system at the moment
[13:22:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719#11871249 (10Jclark-ctr) @jmeybohm. Sorry for the duplicate work.  I just finished moving and cabling everythi...
[13:22:20] <stephanebisson>	 But the patch looks good; I think we can go ahead with the change
[13:22:28] <codenamenoreste>	 Go ahead :)
[13:22:44] <logmsgbot>	 !log sbisson@deploy1003 sbisson, codenamenoreste: Continuing with deployment
[13:22:52] <wikibugs>	 (03CR) 10Bking: [C:03+2] wcqs: Migrate to new discovery intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278610 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking)
[13:23:51] <wikibugs>	 (03PS3) 10Cathal Mooney: Network QoS: adjust configuration to mark low-priority traffic as CS1 [puppet] - 10https://gerrit.wikimedia.org/r/1279339 (https://phabricator.wikimedia.org/T424640)
[13:24:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Add bast5005 [puppet] - 10https://gerrit.wikimedia.org/r/1279343 (https://phabricator.wikimedia.org/T421863)
[13:26:21] <wikibugs>	 (03CR) 10Majavah: zookeeper: allow overriding the zookeeper host ID (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[13:26:35] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T419961)', diff saved to https://phabricator.wikimedia.org/P91928 and previous config saved to /var/cache/conftool/dbconfig/20260429-132635-fceratto.json
[13:26:37] <logmsgbot>	 !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271215|lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users (T423100)]] (duration: 10m 33s)
[13:26:42] <stashbot>	 T423100: [lbwiki] Limit ContentTranslation to autoconfirmed and confirmed users - https://phabricator.wikimedia.org/T423100
[13:27:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278584 (https://phabricator.wikimedia.org/T417200) (owner: 10Sbisson)
[13:27:59] <wikibugs>	 (03Merged) 10jenkins-bot: testwiki: Article Guidance experiment config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278584 (https://phabricator.wikimedia.org/T417200) (owner: 10Sbisson)
[13:28:26] <logmsgbot>	 !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1278584|testwiki: Article Guidance experiment config (T417200)]]
[13:28:30] <stashbot>	 T417200: Deploy Article Guidance extension to production (testwiki) - https://phabricator.wikimedia.org/T417200
[13:29:20] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:30:16] <logmsgbot>	 !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1278584|testwiki: Article Guidance experiment config (T417200)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:31:33] <wikibugs>	 (03PS5) 10Andrew Bogott: zookeeper: allow overriding the zookeeper host ID [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646)
[13:31:34] <wikibugs>	 (03PS3) 10Andrew Bogott: Designate: use zookeeper as the tooz backend, everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T422646)
[13:31:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] alertmanager: add frack networks to iptables allow on 9093 [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) (owner: 10Dwisehaupt)
[13:31:52] <wikibugs>	 (03PS1) 10Tiziano Fogli: Revert "prometheus::pop: enable rsyncd on eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1279345
[13:32:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS trixie
[13:33:04] <wikibugs>	 (03CR) 10Andrew Bogott: zookeeper: allow overriding the zookeeper host ID (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott)
[13:33:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11871286 (10Jclark-ctr)
[13:33:29] <logmsgbot>	 !log sbisson@deploy1003 sbisson: Continuing with deployment
[13:34:12] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:34:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:34:31] <stephanebisson>	 Tran will you do your changes or do you want me to?
[13:34:34] <jinxer-wm>	 RESOLVED: [2x] CertAlmostExpired: Certificate for service wcqs:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wcqs:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:34:35] <Tran>	 I can do it
[13:35:27] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service wcqs:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wcqs:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:35:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] Revert "prometheus::pop: enable rsyncd on eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1279345 (owner: 10Tiziano Fogli)
[13:35:43] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] Revert "prometheus::pop: enable rsyncd on eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1279345 (owner: 10Tiziano Fogli)
[13:36:11] <wikibugs>	 (03PS2) 10Tiziano Fogli: prom5003: clean up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1279255 (https://phabricator.wikimedia.org/T424024)
[13:36:32] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1223: after reimage to trixie
[13:36:40] <codenamenoreste>	 stephanebisson I have one more patch to deploy, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1274928
[13:36:43] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P91933 and previous config saved to /var/cache/conftool/dbconfig/20260429-133643-fceratto.json
[13:36:58] <wikibugs>	 (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274928 (https://phabricator.wikimedia.org/T423461) (owner: 10Codename Noreste)
[13:37:17] <logmsgbot>	 !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278584|testwiki: Article Guidance experiment config (T417200)]] (duration: 08m 51s)
[13:37:22] <stashbot>	 T417200: Deploy Article Guidance extension to production (testwiki) - https://phabricator.wikimedia.org/T417200
[13:37:52] <stephanebisson>	 Tran over to you
[13:38:11] <stephanebisson>	 codenamenoreste if there is time at the end of the window
[13:38:21] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prom5003: clean up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1279255 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli)
[13:38:36] <codenamenoreste>	 it's 8:38 a.m. where I live right now, so we might still have time
[13:38:48] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:38:49] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11871308 (10MatthewVernon)
[13:39:10] <Tran>	 starting
[13:39:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1278380 (https://phabricator.wikimedia.org/T420517) (owner: 10STran)
[13:39:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279279 (https://phabricator.wikimedia.org/T424220) (owner: 10STran)
[13:39:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279280 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[13:39:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279335 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[13:39:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[13:39:52] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11871310 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Rune in the description probably should be more like `open...
[13:40:14] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:40:28] <jinxer-wm>	 RESOLVED: [2x] CertAlmostExpired: Certificate for service wcqs:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wcqs:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:40:48] <wikibugs>	 (03Merged) 10jenkins-bot: Enable staggered rollout for IRS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[13:41:31] <wikibugs>	 (03Merged) 10jenkins-bot: Update action parameter for bulk blocking instrumented events [extensions/CheckUser] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1278380 (https://phabricator.wikimedia.org/T420517) (owner: 10STran)
[13:41:33] <wikibugs>	 (03Merged) 10jenkins-bot: Support staggered rollout via Test Kitchen [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279279 (https://phabricator.wikimedia.org/T424220) (owner: 10STran)
[13:42:05] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:42:12] <wikibugs>	 (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1)
[13:43:02] <wikibugs>	 (03Merged) 10jenkins-bot: Update IRS instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279280 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[13:43:06] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1197: after reimage to trixie
[13:43:17] <wikibugs>	 (03Merged) 10jenkins-bot: Instrument link clicks on success pages per spec [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279335 (https://phabricator.wikimedia.org/T424075) (owner: 10STran)
[13:43:48] <logmsgbot>	 !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1278380|Update action parameter for bulk blocking instrumented events (T420517)]], [[gerrit:1279279|Support staggered rollout via Test Kitchen (T424220)]], [[gerrit:1279280|Update IRS instrumentation (T424075)]], [[gerrit:1279335|Instrument link clicks on success pages per spec (T424075)]], [[gerrit:1279269|Enable staggered rollout for IRS on testwiki (T424075)]]
[13:43:58] <stashbot>	 T420517: Instrument bulk blocking of connected temporary accounts - https://phabricator.wikimedia.org/T420517
[13:43:58] <stashbot>	 T424220: IRS should support full deployment and experiment rollout percentages - https://phabricator.wikimedia.org/T424220
[13:43:58] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2177: after reimage to trixie
[13:43:59] <stashbot>	 T424075: Update instrumentation MVP for enwiki 5% rollout - https://phabricator.wikimedia.org/T424075
[13:44:03] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1279346 (https://phabricator.wikimedia.org/T424848)
[13:44:05] <wikibugs>	 (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1)
[13:44:20] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:45:10] <wikibugs>	 (03PS1) 10Elukey: role::crm: update postfix's cfssl pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1279347 (https://phabricator.wikimedia.org/T420993)
[13:45:24] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2204: after reimage to trixie
[13:45:40] <logmsgbot>	 !log stran@deploy1003 stran: Backport for [[gerrit:1278380|Update action parameter for bulk blocking instrumented events (T420517)]], [[gerrit:1279279|Support staggered rollout via Test Kitchen (T424220)]], [[gerrit:1279280|Update IRS instrumentation (T424075)]], [[gerrit:1279335|Instrument link clicks on success pages per spec (T424075)]], [[gerrit:1279269|Enable staggered rollout for IRS on testwiki (T424075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:46:03] <wikibugs>	 (03CR) 10Codename Noreste: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1)
[13:46:16] <Tran>	 testing now
[13:46:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P91937 and previous config saved to /var/cache/conftool/dbconfig/20260429-134651-fceratto.json
[13:47:01] <wikibugs>	 (03PS1) 10Marostegui: db1157,db2156: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279351 (https://phabricator.wikimedia.org/T424792)
[13:47:15] <codenamenoreste>	 so, I still have a patch to check for ukwiki which is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1274928
[13:47:23] <Tran>	 tests look good, continuing
[13:47:26] <logmsgbot>	 !log stran@deploy1003 stran: Continuing with deployment
[13:47:49] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1157.eqiad.wmnet with reason: Reimage to Trixie
[13:47:50] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2156.codfw.wmnet with reason: Reimage to Trixie
[13:47:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1157,db2156: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279351 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui)
[13:47:54] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1157: Reimage to Trixie
[13:47:56] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2156: Reimage to Trixie
[13:48:14] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2156: Reimage to Trixie
[13:48:22] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1157: Reimage to Trixie
[13:48:38] <wikibugs>	 06SRE, 10Observability-Metrics, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11871399 (10tappof)
[13:49:54] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2156.codfw.wmnet with OS trixie
[13:50:10] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1157.eqiad.wmnet with OS trixie
[13:51:14] <logmsgbot>	 !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278380|Update action parameter for bulk blocking instrumented events (T420517)]], [[gerrit:1279279|Support staggered rollout via Test Kitchen (T424220)]], [[gerrit:1279280|Update IRS instrumentation (T424075)]], [[gerrit:1279335|Instrument link clicks on success pages per spec (T424075)]], [[gerrit:1279269|Enable staggered rollout for IRS on testwiki (T424075)]] (duration: 07m 26s)
[13:51:28] <stashbot>	 T420517: Instrument bulk blocking of connected temporary accounts - https://phabricator.wikimedia.org/T420517
[13:51:29] <stashbot>	 T424220: IRS should support full deployment and experiment rollout percentages - https://phabricator.wikimedia.org/T424220
[13:51:29] <stashbot>	 T424075: Update instrumentation MVP for enwiki 5% rollout - https://phabricator.wikimedia.org/T424075
[13:51:33] <Tran>	 done. I think MatmaRex is next?
[13:52:01] <MatmaRex>	 i don't have deployment access, can anyone else ship the changes?
[13:52:14] <Tran>	 yeah I'm still in spiderpig. Can you test?
[13:52:16] <stephanebisson>	 MatmaRex I can do it
[13:52:22] <Tran>	 oh sure, feel free
[13:52:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage
[13:52:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1279330 (owner: 10Bartosz Dziewoński)
[13:52:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279331 (owner: 10Bartosz Dziewoński)
[13:52:52] <codenamenoreste>	 one more reminder, I still have one more patch to deploy for ukwiki
[13:52:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1279347 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[13:53:45] <logmsgbot>	 !log tappof@cumin1003 START - Cookbook sre.hosts.decommission for hosts prometheus5002.eqsin.wmnet
[13:54:25] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:54:42] <wikibugs>	 (03PS1) 10Elukey: pki: add the discovery2026 intermediate in cloud-pki [puppet] - 10https://gerrit.wikimedia.org/r/1279356 (https://phabricator.wikimedia.org/T424549)
[13:55:13] <wikibugs>	 (03CR) 10Elukey: [C:03+2] role::crm: update postfix's cfssl pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1279347 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[13:55:23] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279247 (https://phabricator.wikimedia.org/T424624) (owner: 10JavierMonton)
[13:55:39] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:55:51] <wikibugs>	 (03Merged) 10jenkins-bot: CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs [extensions/CentralAuth] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1279330 (owner: 10Bartosz Dziewoński)
[13:55:52] <wikibugs>	 (03Merged) 10jenkins-bot: CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs [extensions/CentralAuth] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279331 (owner: 10Bartosz Dziewoński)
[13:56:03] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1157,db2156: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279359
[13:56:25] <logmsgbot>	 !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1279330|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]], [[gerrit:1279331|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]]
[13:56:45] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:57:00] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T419961)', diff saved to https://phabricator.wikimedia.org/P91940 and previous config saved to /var/cache/conftool/dbconfig/20260429-135659-fceratto.json
[13:57:04] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[13:57:13] <wikibugs>	 (03PS1) 10Bartosz Wójtowicz: ml-services: Use gRPC port for staging outlink-topic-model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279360 (https://phabricator.wikimedia.org/T424049)
[13:58:14] <logmsgbot>	 !log sbisson@deploy1003 matmarex, sbisson: Backport for [[gerrit:1279330|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]], [[gerrit:1279331|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:58:36] <logmsgbot>	 !log tappof@cumin1003 START - Cookbook sre.dns.netbox
[13:58:40] <stephanebisson>	 MatmaRex can you test?
[13:58:46] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-04-14-215402 to 2026-04-21-184122 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279362 (https://phabricator.wikimedia.org/T402956)
[13:58:57] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-04-15-195941 to 2026-04-29-001940 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279363 (https://phabricator.wikimedia.org/T400517)
[13:58:59] <wikibugs>	 06SRE-OnFire, 10SRE-swift-storage, 07Sustainability (Incident Followup): Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913#11871485 (10hnowlan)
[13:59:01] <MatmaRex>	 yep, looking
[13:59:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage
[13:59:56] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] ml-services: Use gRPC port for staging outlink-topic-model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279360 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1400)
[14:00:24] <stephanebisson>	 We're wrapping config deployment
[14:00:34] <stephanebisson>	 *wrapping up
[14:00:37] <MatmaRex>	 stephanebisson: thanks, looks good
[14:00:42] <logmsgbot>	 !log sbisson@deploy1003 matmarex, sbisson: Continuing with deployment
[14:01:20] <MatmaRex>	 did we finish codenamenoreste's deployments? i saw some message about it earlier
[14:01:51] <wikibugs>	 (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1279367
[14:01:57] <codenamenoreste>	 ¯\_(ツ)_/¯
[14:02:40] <wikibugs>	 (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade evaluators from 2026-04-14-215402 to 2026-04-21-184122 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279362 (https://phabricator.wikimedia.org/T402956) (owner: 10Jforrester)
[14:03:05] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Use gRPC port for staging outlink-topic-model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279360 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[14:03:18] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: host reimage
[14:03:25] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:03:34] <logmsgbot>	 !log tappof@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1003"
[14:04:09] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[14:04:20] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job envoy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:04:34] <logmsgbot>	 !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279330|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]], [[gerrit:1279331|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]] (duration: 08m 08s)
[14:04:50] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-04-14-215402 to 2026-04-21-184122 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279362 (https://phabricator.wikimedia.org/T402956) (owner: 10Jforrester)
[14:04:51] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11871539 (10FCeratto-WMF) @ayounsi an amount of data is exposed by https://zarcillo.wikimedia.org/apidocs#/default/get_sections_data_api_v0_sections_get but we can create a simp...
[14:05:14] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Use gRPC port for staging outlink-topic-model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279360 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[14:05:20] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 3d 23h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[14:05:30] <wikibugs>	 (03CR) 10Elukey: [C:03+2] pki: add the discovery2026 intermediate in cloud-pki [puppet] - 10https://gerrit.wikimedia.org/r/1279356 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey)
[14:05:40] <wikibugs>	 (03CR) 10Elukey: pki: add the discovery2026 intermediate in cloud-pki [puppet] - 10https://gerrit.wikimedia.org/r/1279356 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey)
[14:06:11] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[14:06:16] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:06:40] <logmsgbot>	 tappof@cumin1003 decommission (PID 2577680) is awaiting input
[14:06:44] <wikibugs>	 (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1279367 (owner: 10Elukey)
[14:06:45] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:07:45] <wikibugs>	 (03PS1) 10Elukey: Upstream release v12.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1279371
[14:08:01] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1279371 (owner: 10Elukey)
[14:08:40] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: host reimage
[14:09:03] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] role::crm: update postfix's cfssl pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1279347 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey)
[14:09:10] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2156.codfw.wmnet with reason: host reimage
[14:09:20] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job envoy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279356 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey)
[14:11:56] <wikibugs>	 (03CR) 10Elukey: [C:03+2] pki: add the discovery2026 intermediate in cloud-pki [puppet] - 10https://gerrit.wikimedia.org/r/1279356 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey)
[14:13:10] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2156.codfw.wmnet with reason: host reimage
[14:14:01] <wikibugs>	 (03CR) 10Jforrester: "Let's keep these ones this way around, and the new (replacement, Rust-based) ones can be "better named"?" [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey)
[14:15:10] <icinga-wm>	 RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Wed 27 May 2026 01:53:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[14:16:34] <wikibugs>	 (03PS1) 10MVernon: swift: remove 2 drained nodes from rings for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1279372 (https://phabricator.wikimedia.org/T354872)
[14:16:57] <elukey>	 !log uploaded spicerack_12.5.0 to apt.wikimedia.org bookworm-wikimedia
[14:16:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:43] <wikibugs>	 (03CR) 10Elukey: "sure!" [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey)
[14:18:09] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:18:41] <wikibugs>	 (03PS1) 10Gkyziridis: ml-services: Use concurrency knative metric for rr-multilingual model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279373 (https://phabricator.wikimedia.org/T415892)
[14:18:43] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1375
[14:18:51] <wikibugs>	 (03PS1) 10Bartosz Wójtowicz: Enable Knative HTTP/2 auto-detection on ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279375 (https://phabricator.wikimedia.org/T424049)
[14:19:24] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:19:32] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:19:56] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1375
[14:19:57] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] Enable Knative HTTP/2 auto-detection on ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279375 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[14:20:22] <elukey>	 James_F: o/ I haven't deployed the new mesh/ingress changes yet to prod, they are relatively safe to push forward but I am missing https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1279315. Ping me if you deploy to prod so we can check together
[14:20:58] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[14:21:05] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:21:06] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2152 (T419961)', diff saved to https://phabricator.wikimedia.org/P91941 and previous config saved to /var/cache/conftool/dbconfig/20260429-142105-fceratto.json
[14:21:10] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:21:16] <James_F>	 elukey: Ack. This is our weekly deploy window now.
[14:21:22] <logmsgbot>	 !log tappof@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1003"
[14:21:22] <logmsgbot>	 !log tappof@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:21:24] <logmsgbot>	 !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus5002.eqsin.wmnet
[14:21:38] <wikibugs>	 06SRE, 10Observability-Metrics, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11871640 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by tappof@cumin1003 for hosts: `prometheus5002.eqsin.wmnet...
[14:21:46] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:22:25] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] "Merging this HotFix on staging. Tested on experimental." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279373 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[14:22:30] <James_F>	 elukey: Do we need to stop deploying before the admin_ng bit is merged?
[14:22:41] <wikibugs>	 (03PS2) 10MVernon: swift: remove 2 drained nodes from rings, set for new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1279372 (https://phabricator.wikimedia.org/T354872)
[14:23:02] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+2] Enable Knative HTTP/2 auto-detection on ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279375 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[14:23:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1007.eqiad.wmnet with OS trixie
[14:23:18] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:24:23] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Use concurrency knative metric for rr-multilingual model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279373 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[14:24:44] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:25:06] <wikibugs>	 (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279377 (https://phabricator.wikimedia.org/T419511)
[14:25:26] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:25:53] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[14:25:54] <elukey>	 James_F: in theory no, the new ingress stuff will just sit there on the side
[14:25:57] <James_F>	 Ack.
[14:26:03] <James_F>	 So far looks good.
[14:26:22] <wikibugs>	 (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-04-15-195941 to 2026-04-29-001940 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279363 (https://phabricator.wikimedia.org/T400517) (owner: 10Jforrester)
[14:26:43] <wikibugs>	 (03PS1) 10Elukey: sre.hosts: fix ipmi() calls after spicerack 12.5.0 [cookbooks] - 10https://gerrit.wikimedia.org/r/1279379 (https://phabricator.wikimedia.org/T418929)
[14:27:12] <wikibugs>	 (03CR) 10Elukey: "Related change: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1271631" [cookbooks] - 10https://gerrit.wikimedia.org/r/1279379 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[14:27:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11871688 (10Jclark-ctr) netbox has been updated  , network ports  configured.  Pending ru...
[14:28:47] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11871716 (10A_smart_kitten) Prompted by {T424511}, I'm probably gonna try and work a bit from (subsets of) [[https://codesearch.wmclo...
[14:29:17] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T419961)', diff saved to https://phabricator.wikimedia.org/P91942 and previous config saved to /var/cache/conftool/dbconfig/20260429-142916-fceratto.json
[14:29:33] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1157,db2156: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279359 (owner: 10Marostegui)
[14:29:58] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1157.eqiad.wmnet with OS trixie
[14:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1400)
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1430)
[14:30:40] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Knative HTTP/2 auto-detection on ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279375 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz)
[14:30:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T424654#11871742 (10Jclark-ctr) @BTullis @RKemper   Parts have Arrived 2x drives.  for replacement of  Physical Disk 0:1:4 Physical Disk 0:1:5    Please let me kno...
[14:32:08] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:33:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11871791 (10MoritzMuehlenhoff)
[14:33:56] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:34:26] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-04-15-195941 to 2026-04-29-001940 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279363 (https://phabricator.wikimedia.org/T400517) (owner: 10Jforrester)
[14:34:41] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1157: after reimage to trixie
[14:34:42] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:35:20] <moritzm>	 !log installing zsh updates from Trixie point release
[14:35:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:44] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:36:08] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:36:29] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:36:33] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:37:01] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:37:07] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2156.codfw.wmnet with OS trixie
[14:37:07] <inflatador>	 !log bking@cloudelastic1010 run smartctl against all physical disks T424852 
[14:37:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:20] <stashbot>	 T424852: Investigate performance issues in cloudelastic - https://phabricator.wikimedia.org/T424852
[14:37:20] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:37:49] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:37:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11871842 (10MoritzMuehlenhoff)
[14:38:33] <wikibugs>	 (03CR) 10Phuedx: [C:03+1] Test Kitchen UI: Deploy v1.3.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279377 (https://phabricator.wikimedia.org/T419511) (owner: 10Santiago Faci)
[14:39:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P91944 and previous config saved to /var/cache/conftool/dbconfig/20260429-143924-fceratto.json
[14:40:10] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1007.
[14:40:12] <logmsgbot>	 !log mstyles@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[14:40:20] <logmsgbot>	 !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[14:40:29] <logmsgbot>	 !log mstyles@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[14:40:36] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled/yes; selector: dc=eqiad,cluster=cloudelastic,name=cloudelastic1007.
[14:40:43] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] admin_ng: add extra TLS SANs for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279315 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey)
[14:40:44] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[14:40:54] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[14:40:56] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] wmnet: add new CNAMEs for wikifunctions evaluators [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey)
[14:41:05] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[14:41:11] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[14:41:22] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[14:41:29] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[14:41:33] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[14:41:37] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[14:41:53] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1007.eqiad.wmnet
[14:42:07] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync
[14:42:30] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync
[14:43:05] <wikibugs>	 (03PS1) 10Gehel: wdqs: remove duplicate entry in allow list [puppet] - 10https://gerrit.wikimedia.org/r/1279383 (https://phabricator.wikimedia.org/T417573)
[14:43:38] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2156: after reimage to trixie
[14:43:44] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.3.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279377 (https://phabricator.wikimedia.org/T419511) (owner: 10Santiago Faci)
[14:44:01] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Add bast5005 [puppet] - 10https://gerrit.wikimedia.org/r/1279343 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[14:44:02] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: add extra TLS SANs for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279315 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey)
[14:44:28] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1279383 (https://phabricator.wikimedia.org/T417573) (owner: 10Gehel)
[14:44:37] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Double the number of evaluators from 2 to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271942 (https://phabricator.wikimedia.org/T419933)
[14:45:39] <wikibugs>	 (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279377 (https://phabricator.wikimedia.org/T419511) (owner: 10Santiago Faci)
[14:46:05] <wikibugs>	 (03CR) 10Gehel: [C:03+2] wdqs: remove duplicate entry in allow list [puppet] - 10https://gerrit.wikimedia.org/r/1279383 (https://phabricator.wikimedia.org/T417573) (owner: 10Gehel)
[14:46:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:46:43] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[14:46:51] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[14:47:02] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[14:47:18] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[14:47:43] <wikibugs>	 (03CR) 10SBassett: [C:03+2] miscweb: updated image for security landing page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278570 (https://phabricator.wikimedia.org/T423940) (owner: 10Mstyles)
[14:48:16] <elukey>	 James_F: ingress works nice in staging now!
[14:48:26] <James_F>	 Excellent.
[14:48:40] <elukey>	 `curl https://wikifunctions-javascript-evaluator.k8s-staging.discovery.wmnet:30443/_info -i` for example
[14:49:33] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P91947 and previous config saved to /var/cache/conftool/dbconfig/20260429-144932-fceratto.json
[14:50:05] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'.
[14:50:10] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[14:50:14] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: updated image for security landing page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278570 (https://phabricator.wikimedia.org/T423940) (owner: 10Mstyles)
[14:50:26] <wikibugs>	 (03CR) 10Nikerabbit: [C:03+1] cxserver: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277294 (https://phabricator.wikimedia.org/T423002) (owner: 10KartikMistry)
[14:50:43] <logmsgbot>	 !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply
[14:50:46] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'.
[14:50:58] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[14:51:01] <logmsgbot>	 !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply
[14:51:07] <elukey>	 James_F: synced also in prod, I'll wait for your deployments to test ingress in there too
[14:51:17] <wikibugs>	 (03PS1) 10Gkyziridis: changeprop: Configure RevertRisk multilingual model on changeprop. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279385 (https://phabricator.wikimedia.org/T415892)
[14:51:30] <James_F>	 elukey: We're deployed in staging and prod for the week; want me to re-deploy?
[14:52:06] <logmsgbot>	 !log mstyles@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[14:52:29] <logmsgbot>	 !log mstyles@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[14:52:33] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[14:52:53] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[14:52:58] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[14:53:20] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[14:53:24] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[14:53:32] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[14:53:41] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[14:53:45] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[14:54:53] <wikibugs>	 (03CR) 10Elukey: [C:03+2] wmnet: add new CNAMEs for wikifunctions evaluators [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey)
[14:55:11] <logmsgbot>	 !log elukey@dns1004 START - running authdns-update
[14:56:49] <logmsgbot>	 !log elukey@dns1004 END - running authdns-update
[14:59:17] <elukey>	 James_F: oh nice perfect!
[14:59:41] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T419961)', diff saved to https://phabricator.wikimedia.org/P91950 and previous config saved to /var/cache/conftool/dbconfig/20260429-145940-fceratto.json
[15:00:02] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[15:00:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2154 (T419961)', diff saved to https://phabricator.wikimedia.org/P91951 and previous config saved to /var/cache/conftool/dbconfig/20260429-150010-fceratto.json
[15:00:37] <James_F>	 elukey: If this means we can use a shorter string than 'https://function-evaluator-python-evaluator-tls-service.wikifunctions.svc.cluster.local:4970/1/v1/evaluate/' in values-main-orchestrator.yaml I'll be delighted, but having the auditing of the traffic is enough. :-)
[15:00:56] <logmsgbot>	 !log elukey@dns1004 START - running authdns-update
[15:02:26] <logmsgbot>	 !log elukey@dns1004 END - running authdns-update
[15:04:38] <wikibugs>	 (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279247 (https://phabricator.wikimedia.org/T424624) (owner: 10JavierMonton)
[15:05:32] <logmsgbot>	 !log eevans@cumin1003 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cassandra-dev2001.codfw.wmnet: Renew puppet certificate - eevans@cumin1003
[15:06:37] <wikibugs>	 (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279247 (https://phabricator.wikimedia.org/T424624) (owner: 10JavierMonton)
[15:07:11] <logmsgbot>	 !log eevans@cumin1003 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cassandra-dev2001.codfw.wmnet: Renew puppet certificate - eevans@cumin1003
[15:07:19] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T419961)', diff saved to https://phabricator.wikimedia.org/P91953 and previous config saved to /var/cache/conftool/dbconfig/20260429-150719-fceratto.json
[15:08:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add bast5005 [puppet] - 10https://gerrit.wikimedia.org/r/1279343 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff)
[15:09:36] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.dns.wipe-cache wikifunctions-javascript-evaluator.discovery.wmnet on all recursors
[15:09:40] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikifunctions-javascript-evaluator.discovery.wmnet on all recursors
[15:09:49] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.dns.wipe-cache wikifunctions-python-evaluator.discovery.wmnet on all recursors
[15:09:53] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikifunctions-python-evaluator.discovery.wmnet on all recursors
[15:10:07] <wikibugs>	 (03PS2) 10CDanis: mwscript-k8s: add --output-file flag [puppet] - 10https://gerrit.wikimedia.org/r/1273905
[15:10:08] <wikibugs>	 (03PS3) 10CDanis: deployment_server: add kubectl wait-job plugin [puppet] - 10https://gerrit.wikimedia.org/r/1273926
[15:10:08] <wikibugs>	 (03PS10) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948)
[15:11:21] <elukey>	 James_F: at the moment it becomes wikifunctions-javascript-evaluator.discovery.wmnet:30443, but we won't call that directly; we'll call its mesh equivalent (so once configured, localhost:port). Even shorter :D
[15:11:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:11:47] <logmsgbot>	 !log eevans@cumin1003 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cassandra-dev2001.codfw.wmnet: Renew puppet certificate - eevans@cumin1003
[15:11:57] <James_F>	 elukey: Excellent!
[15:12:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mwscript-k8s: add --output-file flag [puppet] - 10https://gerrit.wikimedia.org/r/1273905 (owner: 10CDanis)
[15:12:38] <wikibugs>	 (03Abandoned) 10Daniel Kinzler: rest gateways: EXPERIMENT: set rate limit by referer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276404 (owner: 10Daniel Kinzler)
[15:12:48] <wikibugs>	 (03PS1) 10Gkyziridis: ml-services: Deploy the latest version of rr-multilingual model server on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892)
[15:12:55] <elukey>	 James_F: one question for you - I'll configure envoy (the mesh sidecar on the orchestrator pod) to be able to call the evaluators, but I'll need some details like the max timeout allowed, etc.
[15:13:04] <elukey>	 think about it and lemme know :)
[15:13:13] <elukey>	 even tomorrow
[15:13:26] <logmsgbot>	 !log eevans@cumin1003 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for cassandra-dev2001.codfw.wmnet: Renew puppet certificate - eevans@cumin1003
[15:13:27] <James_F>	 The orchestrator->evaluator network timeout is currently configured at 10s. Is that sufficient for you?
[15:13:32] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1279390 (https://phabricator.wikimedia.org/T424864)
[15:13:42] <wikibugs>	 (03Abandoned) 10Daniel Kinzler: rest_gateway: Rename the user_class descriptor key to ratelimit_class. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203786 (https://phabricator.wikimedia.org/T409155) (owner: 10Daniel Kinzler)
[15:15:31] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[15:15:45] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[15:16:06] <wikibugs>	 (03CR) 10Gkyziridis: "Should I also add more cpu/memory at the revertrisk-multilingual-pre-save ?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[15:16:19] <wikibugs>	 (03PS1) 10JMeybohm: site.pp: Fix names of repurposed tools-k8s hosts [puppet] - 10https://gerrit.wikimedia.org/r/1279391 (https://phabricator.wikimedia.org/T423719)
[15:16:53] <wikibugs>	 (03CR) 10Eevans: [C:03+1] swift: remove 2 drained nodes from rings, set for new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1279372 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon)
[15:17:27] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P91955 and previous config saved to /var/cache/conftool/dbconfig/20260429-151727-fceratto.json
[15:18:52] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] site.pp: Fix names of repurposed tools-k8s hosts [puppet] - 10https://gerrit.wikimedia.org/r/1279391 (https://phabricator.wikimedia.org/T423719) (owner: 10JMeybohm)
[15:19:21] <wikibugs>	 (03CR) 10AikoChou: changeprop: Configure RevertRisk multilingual model on changeprop. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279385 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[15:20:05] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1157: after reimage to trixie
[15:21:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:22:40] <wikibugs>	 (03CR) 10MVernon: [C:03+2] swift: remove 2 drained nodes from rings, set for new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1279372 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon)
[15:24:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2066
[15:27:35] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P91957 and previous config saved to /var/cache/conftool/dbconfig/20260429-152735-fceratto.json
[15:28:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11872131 (10JMeybohm) Thanks for noticing. I've fixed site.pp
[15:29:03] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2156: after reimage to trixie
[15:32:58] <wikibugs>	 (03CR) 10AikoChou: "No, the *-pre-save is a separate isvc with a different endpoint. page_change events won’t go to it, so there’s no need to change anything." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[15:33:10] <wikibugs>	 (03PS1) 10CDobbins: wikimedia.org: Add TXT verification for Claude [dns] - 10https://gerrit.wikimedia.org/r/1279402 (https://phabricator.wikimedia.org/T424785)
[15:37:45] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T419961)', diff saved to https://phabricator.wikimedia.org/P91960 and previous config saved to /var/cache/conftool/dbconfig/20260429-153743-fceratto.json
[15:38:06] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[15:38:15] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2161 (T419961)', diff saved to https://phabricator.wikimedia.org/P91961 and previous config saved to /var/cache/conftool/dbconfig/20260429-153814-fceratto.json
[15:40:40] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2067
[15:45:19] <wikibugs>	 (03CR) 10RLazarus: "Oh, yep, thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1278792 (owner: 10RLazarus)
[15:45:23] <wikibugs>	 (03Abandoned) 10RLazarus: interfaces: Update playbook link [alerts] - 10https://gerrit.wikimedia.org/r/1278792 (owner: 10RLazarus)
[15:45:26] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T419961)', diff saved to https://phabricator.wikimedia.org/P91962 and previous config saved to /var/cache/conftool/dbconfig/20260429-154525-fceratto.json
[15:48:03] <wikibugs>	 (03PS2) 10Elukey: _cookbook: fix parallel test failures with pytest-xdist (-n auto) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475)
[15:48:09] <wikibugs>	 (03CR) 10Elukey: _cookbook: fix parallel test failures with pytest-xdist (-n auto) (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey)
[15:49:50] <wikibugs>	 (03PS1) 10Atsuko: deployment_server: define more opensearch configs [puppet] - 10https://gerrit.wikimedia.org/r/1279410 (https://phabricator.wikimedia.org/T424248)
[15:50:48] <wikibugs>	 (03CR) 10Nikerabbit: [C:03+1] Don't load general modules  as style modules [extensions/Translate] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279079 (https://phabricator.wikimedia.org/T424618) (owner: 10Abijeet Patro)
[15:51:00] <wikibugs>	 (03PS2) 10Atsuko: deployment_server: define more opensearch configs [puppet] - 10https://gerrit.wikimedia.org/r/1279410 (https://phabricator.wikimedia.org/T424248)
[15:52:07] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279410 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[15:52:38] <wikibugs>	 (03CR) 10Bking: [C:03+1] deployment_server: define more opensearch configs [puppet] - 10https://gerrit.wikimedia.org/r/1279410 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[15:52:44] <wikibugs>	 (03CR) 10Atsuko: [C:03+2] deployment_server: define more opensearch configs [puppet] - 10https://gerrit.wikimedia.org/r/1279410 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[15:55:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P91963 and previous config saved to /var/cache/conftool/dbconfig/20260429-155533-fceratto.json
[15:56:10] <atsukoito>	 hi Emperor, there are unapplied changes on puppet, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1279372
[15:56:24] <atsukoito>	 can I apply it?
[15:58:49] <wikibugs>	 (03PS1) 10Elukey: wmcs: add the pki discovery2026 intermediate public cert [puppet] - 10https://gerrit.wikimedia.org/r/1279413 (https://phabricator.wikimedia.org/T424549)
[16:01:02] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] sre.hosts: fix ipmi() calls after spicerack 12.5.0 [cookbooks] - 10https://gerrit.wikimedia.org/r/1279379 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey)
[16:05:42] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P91964 and previous config saved to /var/cache/conftool/dbconfig/20260429-160541-fceratto.json
[16:09:20] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:09:42] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11872316 (10Ahoelzl)
[16:13:17] <logmsgbot>	 mvernon@cumin2002 convert-disks (PID 462469) is awaiting input
[16:13:54] <wikibugs>	 (03CR) 10Elukey: [C:03+2] wmcs: add the pki discovery2026 intermediate public cert [puppet] - 10https://gerrit.wikimedia.org/r/1279413 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey)
[16:15:19] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2066
[16:15:50] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T419961)', diff saved to https://phabricator.wikimedia.org/P91965 and previous config saved to /var/cache/conftool/dbconfig/20260429-161549-fceratto.json
[16:15:55] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS bullseye
[16:16:08] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11872347 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2066.codfw.wm...
[16:16:11] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[16:16:15] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2066
[16:16:20] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2163 (T419961)', diff saved to https://phabricator.wikimedia.org/P91966 and previous config saved to /var/cache/conftool/dbconfig/20260429-161619-fceratto.json
[16:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:16:25] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[16:16:31] <wikibugs>	 (03PS1) 10Elukey: wmcs/cloud: add the discovery2026 pki intermediate config [puppet] - 10https://gerrit.wikimedia.org/r/1279417 (https://phabricator.wikimedia.org/T424549)
[16:17:13] <wikibugs>	 (03CR) 10Elukey: [C:03+2] wmcs/cloud: add the discovery2026 pki intermediate config [puppet] - 10https://gerrit.wikimedia.org/r/1279417 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey)
[16:19:24] <wikibugs>	 (03PS4) 10Tiziano Fogli: rsyslog: forward thanos-query-frontend logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986)
[16:19:24] <wikibugs>	 (03PS3) 10Tiziano Fogli: logstash/filter: increase sockets-timeout for unit tests [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986)
[16:19:24] <wikibugs>	 (03PS7) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986)
[16:20:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11872383 (10elukey) Next steps: - Deploy the new spicerack release and https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1279379 - Add a workaro...
[16:21:35] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2066 - mvernon@cumin2002"
[16:21:40] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2066 - mvernon@cumin2002"
[16:21:41] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:21:41] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2066.codfw.wmnet 209.0.192.10.in-addr.arpa 9.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:21:45] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2066.codfw.wmnet 209.0.192.10.in-addr.arpa 9.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:21:46] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2066
[16:21:58] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2066
[16:21:58] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2066
[16:23:38] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T419961)', diff saved to https://phabricator.wikimedia.org/P91967 and previous config saved to /var/cache/conftool/dbconfig/20260429-162337-fceratto.json
[16:28:35] <wikibugs>	 (03CR) 10VadymTS1: [C:03+1] enwikiversity: Add some user rights to the curator user group on English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1)
[16:28:49] <logmsgbot>	 mvernon@cumin2002 convert-disks (PID 473740) is awaiting input
[16:29:19] <wikibugs>	 (03PS1) 10Atsuko: dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248)
[16:33:46] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P91968 and previous config saved to /var/cache/conftool/dbconfig/20260429-163345-fceratto.json
[16:34:20] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:36:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[16:36:52] <wikibugs>	 (03Abandoned) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott)
[16:38:06] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2067
[16:38:31] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS bullseye
[16:38:41] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11872467 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye
[16:38:55] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2067
[16:39:11] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[16:40:15] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[16:43:07] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2067 - mvernon@cumin2002"
[16:43:13] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2067 - mvernon@cumin2002"
[16:43:13] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:43:14] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2067.codfw.wmnet 160.16.192.10.in-addr.arpa 0.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:43:17] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2067.codfw.wmnet 160.16.192.10.in-addr.arpa 0.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:43:18] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2067
[16:43:40] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2067
[16:43:41] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2067
[16:43:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P91969 and previous config saved to /var/cache/conftool/dbconfig/20260429-164353-fceratto.json
[16:44:28] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage
[16:45:32] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11872473 (10MatthewVernon)
[16:51:16] <wikibugs>	 (03PS2) 10Gkyziridis: changeprop: Configure RevertRisk multilingual model on changeprop. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279385 (https://phabricator.wikimedia.org/T415892)
[16:52:04] <wikibugs>	 (03CR) 10Novem Linguae: "Hmm. The dblist securepollglobal contains officewiki but doesn't contain arbcom_zhwiki. Maybe I should revert to PS1." [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) (owner: 10Novem Linguae)
[16:52:09] <wikibugs>	 (03PS3) 10Gkyziridis: changeprop: Configure RevertRisk multilingual model on changeprop. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279385 (https://phabricator.wikimedia.org/T415892)
[16:54:02] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T419961)', diff saved to https://phabricator.wikimedia.org/P91970 and previous config saved to /var/cache/conftool/dbconfig/20260429-165401-fceratto.json
[16:54:24] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[16:54:25] <wikibugs>	 (03PS1) 10AKhatun: stream: move mw-page-html-feature-counts-change-enrich to v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279429 (https://phabricator.wikimedia.org/T424624)
[16:54:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2164 (T419961)', diff saved to https://phabricator.wikimedia.org/P91971 and previous config saved to /var/cache/conftool/dbconfig/20260429-165431-fceratto.json
[16:54:45] <wikibugs>	 (03PS1) 10VadymTS1: enwikiversity: Enable the abuse filter block action on English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053)
[16:58:03] <wikibugs>	 (03CR) 10Dragoniez: [C:03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1)
[16:58:22] <wikibugs>	 (03PS2) 10Gkyziridis: ml-services: Deploy the latest version of rr-multilingual model server on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892)
[16:59:12] <wikibugs>	 (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1)
[17:00:05] <jouncebot>	 jasmine_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1700).
[17:01:53] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[17:01:58] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T419961)', diff saved to https://phabricator.wikimedia.org/P91972 and previous config saved to /var/cache/conftool/dbconfig/20260429-170157-fceratto.json
[17:03:28] <wikibugs>	 (03CR) 10VadymTS1: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1)
[17:03:45] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2066.codfw.wmnet with OS bullseye
[17:03:46] <wikibugs>	 (03CR) 10VadymTS1: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1)
[17:04:00] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11872566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2066.codfw.wmnet with OS bullseye compl...
[17:08:59] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage
[17:09:00] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[17:12:06] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P91973 and previous config saved to /var/cache/conftool/dbconfig/20260429-171205-fceratto.json
[17:13:24] <wikibugs>	 (03PS1) 10Sbisson: Load TestKitchen earlier [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279431 (https://phabricator.wikimedia.org/T424876)
[17:14:30] <wikibugs>	 (03PS1) 10Andrew Bogott: magnum: include helm package for magnum-cluster-api driver [puppet] - 10https://gerrit.wikimedia.org/r/1279432
[17:14:30] <wikibugs>	 (03PS1) 10Andrew Bogott: magnum-cluster-api: update versions for worker cluster [puppet] - 10https://gerrit.wikimedia.org/r/1279433
[17:15:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] magnum: include helm package for magnum-cluster-api driver [puppet] - 10https://gerrit.wikimedia.org/r/1279432 (owner: 10Andrew Bogott)
[17:16:35] <wikibugs>	 (03CR) 10Esanders: [C:03+1] Enable mobile editor abandonment survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277569 (https://phabricator.wikimedia.org/T423923) (owner: 10DLynch)
[17:16:58] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] magnum-cluster-api: update versions for worker cluster [puppet] - 10https://gerrit.wikimedia.org/r/1279433 (owner: 10Andrew Bogott)
[17:18:32] <wikibugs>	 (03CR) 10ArielGlenn: [C:03+1] "This is a small behaviour change which we should probably watch once this is live. Anyways, good catch between you and Bartosz!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 (owner: 10Daniel Kinzler)
[17:19:34] <wikibugs>	 (03CR) 10Phuedx: [C:03+1] Load TestKitchen earlier [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279431 (https://phabricator.wikimedia.org/T424876) (owner: 10Sbisson)
[17:22:15] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P91975 and previous config saved to /var/cache/conftool/dbconfig/20260429-172214-fceratto.json
[17:27:12] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2067.codfw.wmnet with OS bullseye
[17:27:20] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11872731 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye compl...
[17:30:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279431 (https://phabricator.wikimedia.org/T424876) (owner: 10Sbisson)
[17:32:23] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T419961)', diff saved to https://phabricator.wikimedia.org/P91976 and previous config saved to /var/cache/conftool/dbconfig/20260429-173222-fceratto.json
[17:32:45] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[17:32:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T419961)', diff saved to https://phabricator.wikimedia.org/P91977 and previous config saved to /var/cache/conftool/dbconfig/20260429-173253-fceratto.json
[17:36:09] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:03+1] rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 (owner: 10Daniel Kinzler)
[17:40:17] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T419961)', diff saved to https://phabricator.wikimedia.org/P91978 and previous config saved to /var/cache/conftool/dbconfig/20260429-174016-fceratto.json
[17:50:25] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P91979 and previous config saved to /var/cache/conftool/dbconfig/20260429-175024-fceratto.json
[17:58:25] <wikibugs>	 (03PS1) 10CDanis: turnilo: webrequest: add ja4h sub-component dimensions [puppet] - 10https://gerrit.wikimedia.org/r/1279439
[18:00:04] <jouncebot>	 jeena and dduvall: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1800). Please do the needful.
[18:00:33] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P91980 and previous config saved to /var/cache/conftool/dbconfig/20260429-180032-fceratto.json
[18:03:25] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:05:20] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 3d 19h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[18:07:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: verify cables - https://phabricator.wikimedia.org/T424601#11872843 (10VRiley-WMF) 05Open→03Resolved https://netbox.wikimedia.org/dcim/cables/4533/ This cable is connected; the other two should have cables for them now. I did have to add a dummy console for the test server.
[18:09:22] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279442 (https://phabricator.wikimedia.org/T423877)
[18:09:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279442 (https://phabricator.wikimedia.org/T423877) (owner: 10TrainBranchBot)
[18:10:21] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279442 (https://phabricator.wikimedia.org/T423877) (owner: 10TrainBranchBot)
[18:10:41] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T419961)', diff saved to https://phabricator.wikimedia.org/P91981 and previous config saved to /var/cache/conftool/dbconfig/20260429-181041-fceratto.json
[18:10:57] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] stream: move mw-page-html-feature-counts-change-enrich to v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279429 (https://phabricator.wikimedia.org/T424624) (owner: 10AKhatun)
[18:11:03] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[18:11:12] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2167 (T419961)', diff saved to https://phabricator.wikimedia.org/P91982 and previous config saved to /var/cache/conftool/dbconfig/20260429-181111-fceratto.json
[18:14:37] <wikibugs>	 (03PS1) 10Andrew Bogott: cluster-api worker: use latest kubeadm, set up k8s env before using helm [puppet] - 10https://gerrit.wikimedia.org/r/1279445
[18:15:45] <wikibugs>	 (03PS2) 10Andrew Bogott: cluster-api worker: use latest kubeadm, set up k8s env before using helm [puppet] - 10https://gerrit.wikimedia.org/r/1279445
[18:15:59] <wikibugs>	 (03CR) 10RLazarus: turnilo: webrequest: add ja4h sub-component dimensions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1279439 (owner: 10CDanis)
[18:16:03] <logmsgbot>	 !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.26  refs T423877
[18:16:08] <stashbot>	 T423877: 1.46.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T423877
[18:16:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cluster-api worker: use latest kubeadm, set up k8s env before using helm [puppet] - 10https://gerrit.wikimedia.org/r/1279445 (owner: 10Andrew Bogott)
[18:18:24] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:18:29] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T419961)', diff saved to https://phabricator.wikimedia.org/P91983 and previous config saved to /var/cache/conftool/dbconfig/20260429-181829-fceratto.json
[18:22:40] <wikibugs>	 (03PS1) 10CDanis: turnilo: webrequest: add ja4h sub-component dimensions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279447
[18:28:10] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:28:38] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P91984 and previous config saved to /var/cache/conftool/dbconfig/20260429-182837-fceratto.json
[18:30:58] <wikibugs>	 (03CR) 10Brouberol: "My bad, this should have been deleted from the puppet repo. The canonical configuration now lives in https://gerrit.wikimedia.org/r/plugin" [puppet] - 10https://gerrit.wikimedia.org/r/1279439 (owner: 10CDanis)
[18:38:46] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P91985 and previous config saved to /var/cache/conftool/dbconfig/20260429-183845-fceratto.json
[18:39:34] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277569 (https://phabricator.wikimedia.org/T423923) (owner: 10DLynch)
[18:42:09] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:42:24] <wikibugs>	 (03PS1) 10Medelius: Abandon editor survey: UI updates [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279448 (https://phabricator.wikimedia.org/T422931)
[18:42:41] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279448 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius)
[18:44:09] <wikibugs>	 (03CR) 10AKhatun: alerts: mw-page-html-feature-counts-change-enrich (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) (owner: 10AKhatun)
[18:48:56] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T419961)', diff saved to https://phabricator.wikimedia.org/P91986 and previous config saved to /var/cache/conftool/dbconfig/20260429-184854-fceratto.json
[18:49:18] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[18:49:25] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2181 (T419961)', diff saved to https://phabricator.wikimedia.org/P91987 and previous config saved to /var/cache/conftool/dbconfig/20260429-184925-fceratto.json
[18:49:37] <wikibugs>	 (03CR) 10Dragoniez: [C:03+1] "Just so you know, you need to backport this because this repo isn’t deployed in the weekly deployment train. See https://wikitech.wikimedi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1)
[18:49:51] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11873039 (10wiki_willy) Hey @elukey - do you have the Supermicro case number for this one?  Thanks, Willy
[18:50:06] <wikibugs>	 (03PS2) 10Anzx: enwikiquote: enable UseSandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279404 (https://phabricator.wikimedia.org/T424863)
[18:50:13] <wikibugs>	 (03PS2) 10Anzx: arbcom_enwiki: update logo, icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279422 (https://phabricator.wikimedia.org/T424555)
[18:50:18] <wikibugs>	 (03PS3) 10Anzx: cswiki: lift IP cap for edithathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279406 (https://phabricator.wikimedia.org/T424843)
[18:50:31] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279406 (https://phabricator.wikimedia.org/T424843) (owner: 10Anzx)
[18:50:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279422 (https://phabricator.wikimedia.org/T424555) (owner: 10Anzx)
[18:51:05] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279404 (https://phabricator.wikimedia.org/T424863) (owner: 10Anzx)
[18:54:02] <wikibugs>	 (03PS1) 10C. Scott Ananian: Deploy Parsoid Read Views to 6 language converter wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279450 (https://phabricator.wikimedia.org/T423785)
[18:56:24] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11873042 (10Nux) I went through this [[ https://global-search.toolforge.org/?q=%5C%2Fthumb%5C%2F%28%5B%5E%5C%2F%5D%2B%3F%5C%2F%29%7B3...
[18:56:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T419961)', diff saved to https://phabricator.wikimedia.org/P91988 and previous config saved to /var/cache/conftool/dbconfig/20260429-185634-fceratto.json
[18:56:35] <wikibugs>	 (03PS1) 10C. Scott Ananian: Deploy Parsoid Read Views to 12 small wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279451 (https://phabricator.wikimedia.org/T424590)
[18:57:29] <wikibugs>	 (03Abandoned) 10C. Scott Ananian: Deploy Parsoid Read Views to 12 small wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279451 (https://phabricator.wikimedia.org/T424590) (owner: 10C. Scott Ananian)
[18:57:53] <wikibugs>	 (03CR) 10C. Scott Ananian: [C:03+1] Deploy PRV to 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) (owner: 10Arlolra)
[18:57:54] <wikibugs>	 (03PS2) 10AKhatun: alerts: mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224)
[18:58:26] <wikibugs>	 (03PS3) 10AKhatun: alerts: mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224)
[18:58:38] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873046 (10Papaul)
[18:59:08] <wikibugs>	 (03PS3) 10Arlolra: Deploy PRV to 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590)
[19:00:46] <wikibugs>	 (03PS2) 10C. Scott Ananian: Deploy Parsoid Read Views to 6 language converter wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279450 (https://phabricator.wikimedia.org/T423785)
[19:02:13] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] turnilo: webrequest: add ja4h sub-component dimensions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279447 (owner: 10CDanis)
[19:06:42] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P91989 and previous config saved to /var/cache/conftool/dbconfig/20260429-190641-fceratto.json
[19:07:22] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be209[7,8] - https://phabricator.wikimedia.org/T424892 (10RobH) 03NEW
[19:07:38] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be209[7,8] - https://phabricator.wikimedia.org/T424892#11873092 (10RobH)
[19:09:01] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be209[7,8] - https://phabricator.wikimedia.org/T424892#11873096 (10RobH) a:03MatthewVernon Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and...
[19:11:29] <wikibugs>	 (03PS1) 10C. Scott Ananian: Enable Parsoid Read Views for 20% of enwiki mobile web traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279452 (https://phabricator.wikimedia.org/T424880)
[19:11:31] <wikibugs>	 (03PS1) 10C. Scott Ananian: Increase Parsoid Read Views to 60% of enwiki mobile web traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279453 (https://phabricator.wikimedia.org/T424880)
[19:11:33] <wikibugs>	 (03PS1) 10C. Scott Ananian: Increase Parsoid Read Views to 100% of enwiki mobile web traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279454 (https://phabricator.wikimedia.org/T424880)
[19:12:54] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) (owner: 10Arlolra)
[19:13:09] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279450 (https://phabricator.wikimedia.org/T423785) (owner: 10C. Scott Ananian)
[19:13:42] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279452 (https://phabricator.wikimedia.org/T424880) (owner: 10C. Scott Ananian)
[19:15:31] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be1098, ms-be1099, ms-be1100 - https://phabricator.wikimedia.org/T424895 (10RobH) 03NEW
[19:15:53] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be1098, ms-be1099, ms-be1100 - https://phabricator.wikimedia.org/T424895#11873155 (10RobH)
[19:16:24] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be1098, ms-be1099, ms-be1100 - https://phabricator.wikimedia.org/T424895#11873159 (10RobH) a:03MatthewVernon Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/D...
[19:16:51] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P91990 and previous config saved to /var/cache/conftool/dbconfig/20260429-191650-fceratto.json
[19:17:02] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1)
[19:20:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[19:22:05] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] alerts: mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) (owner: 10AKhatun)
[19:24:35] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns infor for asw1-23-ulsfo - pt1979@cumin2002"
[19:25:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns infor for asw1-23-ulsfo - pt1979@cumin2002"
[19:25:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:25:17] <wikibugs>	 (03CR) 10CDanis: turnilo: webrequest: add ja4h sub-component dimensions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279447 (owner: 10CDanis)
[19:26:59] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T419961)', diff saved to https://phabricator.wikimedia.org/P91992 and previous config saved to /var/cache/conftool/dbconfig/20260429-192658-fceratto.json
[19:27:19] <wikibugs>	 (03Abandoned) 10CDanis: turnilo: webrequest: add ja4h sub-component dimensions [puppet] - 10https://gerrit.wikimedia.org/r/1279439 (owner: 10CDanis)
[19:27:21] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance
[19:27:31] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2195 (T419961)', diff saved to https://phabricator.wikimedia.org/P91993 and previous config saved to /var/cache/conftool/dbconfig/20260429-192729-fceratto.json
[19:28:34] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873200 (10Papaul)
[19:28:51] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1)
[19:29:21] <wikibugs>	 (03PS2) 10Atsuko: dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248)
[19:30:57] <wikibugs>	 (03PS1) 10Dzahn: zuul: remove zuul-nodepool config, user, stop service [puppet] - 10https://gerrit.wikimedia.org/r/1279461 (https://phabricator.wikimedia.org/T424879)
[19:32:52] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1279461/8490/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1279461 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn)
[19:34:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T419961)', diff saved to https://phabricator.wikimedia.org/P91994 and previous config saved to /var/cache/conftool/dbconfig/20260429-193431-fceratto.json
[19:38:39] <wikibugs>	 (03PS1) 10CDanis: haproxy: webrequest: capture ratelimiting headers [puppet] - 10https://gerrit.wikimedia.org/r/1279465 (https://phabricator.wikimedia.org/T419736)
[19:38:48] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279465 (https://phabricator.wikimedia.org/T419736) (owner: 10CDanis)
[19:38:49] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873230 (10Papaul)
[19:40:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 101F616EMO)
[19:41:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:41:40] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873235 (10Papaul)
[19:44:39] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P91995 and previous config saved to /var/cache/conftool/dbconfig/20260429-194439-fceratto.json
[19:44:59] <wikibugs>	 (03PS1) 10Dzahn: zuul: create profile for new zuul-builder replacing nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879)
[19:45:27] <wikibugs>	 (03PS2) 10Dzahn: zuul: create profile for new zuul-builder replacing nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879)
[19:46:03] <wikibugs>	 (03CR) 10Bking: [C:03+1] dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko)
[19:46:11] <wikibugs>	 (03CR) 10VadymTS1: [C:03+1] ukwiki: Remove the patroller user group and adjust various user rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274928 (https://phabricator.wikimedia.org/T423461) (owner: 10Codename Noreste)
[19:46:50] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "missing the config template, equivalent to modules/profile/templates/zuul/nodepool.conf.erb and nodepool.yaml.erb" [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn)
[19:47:53] <wikibugs>	 (03PS2) 10Dzahn: zuul: remove zuul-nodepool config, user, stop service [puppet] - 10https://gerrit.wikimedia.org/r/1279461 (https://phabricator.wikimedia.org/T424879)
[19:48:33] <wikibugs>	 (03CR) 10Dzahn: "should we just do this now before even upgrading? should it wait until after builder is installed? does it matter?" [puppet] - 10https://gerrit.wikimedia.org/r/1279461 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn)
[19:54:48] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P91996 and previous config saved to /var/cache/conftool/dbconfig/20260429-195447-fceratto.json
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T2000).
[20:00:05] <jouncebot>	 phuedx, cmede, anzx, cscott, VadymTS1, and ZhaoFJx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <anzx>	 o/
[20:00:09] <ZhaoFJx>	 o/
[20:00:10] <cmede>	 o/
[20:00:11] <phuedx>	 o/
[20:00:33] <wikibugs>	 (03PS1) 10CDanis: base::kernel: ban algif_aead [puppet] - 10https://gerrit.wikimedia.org/r/1279473
[20:01:51] <wikibugs>	 (03PS1) 10Andrew Bogott: setup_capi.sh.erb: don't manually install certmanager [puppet] - 10https://gerrit.wikimedia.org/r/1279474
[20:02:36] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] setup_capi.sh.erb: don't manually install certmanager [puppet] - 10https://gerrit.wikimedia.org/r/1279474 (owner: 10Andrew Bogott)
[20:03:45] <dancy>	 I can help with deployments.
[20:04:14] <dancy>	 anzx: Can your changes go out all at once?
[20:04:20] <anzx>	 ok
[20:04:34] <dancy>	 rephrasing: Is it safe for yours to go out all at once
[20:04:53] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] base::kernel: ban algif_aead [puppet] - 10https://gerrit.wikimedia.org/r/1279473 (owner: 10CDanis)
[20:04:55] <anzx>	 sure, no problem if it syncs all at once
[20:04:56] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T419961)', diff saved to https://phabricator.wikimedia.org/P91997 and previous config saved to /var/cache/conftool/dbconfig/20260429-200455-fceratto.json
[20:05:03] <wikibugs>	 (03CR) 10CDanis: [C:03+2] base::kernel: ban algif_aead [puppet] - 10https://gerrit.wikimedia.org/r/1279473 (owner: 10CDanis)
[20:05:21] <wikibugs>	 (03CR) 10CDanis: [V:03+1 C:03+2] "100.0% (2431/2431) of nodes failed to execute command #1: 'lsmod | grep algif'" [puppet] - 10https://gerrit.wikimedia.org/r/1279473 (owner: 10CDanis)
[20:05:25] <dancy>	 Alright
[20:05:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279406 (https://phabricator.wikimedia.org/T424843) (owner: 10Anzx)
[20:05:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279422 (https://phabricator.wikimedia.org/T424555) (owner: 10Anzx)
[20:05:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279404 (https://phabricator.wikimedia.org/T424863) (owner: 10Anzx)
[20:08:54] <wikibugs>	 (03PS4) 10Tiziano Fogli: logstash/filter: increase sockets-timeout for unit tests [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986)
[20:09:55] <wikibugs>	 (03Merged) 10jenkins-bot: cswiki: lift IP cap for edithathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279406 (https://phabricator.wikimedia.org/T424843) (owner: 10Anzx)
[20:09:59] <wikibugs>	 (03Merged) 10jenkins-bot: arbcom_enwiki: update logo, icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279422 (https://phabricator.wikimedia.org/T424555) (owner: 10Anzx)
[20:10:02] <wikibugs>	 (03Merged) 10jenkins-bot: enwikiquote: enable UseSandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279404 (https://phabricator.wikimedia.org/T424863) (owner: 10Anzx)
[20:10:32] <logmsgbot>	 !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1279406|cswiki: lift IP cap for edithathon (T424843)]], [[gerrit:1279422|arbcom_enwiki: update logo, icon (T424555)]], [[gerrit:1279404|enwikiquote: enable UseSandboxLink (T424863)]]
[20:10:39] <stashbot>	 T424843: Lift IP cap on 2026-05-14 for an editathon - cs.wikipedia - https://phabricator.wikimedia.org/T424843
[20:10:40] <stashbot>	 T424555: Requesting logo change for arbcom-en.wikipedia.org - https://phabricator.wikimedia.org/T424555
[20:10:40] <stashbot>	 T424863: Enable the SandboxLink extension on English Wikiquote - https://phabricator.wikimedia.org/T424863
[20:12:25] <logmsgbot>	 !log dancy@deploy1003 anzx, dancy: Backport for [[gerrit:1279406|cswiki: lift IP cap for edithathon (T424843)]], [[gerrit:1279422|arbcom_enwiki: update logo, icon (T424555)]], [[gerrit:1279404|enwikiquote: enable UseSandboxLink (T424863)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:12:30] <anzx>	 checking
[20:13:11] <wikibugs>	 (03PS1) 10Phuedx: JS SDK: Remove compat deprecation warnings [extensions/TestKitchen] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279476
[20:13:28] <anzx>	 dancy: looks good, ok to sync 
[20:13:30] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/TestKitchen] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279476 (owner: 10Phuedx)
[20:13:37] <dancy>	 OK
[20:13:40] <logmsgbot>	 !log dancy@deploy1003 anzx, dancy: Continuing with deployment
[20:13:59] <dancy>	 ZhaoFJx: You'll be next
[20:14:12] <rzl>	 ominous
[20:14:16] <dancy>	 haha
[20:14:21] <cmede>	 lol
[20:14:25] <dancy>	 Deployment of doom
[20:14:40] <jeena>	 😨
[20:15:26] <ZhaoFJx>	 dancy thanks
[20:17:31] <logmsgbot>	 !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279406|cswiki: lift IP cap for edithathon (T424843)]], [[gerrit:1279422|arbcom_enwiki: update logo, icon (T424555)]], [[gerrit:1279404|enwikiquote: enable UseSandboxLink (T424863)]] (duration: 06m 59s)
[20:17:41] <stashbot>	 T424843: Lift IP cap on 2026-05-14 for an editathon - cs.wikipedia - https://phabricator.wikimedia.org/T424843
[20:17:41] <stashbot>	 T424555: Requesting logo change for arbcom-en.wikipedia.org - https://phabricator.wikimedia.org/T424555
[20:17:42] <stashbot>	 T424863: Enable the SandboxLink extension on English Wikiquote - https://phabricator.wikimedia.org/T424863
[20:17:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 101F616EMO)
[20:18:36] <wikibugs>	 (03PS5) 10Cwhite: logstash CI: increase sockets-timeout for e2e testing [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli)
[20:19:34] <wikibugs>	 (03Merged) 10jenkins-bot: arbcom_zhwiki: Enable SecurePoll without PII rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 101F616EMO)
[20:19:56] <wikibugs>	 (03PS1) 10VadymTS1: nlwiki: Modify autoconfirmed requirements for nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279477 (https://phabricator.wikimedia.org/T424898)
[20:19:58] <logmsgbot>	 !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1265959|arbcom_zhwiki: Enable SecurePoll without PII rights (T419309)]]
[20:20:02] <stashbot>	 T419309: Enable SecurePoll extension on arbcom_zh - https://phabricator.wikimedia.org/T419309
[20:20:03] <anzx>	 dancy: thanks for deploying, please run the following to purge logos https://www.irccloud.com/pastebin/VvNfoWen/
[20:20:21] <dancy>	 ok, stand by
[20:20:49] <dancy>	 Done.
[20:20:59] <anzx>	 thank you 
[20:21:21] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash CI: increase sockets-timeout for e2e testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli)
[20:21:51] <logmsgbot>	 !log dancy@deploy1003 1f616emo, dancy: Backport for [[gerrit:1265959|arbcom_zhwiki: Enable SecurePoll without PII rights (T419309)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:22:05] <ZhaoFJx>	 checking
[20:23:18] <ZhaoFJx>	 dancy checked!
[20:23:21] <ZhaoFJx>	 works great
[20:23:25] <dancy>	 OK. Moving on
[20:23:29] <logmsgbot>	 !log dancy@deploy1003 1f616emo, dancy: Continuing with deployment
[20:23:53] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+2] Abandon editor survey: UI updates [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279448 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius)
[20:24:15] <dancy>	 cmede: You're next in line.
[20:24:21] <cmede>	 thank you, less ominous
[20:25:23] <wikibugs>	 (03Merged) 10jenkins-bot: Abandon editor survey: UI updates [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279448 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius)
[20:27:10] <ZhaoFJx>	 lol
[20:27:14] <logmsgbot>	 !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265959|arbcom_zhwiki: Enable SecurePoll without PII rights (T419309)]] (duration: 07m 16s)
[20:27:19] <stashbot>	 T419309: Enable SecurePoll extension on arbcom_zh - https://phabricator.wikimedia.org/T419309
[20:27:23] <dancy>	 cmede: OK for your two changes to go out together?
[20:27:28] <cmede>	 yep!
[20:28:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277569 (https://phabricator.wikimedia.org/T423923) (owner: 10DLynch)
[20:28:17] <ZhaoFJx>	 works great without mwdebug on
[20:28:21] <ZhaoFJx>	 dancy thanks a lot
[20:28:33] <dancy>	 ZhaoFJx: You're welcome
[20:29:00] <cscott>	 o/
[20:29:06] <cscott>	 i'm late, sorry. lost track of time.
[20:29:26] <wikibugs>	 (03PS1) 10Dzahn: admin: extend expiry_date for sarmbruster by 1 month [puppet] - 10https://gerrit.wikimedia.org/r/1279482 (https://phabricator.wikimedia.org/T424402)
[20:29:29] <dancy>	 No problem
[20:32:23] <wikibugs>	 (03Merged) 10jenkins-bot: Enable mobile editor abandonment survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277569 (https://phabricator.wikimedia.org/T423923) (owner: 10DLynch)
[20:32:30] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Extend wmde/nda LDAP access for Sarmbruster - https://phabricator.wikimedia.org/T424402#11873384 (10Dzahn) 05Open→03In progress p:05Triage→03Medium
[20:32:50] <logmsgbot>	 !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1277569|Enable mobile editor abandonment survey on enwiki (T423923)]], [[gerrit:1279448|Abandon editor survey: UI updates (T422931)]]
[20:32:58] <stashbot>	 T423923: Deploy config change to start "Exit the editor" survey (v1.0) - https://phabricator.wikimedia.org/T423923
[20:32:58] <stashbot>	 T422931: Implement the "Exit the editor" survey - https://phabricator.wikimedia.org/T422931
[20:34:48] <logmsgbot>	 !log dancy@deploy1003 dancy, caro, kemayo: Backport for [[gerrit:1277569|Enable mobile editor abandonment survey on enwiki (T423923)]], [[gerrit:1279448|Abandon editor survey: UI updates (T422931)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:34:57] <cmede>	 checking~
[20:35:58] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873424 (10Papaul)
[20:38:29] <cmede>	 looks good
[20:40:48] <logmsgbot>	 !log dancy@deploy1003 dancy, caro, kemayo: Continuing with deployment
[20:44:33] <logmsgbot>	 !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277569|Enable mobile editor abandonment survey on enwiki (T423923)]], [[gerrit:1279448|Abandon editor survey: UI updates (T422931)]] (duration: 11m 43s)
[20:44:39] <stashbot>	 T423923: Deploy config change to start "Exit the editor" survey (v1.0) - https://phabricator.wikimedia.org/T423923
[20:44:39] <stashbot>	 T422931: Implement the "Exit the editor" survey - https://phabricator.wikimedia.org/T422931
[20:45:07] <dancy>	 phuedx: Do you want to handle your own deployment?
[20:45:58] <cmede>	 thank you!
[20:46:08] <dancy>	 cmede: You got it
[20:46:22] <phuedx>	 dancy: Can do
[20:46:26] <dancy>	 VadymTS1: Are you lurking?
[20:46:39] <VaymTS1>	 No
[20:46:59] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279431 (https://phabricator.wikimedia.org/T424876) (owner: 10Sbisson)
[20:47:22] <dancy>	 VaymTS1: We can process your changes after phuedx is done.
[20:47:51] <VaymTS1>	 Ok
[20:48:20] <wikibugs>	 (03Merged) 10jenkins-bot: Load TestKitchen earlier [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279431 (https://phabricator.wikimedia.org/T424876) (owner: 10Sbisson)
[20:48:43] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1279431|Load TestKitchen earlier (T424876)]]
[20:48:48] <stashbot>	 T424876: TestKitchen and other extensions loading order may influence group assignments - https://phabricator.wikimedia.org/T424876
[20:50:37] <logmsgbot>	 !log phuedx@deploy1003 phuedx, sbisson: Backport for [[gerrit:1279431|Load TestKitchen earlier (T424876)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:50:46] <phuedx>	 Checking
[20:53:34] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873477 (10Papaul)
[20:55:46] <phuedx>	 Quick browse of enwiki, dewiki, wikidata. Things appear to be working correctly and the logs look clean
[20:56:26] <logmsgbot>	 !log phuedx@deploy1003 phuedx, sbisson: Continuing with deployment
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T2100)
[21:00:16] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279431|Load TestKitchen earlier (T424876)]] (duration: 11m 33s)
[21:00:21] <stashbot>	 T424876: TestKitchen and other extensions loading order may influence group assignments - https://phabricator.wikimedia.org/T424876
[21:00:35] <phuedx>	 dancy: Back to you
[21:00:44] <dancy>	 Thanks.  VadymTS1 ready?
[21:00:47] <VadymTS1>	 yes
[21:01:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1)
[21:01:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1)
[21:04:10] <wikibugs>	 (03Merged) 10jenkins-bot: enwikiversity: Enable the abuse filter block action on English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1)
[21:04:13] <wikibugs>	 (03Merged) 10jenkins-bot: enwikiversity: Add some user rights to the curator user group on English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1)
[21:04:40] <logmsgbot>	 !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1279430|enwikiversity: Enable the abuse filter block action on English Wikiversity (T424053)]], [[gerrit:1278363|enwikiversity: Add some user rights to the curator user group on English Wikiversity (T424445)]]
[21:04:46] <stashbot>	 T424053: Enable the abuse filter block action on English Wikiversity - https://phabricator.wikimedia.org/T424053
[21:04:47] <stashbot>	 T424445: Add some user rights to the curator user group on English Wikiversity - https://phabricator.wikimedia.org/T424445
[21:06:32] <logmsgbot>	 !log dancy@deploy1003 vadymts1, dancy: Backport for [[gerrit:1279430|enwikiversity: Enable the abuse filter block action on English Wikiversity (T424053)]], [[gerrit:1278363|enwikiversity: Add some user rights to the curator user group on English Wikiversity (T424445)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:08:59] <dancy>	 VadymTS1: Are you running checks?
[21:09:00] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[21:09:15] <VadymTS1>	 please wait a minute
[21:09:19] <dancy>	 ok
[21:10:17] <wikibugs>	 (03PS1) 10Aleksandar Mastilovic: Add x_trusted_request and x_wmf_ratelimit_class to webrequest live streams [puppet] - 10https://gerrit.wikimedia.org/r/1279489 (https://phabricator.wikimedia.org/T419736)
[21:10:41] <icinga-wm>	 PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100%
[21:10:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add x_trusted_request and x_wmf_ratelimit_class to webrequest live streams [puppet] - 10https://gerrit.wikimedia.org/r/1279489 (https://phabricator.wikimedia.org/T419736) (owner: 10Aleksandar Mastilovic)
[21:13:44] <VadymTS1>	 I can't start the test because I recently switched to a Mac and I can't use mwdebug here, can you help me?
[21:13:45] <wikibugs>	 (03PS1) 10Bking: cumin: install gnutls-bin package [puppet] - 10https://gerrit.wikimedia.org/r/1279491 (https://phabricator.wikimedia.org/T424672)
[21:13:50] <wikibugs>	 (03PS2) 10Aleksandar Mastilovic: Add x_trusted_request and x_wmf_ratelimit_class to webrequest live streams [puppet] - 10https://gerrit.wikimedia.org/r/1279489 (https://phabricator.wikimedia.org/T419736)
[21:14:06] <dancy>	 VadymTS1: Sure. Let me know what you need me to do
[21:16:06] <VadymTS1>	 I've activated WikimediaDebug here: https://wikitech.wikimedia.org/wiki/Special:WikimediaDebug and I don't know what to do next
[21:16:28] <VadymTS1>	 This is my first time doing this
[21:17:58] <dancy>	 Just to make sure I understand, are you saying you got the debug extension working?
[21:19:09] <VadymTS1>	 yes, I activated the WikimediaDebug cookie on this site
[21:20:31] <VadymTS1>	 I followed this guide: https://wikitech.wikimedia.org/wiki/WikimediaDebug
[21:20:43] <dancy>	 Ok good.  So what you do next is visit a URL that is affected by your changes, enable the extension, and select k8s-mwdebug in the pulldown (it's probably set that way), then reload the page.
[21:20:55] <icinga-wm>	 RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 117.42 ms
[21:21:01] <dancy>	 And verify that whatever effects you expected your changes to have are actually happening.
[21:23:06] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1279465 (https://phabricator.wikimedia.org/T419736) (owner: 10CDanis)
[21:28:04] <VadymTS1>	 My apologies, I need to download the Chrome browser; the extension isn't available for Safari
[21:28:25] <cscott>	 dancy: (go ahead with ZhaoFJx ahead of me once this deploy completes, I need to be away from keyboard for a few minutes)
[21:28:51] <dancy>	 cscott: OK. I'll let you know when we're unstuck
[21:30:21] <dancy>	 Is anyone around who can continue to help VadymTS1? I need to get out of here.
[21:30:31] <dancy>	 If not, I will roll back and revert the two changes.
[21:32:31] <jeena>	 yes I can
[21:32:46] <dancy>	 oh good.  Thanks Jeena.  The deployment is still active in SpiderPig.   
[21:33:22] <jeena>	 you're welcome! VadymTS1 let me know when to continue
[21:34:02] <VadymTS1>	 ok, I've now downloaded Chrome and turned on the debug extension
[21:34:18] <jeena>	 👍
[21:41:59] <jeena>	 VadymTS1: Do you need any help? Or is it still downloading?
[21:42:24] <VadymTS1>	 I don't know how to check the rights
[21:42:39] <VadymTS1>	 *edits
[21:43:37] <VadymTS1>	 What exactly do I need to do to confirm these changes?
[21:44:11] <jeena>	 let me see if I can find out
[21:47:48] <jeena>	 VadymTS1: can you go to Special:AbuseFilter and see if the block action is enabled?
[21:48:04] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873578 (10Papaul) @ssingh important note: The public subnet mask for servers in rack 103.02.22 will be changing from /28 to /27 so we will have to manually...
[21:50:14] <jeena>	 Oh, I guess that only shows up if someone is blocked? I'm not sure
[21:50:18] <jeena>	 still looking
[21:54:48] <jeena>	 VadymTS1: if you have the correct rights, I think there should be a block user option on the Special:AbuseFilter page under actions
[21:55:57] <VadymTS1>	 I don't have rights on Wikiversity; also, I have to look at the group rights on the special pages (for curator)
[21:58:08] <jeena>	 okay, let me try to check
[21:58:14] <VadymTS1>	 Yes, all is correct
[21:58:35] <VadymTS1>	 The curator group has the new rights
[21:59:34] <wikibugs>	 (03CR) 10AKhatun: [C:03+2] stream: move mw-page-html-feature-counts-change-enrich to v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279429 (https://phabricator.wikimedia.org/T424624) (owner: 10AKhatun)
[22:00:05] <jouncebot>	 Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T2200)
[22:01:17] <jeena>	 VadymTS1: so you were able to check the curator rights? What about the abuse filter? unfortunately it doesn't look like I have permissions
[22:01:37] <wikibugs>	 (03Merged) 10jenkins-bot: stream: move mw-page-html-feature-counts-change-enrich to v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279429 (https://phabricator.wikimedia.org/T424624) (owner: 10AKhatun)
[22:02:58] <VadymTS1>	 in my opinion everything works and appeared
[22:03:07] <jeena>	 okay thanks I will proceed
[22:03:15] <logmsgbot>	 !log dancy@deploy1003 vadymts1, dancy: Continuing with deployment
[22:03:25] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:05:20] <jinxer-wm>	 FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 3d 15h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[22:07:06] <logmsgbot>	 !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279430|enwikiversity: Enable the abuse filter block action on English Wikiversity (T424053)]], [[gerrit:1278363|enwikiversity: Add some user rights to the curator user group on English Wikiversity (T424445)]] (duration: 62m 26s)
[22:07:12] <stashbot>	 T424053: Enable the abuse filter block action on English Wikiversity - https://phabricator.wikimedia.org/T424053
[22:07:13] <stashbot>	 T424445: Add some user rights to the curator user group on English Wikiversity - https://phabricator.wikimedia.org/T424445
[22:07:52] <jeena>	 cscott: ready for you
[22:08:48] <jeena>	 VadymTS1: all deployed, thanks for your patience
[22:09:22] <VadymTS1>	 jeena: Thank you for the help, absolutely
[22:09:50] <jeena>	 yw!
[22:10:47] <cscott>	 Ok, I think I'm next?  Let me check that everything else in the queue is merged now. 
[22:10:59] <jeena>	 yeah I think you're the last one!
[22:11:04] <cscott>	 I can spiderpig my own patch, so I think you're off the hook jeena. 
[22:11:37] <jeena>	 okay thanks!
[22:14:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) (owner: Arlolra)
[22:14:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1279450 (https://phabricator.wikimedia.org/T423785) (owner: C. Scott Ananian)
[22:18:21] <wikibugs>	 (Merged) jenkins-bot: Deploy PRV to 12 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) (owner: Arlolra)
[22:18:24] <wikibugs>	 (Merged) jenkins-bot: Deploy Parsoid Read Views to 6 language converter wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1279450 (https://phabricator.wikimedia.org/T423785) (owner: C. Scott Ananian)
[22:18:48] <logmsgbot>	 !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1277770|Deploy PRV to 12 wikis (T424590)]], [[gerrit:1279450|Deploy Parsoid Read Views to 6 language converter wikis (T423785)]]
[22:18:54] <stashbot>	 T424590: Parsoid Read Views to deploy ~2026-04-30 - https://phabricator.wikimedia.org/T424590
[22:18:55] <stashbot>	 T423785: Parsoid Read Views to deploy ~2026-04-20 (Language Converter wikis) - https://phabricator.wikimedia.org/T423785
[22:20:42] <logmsgbot>	 !log cscott@deploy1003 arlolra, cscott: Backport for [[gerrit:1277770|Deploy PRV to 12 wikis (T424590)]], [[gerrit:1279450|Deploy Parsoid Read Views to 6 language converter wikis (T423785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:21:26] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873638 (Papaul) @RobH Remote hands instructions are ready @ https://docs.google.com/document/d/1EW6hxHCQjXPy1PXQWluwOTnCl_AHddI34iOYHdJuvek/edit?tab=t.0 Pl...
[22:35:04] <logmsgbot>	 !log cscott@deploy1003 arlolra, cscott: Continuing with deployment
[22:39:51] <logmsgbot>	 !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277770|Deploy PRV to 12 wikis (T424590)]], [[gerrit:1279450|Deploy Parsoid Read Views to 6 language converter wikis (T423785)]] (duration: 21m 03s)
[22:39:57] <stashbot>	 T424590: Parsoid Read Views to deploy ~2026-04-30 - https://phabricator.wikimedia.org/T424590
[22:39:58] <stashbot>	 T423785: Parsoid Read Views to deploy ~2026-04-20 (Language Converter wikis) - https://phabricator.wikimedia.org/T423785
[22:41:20] <cscott>	 ok, one last patch to go (whew!)
[22:41:28] <cscott>	 this is the exciting one
[22:42:24] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:43:11] <wikibugs>	 SRE, LDAP-Access-Requests: Requesting logstash-access LDAP group access for HakanIST - https://phabricator.wikimedia.org/T424812#11873714 (KFrancis) The NDA has been sent for signatures.  I'll confirm when it's complete. Thanks!
[22:45:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1279452 (https://phabricator.wikimedia.org/T424880) (owner: C. Scott Ananian)
[22:46:27] <dancy>	 cscott: Congrats!
[22:46:34] <dancy>	 Jeena: Thanks again!
[22:46:57] <jeena>	 np!
[22:50:01] <wikibugs>	 (Merged) jenkins-bot: Enable Parsoid Read Views for 20% of enwiki mobile web traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/1279452 (https://phabricator.wikimedia.org/T424880) (owner: C. Scott Ananian)
[22:50:27] <logmsgbot>	 !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1279452|Enable Parsoid Read Views for 20% of enwiki mobile web traffic (T424880)]]
[22:50:32] <stashbot>	 T424880: Parsoid Read Views to deploy 2026-04-29-2026-04-30 (enwiki mobile web) - https://phabricator.wikimedia.org/T424880
[22:52:22] <logmsgbot>	 !log cscott@deploy1003 cscott: Backport for [[gerrit:1279452|Enable Parsoid Read Views for 20% of enwiki mobile web traffic (T424880)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:59:54] <logmsgbot>	 !log cscott@deploy1003 cscott: Continuing with deployment
[23:04:30] <logmsgbot>	 !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279452|Enable Parsoid Read Views for 20% of enwiki mobile web traffic (T424880)]] (duration: 14m 03s)
[23:04:35] <stashbot>	 T424880: Parsoid Read Views to deploy 2026-04-29-2026-04-30 (enwiki mobile web) - https://phabricator.wikimedia.org/T424880
[23:11:42] <wikibugs>	 (PS1) Dduvall: zuul: Upgrade to Zuul 14.2.0 [puppet] - https://gerrit.wikimedia.org/r/1279500 (https://phabricator.wikimedia.org/T424879)
[23:13:31] <wikibugs>	 (PS2) Dduvall: zuul: Upgrade to Zuul 14.2.0 [puppet] - https://gerrit.wikimedia.org/r/1279500 (https://phabricator.wikimedia.org/T424879)
[23:15:42] <wikibugs>	 (CR) ArielGlenn: "Generally seems ok, a few questions left inline" [deployment-charts] - https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: Daniel Kinzler)
[23:16:26] <cscott>	 i'm done, and Parsoid Read Views is live on enwiki mobile web now (yay) for 20% of pages.
[23:18:09] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:18:24] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:18:27] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:18:29] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:18:31] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-ctrl1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:18:33] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:18:38] <wikibugs>	 (PS1) Papaul: Add BGP peering from asw1-23 to core routers and mr1 [homer/public] - https://gerrit.wikimedia.org/r/1279501 (https://phabricator.wikimedia.org/T408892)
[23:19:49] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:19:49] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:19:50] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-ctrl1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:19:52] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:19:58] <wikibugs>	 (CR) CI reject: [V:-1] Add BGP peering from asw1-23 to core routers and mr1 [homer/public] - https://gerrit.wikimedia.org/r/1279501 (https://phabricator.wikimedia.org/T408892) (owner: Papaul)
[23:24:11] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[23:25:59] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:27:04] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:27:38] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1375.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:28:01] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker1375 to eqiad - jclark@cumin1003"
[23:28:04] <wikibugs>	 (PS2) Papaul: Add BGP peering from asw1-23 to core routers and mr1 [homer/public] - https://gerrit.wikimedia.org/r/1279501 (https://phabricator.wikimedia.org/T408892)
[23:28:06] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker1375 to eqiad - jclark@cumin1003"
[23:28:06] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:28:12] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1376.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:28:22] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1377.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:28:28] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1375.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:28:44] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1377.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:29:00] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:29:10] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:29:34] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1379.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:30:03] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1380.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:30:24] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1380.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:31:36] <wikibugs>	 (CR) Papaul: [C:+2] Add BGP peering from asw1-23 to core routers and mr1 [homer/public] - https://gerrit.wikimedia.org/r/1279501 (https://phabricator.wikimedia.org/T408892) (owner: Papaul)
[23:33:04] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1375.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:33:32] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:34:15] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1377.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:34:29] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1380.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:35:49] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:36:02] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1376.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:36:17] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1381.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:36:20] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1377.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:36:39] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1382.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:36:40] <wikibugs>	 (CR) Cwhite: [C:-1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: Tiziano Fogli)
[23:37:14] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1379.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:38:02] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1384.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:39:50] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1375.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:39:57] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1279502
[23:39:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1279502 (owner: TrainBranchBot)
[23:41:13] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1380.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:41:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:42:44] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1375.eqiad.wmnet with OS trixie
[23:42:46] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1376.eqiad.wmnet with OS trixie
[23:42:49] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873863 (Papaul)
[23:42:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873864 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar...
[23:42:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873865 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar...
[23:43:03] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1381.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:43:24] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1382.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:43:31] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1379.eqiad.wmnet with OS trixie
[23:43:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873866 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar...
[23:44:18] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1381.eqiad.wmnet with OS trixie
[23:44:20] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1380.eqiad.wmnet with OS trixie
[23:44:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873868 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar...
[23:44:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873869 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar...
[23:44:41] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1384.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[23:45:23] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1382.eqiad.wmnet with OS trixie
[23:45:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873871 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar...
[23:51:04] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1279502 (owner: TrainBranchBot)
[23:53:14] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873875 (Papaul)
[23:54:43] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1376.eqiad.wmnet with reason: host reimage
[23:54:50] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1375.eqiad.wmnet with reason: host reimage
[23:55:28] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1379.eqiad.wmnet with reason: host reimage
[23:56:04] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1381.eqiad.wmnet with reason: host reimage
[23:56:08] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1380.eqiad.wmnet with reason: host reimage
[23:57:13] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1382.eqiad.wmnet with reason: host reimage
[23:58:50] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1376.eqiad.wmnet with reason: host reimage