[00:27:07] (PS3) Neriah: upload: Return 400 instead of 429 for non-standard thumbnail sizes [puppet] - https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805)
[00:58:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 183207792 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 50776 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:16] PROBLEM - MariaDB Replica Lag: pc5 on pc2015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:08:36] PROBLEM - MariaDB Replica Lag: pc1 on pc2021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:08:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[01:08:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, ...
[01:08:51] IC-313592 51ms 10Gbps wave) {#11372}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr2-eqord:9804&var-interface=xe-0%2F1%2F3 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:08:58] !ack
[01:08:58] 7879 (ACKED) TransitPeeringTransportOutSaturation network sre (cr2-eqord:9804 Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372} xe-0/1/3 gnmi eqiad)
[01:09:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 131868000 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:10:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4600 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:13:23] rzl, are you able to talk me through what you're looking at? Or screenshare?
[01:13:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:17:36] RECOVERY - MariaDB Replica Lag: pc1 on pc2021 is OK: OK slave_sql_lag Replication lag: 0.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:18:16] RECOVERY - MariaDB Replica Lag: pc5 on pc2015 is OK: OK slave_sql_lag Replication lag: 0.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:23:51] RESOLVED: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[01:28:57] !log andrew@cumin2002 START - Cookbook sre.dns.admin DNS admin: depool ulsfo for service: upload-addrs [reason: no reason specified, no task ID specified]
[01:29:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo for service: upload-addrs [reason: no reason specified, no task ID specified]
[01:34:49] FIRING: [17x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:53:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:53:31] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards -
https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[02:00:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:02:27] (PS1) RLazarus: interfaces: Update playbook link [alerts] - https://gerrit.wikimedia.org/r/1278792
[02:03:06] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 1F616EMO)
[02:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 4d 11h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[02:08:28] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 271 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 765, active_shards: 1262, relocating_shards: 0, initializing_shards: 8, unassigned_shards:
[02:08:28] ayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 113, active_shards_percent_as_number: 82.32224396607958 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:09:19] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:28] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1332, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 196, delayed_unassign
[02:09:28] s: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.88845401174167 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:12:14] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q3:rack/setup/install rdb201[34] - https://phabricator.wikimedia.org/T418922#11869336 (Jclark-ctr) @jhancock.wm eqiad servers failed install also. @jijiki when you make change can you fix eqiad and codfw?
[02:16:26] PROBLEM - Host wikikube-worker1039 is DOWN: PING CRITICAL - Packet loss = 100%
[02:21:33] FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:28:14] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS trixie
[02:29:16] PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:29:16] PROBLEM - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:29:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[02:34:19] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater -
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[02:36:14] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 277 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 763, active_shards: 1256, relocating_shards: 0, initializing_shards: 8, unassigned_shards:
[02:36:14] ayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.93085453359426 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:36:28] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 273 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 765, active_shards: 1260, relocating_shards: 0, initializing_shards: 6, unassigned_shards: 267, delayed_unassigned_shards: 0,
[02:36:28] of_pending_tasks: 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1036, active_shards_percent_as_number: 82.1917808219178 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:36:30] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 270 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 765, active_shards: 1263, relocating_shards: 0, initializing_shards: 6, unassigned_shards:
[02:36:30] ayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0,
task_max_waiting_in_queue_millis: 38, active_shards_percent_as_number: 82.38747553816047 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:36:36] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 265 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1268, relocating_shards: 0, initializing_shards: 5, unassigned_shard
[02:36:36] delayed_unassigned_shards: 0, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 321, active_shards_percent_as_number: 82.7136333985649 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:36:38] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 265 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1268, relocating_shards: 0, initializing_shards: 5, unassigned_shard
[02:36:38] delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 128, active_shards_percent_as_number: 82.7136333985649 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:37:14] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1314, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 215,
delayed_unassign
[02:37:14] s: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.71428571428571 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:37:28] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 766, active_shards: 1329, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 199, delayed_unassigned_shards: 0, number_of_pending_ta
[02:37:28] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.69275929549902 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:37:30] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1331, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 199, delayed_unassign
[02:37:30] s: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.8232224396608 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:37:36] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1337, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 191,
delayed_unassign
[02:37:36] s: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.21461187214612 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:37:38] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1337, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 191, delayed_unassign
[02:37:38] s: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.21461187214612 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:50:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 36572304 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:51:24] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1008.eqiad.wmnet with reason: host reimage
[02:51:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 116536 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:55:28] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1008.eqiad.wmnet with reason: host reimage
[03:16:21] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1008.eqiad.wmnet with OS trixie
[03:23:31] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[03:41:40] FIRING: [3x]
SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:50:39] (PS1) Jasmine: role::kafka::main: move to Confluent Kafka 3.7 [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216)
[03:51:57] (CR) Jasmine: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: Jasmine)
[04:09:51] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 129657464 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:10:51] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 17704 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:16:38] (Abandoned) Ryan Kemper: growthbook: Bump vendored job templ 1.0.1 → 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1270558 (https://phabricator.wikimedia.org/T420691) (owner: Ryan Kemper)
[04:33:44] (PS2) Jasmine: kafka-main: set main-eqiad cluster brokers to Confluent distro 77 (3.7) [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216)
[04:35:00] (PS3) Jasmine: kafka-main: set main-eqiad cluster brokers to Confluent distro 77 (3.7) [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216)
[04:35:50] (PS4) Jasmine: kafka-main: set main-eqiad cluster brokers to Confluent distro 77 (3.7) [puppet] - https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216)
[04:38:43] (Abandoned) Ryan Kemper: growthbook: Add automation API key placeholders [deployment-charts] -
https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) (owner: Ryan Kemper)
[04:41:51] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 269825928 and 33 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:42:51] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2641264 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:00:27] (CR) WAN233: [C:+1] change logo at zh-classical wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[05:02:18] (CR) WAN233: [C:+1] change logo at zh-classical wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: WAN233)
[05:07:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T424550
[05:07:07] T424550: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T424550
[05:07:09] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 5469.73 ms
[05:07:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1210 with weight 0 T424550', diff saved to https://phabricator.wikimedia.org/P91814 and previous config saved to /var/cache/conftool/dbconfig/20260429-050718-marostegui.json
[05:07:34] (CR) Marostegui: [C:+2] mariadb: Promote db1210 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1277598 (https://phabricator.wikimedia.org/T424550) (owner: Gerrit maintenance bot)
[05:08:10] !log Starting s5 eqiad failover from db1230 to db1210 - T424550
[05:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable:
cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[05:09:31] SRE, SRE-Access-Requests: Update SSH key for production access – Surbhi Gupta - https://phabricator.wikimedia.org/T422363#11869411 (SGupta-WMF) Resolved→Open Hi, I’ve configured my SSH setup with the new key and can reach the bastion (bast1004.wikimedia.org). I can see my key being offered durin...
[05:10:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T424550', diff saved to https://phabricator.wikimedia.org/P91815 and previous config saved to /var/cache/conftool/dbconfig/20260429-051032-marostegui.json
[05:10:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1210 to s5 primary and set section read-write T424550', diff saved to https://phabricator.wikimedia.org/P91816 and previous config saved to /var/cache/conftool/dbconfig/20260429-051054-marostegui.json
[05:11:38] !log marostegui@dns1004 START - running authdns-update
[05:12:11] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 269.82 ms
[05:12:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1230 T424550', diff saved to https://phabricator.wikimedia.org/P91817 and previous config saved to /var/cache/conftool/dbconfig/20260429-051244-marostegui.json
[05:12:49] T424550: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T424550
[05:13:06] !log marostegui@dns1004 END - running authdns-update
[05:14:31] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache:
https://wikitech.wikimedia.org/wiki/Orchestrator
[05:15:31] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[05:16:33] (PS1) Marostegui: db1230: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1278851
[05:17:13] (PS1) Marostegui: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1278874 (https://phabricator.wikimedia.org/T424550)
[05:17:23] (Abandoned) Marostegui: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1277599 (https://phabricator.wikimedia.org/T424550) (owner: Gerrit maintenance bot)
[05:18:05] (CR) Marostegui: [C:+2] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1278874 (https://phabricator.wikimedia.org/T424550) (owner: Marostegui)
[05:18:31] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[05:18:54] !log marostegui@dns1004 START - running authdns-update
[05:19:06] (CR) Marostegui: [C:+2] db1230: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1278851 (owner: Marostegui)
[05:20:30] !log marostegui@dns1004 END - running authdns-update
[05:20:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1230.eqiad.wmnet with reason: Reimage to Trixie
[05:20:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1230: Reimage to Trixie
[05:21:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1230: Reimage to Trixie
[05:21:29] SRE, SRE-Access-Requests: Update SSH key for production access – Surbhi Gupta - https://phabricator.wikimedia.org/T422363#11869437 (SGupta-WMF) Open→Resolved
[05:21:31] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK:
OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator
[05:22:27] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1230.eqiad.wmnet with OS trixie
[05:24:19] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:31] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator
[05:29:19] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:30:06] (PS1) Marostegui: db1254,db2225: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1279075 (https://phabricator.wikimedia.org/T424615)
[05:30:57] (CR) Marostegui: [C:+2] db1254,db2225: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1279075 (https://phabricator.wikimedia.org/T424615) (owner: Marostegui)
[05:31:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2225.codfw.wmnet with reason: Reimage to Trixie
[05:31:11] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2225: Reimage to Trixie
[05:31:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2225: Reimage to Trixie
[05:31:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1254.eqiad.wmnet with reason: Reimage to Trixie
[05:31:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1254: Reimage to
Trixie
[05:32:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1254: Reimage to Trixie
[05:32:56] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2225.codfw.wmnet with OS trixie
[05:34:49] FIRING: [17x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:35:02] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1254.eqiad.wmnet with OS trixie
[05:37:49] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1230.eqiad.wmnet with reason: host reimage
[05:42:47] (PS1) Abijeet Patro: Don't load general modules as style modules [extensions/Translate] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1279078 (https://phabricator.wikimedia.org/T424618)
[05:43:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/Translate] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1279078 (https://phabricator.wikimedia.org/T424618) (owner: Abijeet Patro)
[05:43:47] (Abandoned) Abijeet Patro: Don't load general modules as style modules [extensions/Translate] (wmf/1.46.0-wmf.24) - https://gerrit.wikimedia.org/r/1279078 (https://phabricator.wikimedia.org/T424618) (owner: Abijeet Patro)
[05:44:16] (PS1) Abijeet Patro: Don't load general modules as style modules [extensions/Translate] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1279079 (https://phabricator.wikimedia.org/T424618)
[05:44:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1230.eqiad.wmnet with reason: host reimage
[05:44:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca"
[extensions/Translate] (wmf/1.46.0-wmf.26) - https://gerrit.wikimedia.org/r/1279079 (https://phabricator.wikimedia.org/T424618) (owner: Abijeet Patro)
[05:44:55] (PS1) Marostegui: installserver: Do not reimage db2251 [puppet] - https://gerrit.wikimedia.org/r/1279080 (https://phabricator.wikimedia.org/T418979)
[05:45:46] (PS1) Marostegui: Revert "db1254,db2225: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1279081
[05:45:59] (PS1) Marostegui: Revert "db1230: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1279082
[05:47:05] (CR) Ayounsi: [C:-1] "I've updated the Wikipage instead: https://wikitech.wikimedia.org/w/index.php?title=Network_monitoring&diff=2407118&oldid=2377392" [alerts] - https://gerrit.wikimedia.org/r/1278792 (owner: RLazarus)
[05:47:54] (CR) Marostegui: [C:+2] installserver: Do not reimage db2251 [puppet] - https://gerrit.wikimedia.org/r/1279080 (https://phabricator.wikimedia.org/T418979) (owner: Marostegui)
[05:49:16] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2225.codfw.wmnet with reason: host reimage
[05:50:13] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1254.eqiad.wmnet with reason: host reimage
[05:53:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:55:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2225.codfw.wmnet with reason: host reimage
[05:57:42] (CR) Marostegui: [C:+2] Revert "db1230: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1279082 (owner: Marostegui)
[05:59:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1254.eqiad.wmnet with reason: host
reimage
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T0600)
[06:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 4d 7h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[06:06:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1230.eqiad.wmnet with OS trixie
[06:08:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1230: after reimage to trixie
[06:12:10] (PS4) Ryan Kemper: growthbook: Drop dead SSO_CONFIG placeholder [deployment-charts] - https://gerrit.wikimedia.org/r/1270559 (https://phabricator.wikimedia.org/T420696)
[06:13:32] (PS1) Marostegui: db1198,db2227: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1279083 (https://phabricator.wikimedia.org/T424792)
[06:15:28] (CR) Marostegui: [C:+2] Revert "db1254,db2225: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1279081 (owner: Marostegui)
[06:15:42] (CR) Marostegui: [C:+2] db1198,db2227: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1279083 (https://phabricator.wikimedia.org/T424792) (owner: Marostegui)
[06:16:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1198.eqiad.wmnet with reason: Reimage to Trixie
[06:17:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1198: Reimage to Trixie
[06:17:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2227.codfw.wmnet with reason: Reimage to Trixie
[06:17:18] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2227: Reimage to Trixie
[06:17:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2227: Reimage to
Trixie [06:18:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1198: Reimage to Trixie [06:18:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2225.codfw.wmnet with OS trixie [06:19:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1198.eqiad.wmnet with OS trixie [06:19:46] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS trixie [06:20:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2225: after reimage to trixie [06:21:48] FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:22:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1254.eqiad.wmnet with OS trixie [06:25:29] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1254: after reimage to trixie [06:31:25] FIRING: [4x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:31:34] (03CR) 10Brouberol: [C:03+1] growthbook: Drop dead SSO_CONFIG placeholder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1270559 (https://phabricator.wikimedia.org/T420696) (owner: 10Ryan Kemper) [06:32:51] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1278603 (https://phabricator.wikimedia.org/T415073) (owner: 10Ryan Kemper) [06:33:21] (03PS6) 10Elukey: services: Add TLS SANs to the evaluators' mesh configs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) [06:33:51] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage [06:34:13] (03CR) 10Elukey: "oh noeeessss! Sorry :( It turns out that my attention is not good if I do 10 things at the time (like renewing TLS certs). Hopefully final" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [06:35:52] (03CR) 10Elukey: [C:03+1] "Change looks good to me! I think that at this point the rollout is safe enough to proceed with eqiad first, but we could also tackle codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [06:36:31] (03PS1) 10Marostegui: mariadb: Remove pc2011 [puppet] - 10https://gerrit.wikimedia.org/r/1279084 (https://phabricator.wikimedia.org/T424012) [06:36:44] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1011 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:36:44] (03CR) 10Elukey: [C:03+1] restbase: migrate envoy TLS proxy services to new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278554 (https://phabricator.wikimedia.org/T424674) (owner: 10Eevans) [06:37:10] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2227.codfw.wmnet with reason: host reimage [06:37:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts pc2011.codfw.wmnet [06:38:00] (03CR) 10Marostegui: [C:03+2] mariadb: Remove pc2011 [puppet] - 10https://gerrit.wikimedia.org/r/1279084 (https://phabricator.wikimedia.org/T424012) (owner: 10Marostegui) [06:38:35] (03PS1) 10Marostegui: Revert "db1198,db2227: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279085 [06:39:04] !log 
marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: host reimage [06:39:54] (03CR) 10JMeybohm: [C:03+2] deployment_server: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1278516 (https://phabricator.wikimedia.org/T424671) (owner: 10Jasmine) [06:40:32] (03PS1) 10Muehlenhoff: Record LDAP access for dtorsani [puppet] - 10https://gerrit.wikimedia.org/r/1279086 [06:41:05] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [06:41:05] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [06:41:05] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 746 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [06:41:05] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Search%23Administration [06:42:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269464 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [06:42:47] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [06:43:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2227.codfw.wmnet with reason: host reimage [06:44:04] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for dtorsani [puppet] - 10https://gerrit.wikimedia.org/r/1279086 (owner: 10Muehlenhoff) [06:44:16] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [06:44:16] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [06:44:16] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [06:44:16] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [06:46:25] FIRING: [4x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:46:44] RECOVERY - Check unit status of 
push_cross_cluster_settings_9200 on cloudelastic1011 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:48:35] marostegui@cumin1003 decommission (PID 2302352) is awaiting input [06:53:38] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc2011.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:53:56] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1230: after reimage to trixie [06:54:09] (03PS2) 10Tiziano Fogli: logstash/filter: increase sockets-timeout for unit tests [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) [06:54:09] (03PS5) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [06:54:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc2011.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:54:49] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:54:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2011.codfw.wmnet [06:55:59] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2011.codfw.wmnet - https://phabricator.wikimedia.org/T424012#11869572 (10Marostegui) a:05Marostegui→03Jhancock.wm [06:56:05] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2011.codfw.wmnet - https://phabricator.wikimedia.org/T424012#11869576 (10Marostegui) Ready for dc-ops [06:56:34] (03CR) 10Marostegui: [C:03+2] Revert "db1198,db2227: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/1279085 (owner: 10Marostegui) [06:58:00] (03CR) 10Tiziano Fogli: logstash/filter: increase sockets-timeout for unit tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [06:58:52] (03Abandoned) 10Ryan Kemper: Revert wdqs deadlock remediation threshold to 600 [puppet] - 10https://gerrit.wikimedia.org/r/1263176 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [06:59:50] (03CR) 10Ryan Kemper: [C:03+2] dse-k8s: Also write set-rbd-readahead logs to journal [puppet] - 10https://gerrit.wikimedia.org/r/1255887 (https://phabricator.wikimedia.org/T419041) (owner: 10Ryan Kemper) [07:00:04] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T0700). [07:00:04] dcausse and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:29] o/ [07:01:18] I can deploy [07:01:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1198.eqiad.wmnet with OS trixie [07:03:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [extensions/WikibaseCirrusSearch] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278509 (https://phabricator.wikimedia.org/T417648) (owner: 10DCausse) [07:03:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269464 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [07:04:31] (03Merged) 10jenkins-bot: search: add alt. 
completion indices to test keyword tokenizer (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269464 (https://phabricator.wikimedia.org/T420427) (owner: 10DCausse) [07:04:50] (03PS7) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) [07:05:11] (03PS8) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) [07:06:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2227.codfw.wmnet with OS trixie [07:06:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2225: after reimage to trixie [07:06:53] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1198: after reimage to trixie [07:10:41] (03PS1) 10Muehlenhoff: idp_clouddev: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279091 (https://phabricator.wikimedia.org/T424676) [07:11:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1254: after reimage to trixie [07:11:19] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2227: after reimage to trixie [07:14:05] (03PS1) 10Arnaudb: phabricator: add -ignore_readdir_race to clean_tmp_files service [puppet] - 10https://gerrit.wikimedia.org/r/1279092 (https://phabricator.wikimedia.org/T424796) [07:15:11] (03Merged) 10jenkins-bot: Completion: fix near match field name [extensions/WikibaseCirrusSearch] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1278509 (https://phabricator.wikimedia.org/T417648) (owner: 10DCausse) [07:15:19] (03CR) 10Muehlenhoff: [C:03+2] idp_clouddev: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279091 (https://phabricator.wikimedia.org/T424676) (owner: 
10Muehlenhoff) [07:16:08] (03PS1) 10Arnaudb: vrts: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1279090 (https://phabricator.wikimedia.org/T424669) [07:16:14] (03CR) 10Arnaudb: [C:03+2] vrts: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1279090 (https://phabricator.wikimedia.org/T424669) (owner: 10Arnaudb) [07:17:19] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1278509|Completion: fix near match field name (T417648)]], [[gerrit:1269464|search: add alt. completion indices to test keyword tokenizer (1/2) (T420427)]] [07:17:24] T417648: [MEX] M4 - improve findability of properties on lookups - https://phabricator.wikimedia.org/T417648 [07:17:25] T420427: Search shouldn't trim trailing space when suggesting suggestions - https://phabricator.wikimedia.org/T420427 [07:19:18] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1278509|Completion: fix near match field name (T417648)]], [[gerrit:1269464|search: add alt. completion indices to test keyword tokenizer (1/2) (T420427)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:20:30] (03PS1) 10Arnaudb: lists: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1279089 (https://phabricator.wikimedia.org/T424669) [07:20:32] !log dcausse@deploy1003 dcausse: Continuing with deployment [07:21:26] 10ops-eqiad, 06DC-Ops, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797 (10JMeybohm) 03NEW [07:21:35] 10ops-eqiad, 06DC-Ops, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11869639 (10JMeybohm) [07:22:59] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1039.eqiad.wmnet [07:23:12] (03CR) 10A smart kitten: "(in case you have any interest in reviewing logo patches, apologies if not `:)`)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: 10WAN233) [07:23:45] (03CR) 10Phuedx: [C:03+1] WikiLambdaApi: update stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278704 (https://phabricator.wikimedia.org/T415254) (owner: 10Santiago Faci) [07:23:45] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [07:24:26] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278509|Completion: fix near match field name (T417648)]], [[gerrit:1269464|search: add alt. 
completion indices to test keyword tokenizer (1/2) (T420427)]] (duration: 07m 07s) [07:24:32] T417648: [MEX] M4 - improve findability of properties on lookups - https://phabricator.wikimedia.org/T417648 [07:24:32] T420427: Search shouldn't trim trailing space when suggesting suggestions - https://phabricator.wikimedia.org/T420427 [07:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394#11869641 (10ayounsi) That's correct. Those switches are also EOL and will be refreshed next FY. New switches will be 25G compatible. [07:25:36] !log jayme@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host wikikube-worker1039.eqiad.wmnet [07:26:08] I'm done deploying [07:26:44] (03PS1) 10Arnaudb: ci: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1279088 (https://phabricator.wikimedia.org/T424669) [07:29:38] (03PS1) 10Muehlenhoff: idm: Unconditionally use Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279095 [07:30:13] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1039.eqiad.wmnet [07:30:15] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1039.eqiad.wmnet [07:30:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11869648 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1003 depool for host wikikube-worker1039.eqi... [07:31:07] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[07:31:37] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11869655 (10JMeybohm) [07:31:38] (03CR) 10CI reject: [V:04-1] idm: Unconditionally use Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279095 (owner: 10Muehlenhoff) [07:31:44] (03PS1) 10Marostegui: db1233,db2189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279096 (https://phabricator.wikimedia.org/T424615) [07:32:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2189.codfw.wmnet with reason: Reimage to Trixie [07:32:37] (03CR) 10Marostegui: [C:03+2] db1233,db2189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279096 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui) [07:32:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2189: Reimage to Trixie [07:32:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1233.eqiad.wmnet with reason: Reimage to Trixie [07:32:46] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1233: Reimage to Trixie [07:32:53] (03PS1) 10Arnaudb: gerrit: switch to new discovery2026 intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1279087 (https://phabricator.wikimedia.org/T424669) [07:33:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1233: Reimage to Trixie [07:33:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2189: Reimage to Trixie [07:34:16] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2189.codfw.wmnet with OS trixie [07:34:31] (03PS1) 10Marostegui: Revert "db1233,db2189: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279097 [07:34:38] (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 
10https://gerrit.wikimedia.org/r/1279097 (owner: 10Marostegui) [07:34:51] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1233.eqiad.wmnet with OS trixie [07:36:09] (03PS2) 10Muehlenhoff: idm: Unconditionally use Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279095 [07:37:52] (03CR) 10Jelto: [C:03+1] "lgtm, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1279092 (https://phabricator.wikimedia.org/T424796) (owner: 10Arnaudb) [07:38:09] (03CR) 10CI reject: [V:04-1] idm: Unconditionally use Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279095 (owner: 10Muehlenhoff) [07:38:17] (03CR) 10Arnaudb: [C:03+2] phabricator: add -ignore_readdir_race to clean_tmp_files service [puppet] - 10https://gerrit.wikimedia.org/r/1279092 (https://phabricator.wikimedia.org/T424796) (owner: 10Arnaudb) [07:38:27] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS trixie [07:39:28] !log T422860 [cloudelastic] Restarted opensearch services on `cloudelastic1011` and `cloudelastic1012` (needed to pick up missing opensearch plugins, which have already been fixed in puppet) (note: this was done ~2h ago; logged in wrong channel) [07:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:32] T422860: Migrate Cloudelastic to OpenSearch 2.x - https://phabricator.wikimedia.org/T422860 [07:43:29] (03PS1) 10Arnaudb: envoyproxy: update verify-envoy-config logic [puppet] - 10https://gerrit.wikimedia.org/r/1278482 (https://phabricator.wikimedia.org/T421827) [07:43:29] (03CR) 10Arnaudb: "the initial change has been split into a relation chain, sorry for the spam!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1278482 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [07:44:44] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 308 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1225, relocating_shards: 0, initializing_shards: 23, unassigned_shar [07:44:44] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.90867579908677 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:44:46] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 308 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1225, relocating_shards: 0, initializing_shards: 23, unassigned_shar [07:44:46] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.90867579908677 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:44:46] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 308 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1225, relocating_shards: 0, initializing_shards: 23, unassigned_shar [07:44:46] 
delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.90867579908677 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:45:20] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 302 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1231, relocating_shards: 0, initializing_shards: 21, unassigned_shar [07:45:20] delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.30006523157208 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:45:32] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 297 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1236, relocating_shards: 0, initializing_shards: 21, unassigned_shar [07:45:32] delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.62622309197651 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:47:48] ^ cluster was green before reimage of a single host, this shouldn't have happened. investigating. 
note this is cloudelastic not prod-cirrus, so not a huge blast radius [07:49:20] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1304, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 208, delayed_unassig [07:49:20] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.06196999347684 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:49:32] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1306, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 206, delayed_unassig [07:49:32] ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.19243313763862 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:49:44] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1311, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 201, delayed_unassig [07:49:44] ds: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 85.51859099804305 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:49:46] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1312, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 200, delayed_unassig [07:49:46] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.58382257012394 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:49:46] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1312, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 200, delayed_unassig [07:49:46] ds: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.58382257012394 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:51:33] ah, I misread the original output; it went green->yellow not green->red. 
sorry for the noise, should quiet down now though [07:51:59] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1277503 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [07:51:59] (03CR) 10Elukey: [C:03+2] services: Add TLS SANs to the evaluators' mesh configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277518 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [07:52:06] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1233.eqiad.wmnet with reason: host reimage [07:52:09] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [07:52:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1198: after reimage to trixie [07:52:44] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [07:53:21] !log a-pizzata@deploy1003 Started deploy [analytics/refinery@d6a17a0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d6a17a0a] [07:53:38] jouncebot: nowandnext [07:53:38] For the next 0 hour(s) and 6 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T0700) [07:53:38] In 2 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1000) [07:53:49] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2189.codfw.wmnet with reason: host reimage [07:54:24] (03PS1) 10Marostegui: db1175,db2194: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279158 (https://phabricator.wikimedia.org/T424792) [07:55:19] !log a-pizzata@deploy1003 Finished deploy [analytics/refinery@d6a17a0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d6a17a0a] (duration: 01m 57s) [07:55:28] (03CR) 10Marostegui: [C:03+2] db1175,db2194: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1279158 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui) [07:55:32] !log a-pizzata@deploy1003 Started deploy [analytics/refinery@d6a17a0]: Regular analytics weekly train [analytics/refinery@d6a17a0a] [07:55:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1175.eqiad.wmnet with reason: Reimage to Trixie [07:56:01] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1175: Reimage to Trixie [07:56:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1175: Reimage to Trixie [07:56:38] (03CR) 10Elukey: [C:03+2] restbase: migrate envoy TLS proxy services to new intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278554 (https://phabricator.wikimedia.org/T424674) (owner: 10Eevans) [07:56:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2227: after reimage to trixie [07:57:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2194.codfw.wmnet with reason: Reimage to Trixie [07:57:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2194: Reimage to Trixie [07:57:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11869732 (10ayounsi) Good job! The last step needed was to run the ImportPuppetDB Netbox script: https://netbox.wikimedia.org/extras/scrip... 
[07:57:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2194: Reimage to Trixie [07:58:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1175.eqiad.wmnet with OS trixie [07:59:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1233.eqiad.wmnet with reason: host reimage [07:59:23] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2194.codfw.wmnet with OS trixie [07:59:40] (03CR) 10Brouberol: [C:03+1] kafka-main: set main-eqiad cluster brokers to Confluent distro 77 (3.7) [puppet] - 10https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [07:59:44] !log a-pizzata@deploy1003 Finished deploy [analytics/refinery@d6a17a0]: Regular analytics weekly train [analytics/refinery@d6a17a0a] (duration: 04m 12s) [08:01:13] (03PS1) 10Jelto: gitlab: rename backup-restore process [puppet] - 10https://gerrit.wikimedia.org/r/1279229 (https://phabricator.wikimedia.org/T424239) [08:01:35] (03PS1) 10MVernon: role::cephadm::rgw: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279230 (https://phabricator.wikimedia.org/T424674) [08:02:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2189.codfw.wmnet with reason: host reimage [08:03:32] (03PS1) 10Marostegui: Revert "db1175,db2194: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279231 [08:03:45] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8489/co" [puppet] - 10https://gerrit.wikimedia.org/r/1279229 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto) [08:06:50] !log a-pizzata@deploy1003 Started deploy [analytics/refinery@d6a17a0] (thin): Regular analytics weekly train THIN [analytics/refinery@d6a17a0a] [08:07:49] (03PS1) 
10Dpogorzelski: ml-serve: fix gpu partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/1279232 [08:08:45] !log a-pizzata@deploy1003 Finished deploy [analytics/refinery@d6a17a0] (thin): Regular analytics weekly train THIN [analytics/refinery@d6a17a0a] (duration: 01m 54s) [08:08:51] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11869774 (10ayounsi) >>! In T327300#11843281, @FCeratto-WMF wrote: > In zarcillo we have the relation `host <-> role <-> rack` and we can label replicas and candidates as depool... [08:09:05] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [08:09:30] RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 750 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [08:09:30] RECOVERY - WMF Cloud -Omega Cluster- - Prod MW AppServer Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 05 Jul 2026 07:49:09 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Search%23Administration [08:09:34] FIRING: [17x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:12:39] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [08:13:42] (03CR) 10Elukey: [C:03+1] role::cephadm::rgw: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279230 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon) [08:14:11] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage [08:14:27] (03PS1) 10Muehlenhoff: puppetserver: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279234 (https://phabricator.wikimedia.org/T424676) [08:14:34] FIRING: [17x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:14:42] (03PS2) 10Muehlenhoff: puppetserver: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279234 (https://phabricator.wikimedia.org/T424676) [08:15:25] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [08:15:29] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 4.752 second response time https://wikitech.wikimedia.org/wiki/Swift [08:15:35] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:15:35] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [08:15:41] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [08:15:41] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [08:15:43] (03CR) 10Marostegui: Revert "db1233,db2189: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279097 (owner: 10Marostegui) [08:15:49] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:15:49] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:15:49] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:16:04] (03CR) 10Marostegui: [C:03+2] Revert "db1233,db2189: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279097 (owner: 10Marostegui) [08:16:25] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [08:16:25] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [08:16:35] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:16:39] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Swift [08:16:39] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.184 second 
response time https://wikitech.wikimedia.org/wiki/Swift [08:16:49] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:16:49] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:16:50] !log disable puppet in apus/codfw for TLS key rollover T424674 [08:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:54] T424674: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674 [08:16:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:17:25] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 1.404 second response time https://wikitech.wikimedia.org/wiki/Swift [08:17:25] Emperor: expected bump? 
[08:17:39] FIRING: DiskSpace: Disk space cloudelastic1010:9100:/srv 13.17% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudelastic1010 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:17:39] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [08:17:45] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 6.433 second response time https://wikitech.wikimedia.org/wiki/Swift [08:17:49] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:17:49] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:17:49] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:17:52] no, I was working on apus, I just want to put that back, then I'll get to the page [08:17:53] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 499.56 ms [08:18:10] FIRING: [17x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:18:25] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [08:18:26] !log re-enable puppet in apus/codfw for TLS key rollover T424674 (no change, incident took over) [08:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:32] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 1.906 second response time https://wikitech.wikimedia.org/wiki/Swift 
[08:18:40] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [08:18:40] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [08:18:40] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Swift [08:18:40] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [08:18:40] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Swift [08:18:50] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [08:18:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage [08:19:50] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 9.205 second response time https://wikitech.wikimedia.org/wiki/Swift [08:19:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... 
[08:19:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [08:19:58] (03CR) 10JavierMonton: [C:03+1] alerts: mw-page-html-feature-counts-change-enrich (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) (owner: 10AKhatun) [08:20:42] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:27] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [08:21:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1233.eqiad.wmnet with OS trixie [08:21:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:22:10] (03CR) 10Elukey: [C:03+1] puppetserver: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279234 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff) [08:22:15] (03CR) 10Elukey: [C:03+2] puppetserver: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279234 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff) [08:24:34] FIRING: [16x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:24:42] !log marostegui@cumin1003 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host db2189.codfw.wmnet with OS trixie [08:24:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [08:24:48] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1233: after reimage to trixie [08:26:08] (03PS1) 10Elukey: role::config_master: move to pki intermediate discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279235 (https://phabricator.wikimedia.org/T424676) [08:27:09] (03CR) 10JMeybohm: [C:03+1] "+1 to do codfw first" [puppet] - 10https://gerrit.wikimedia.org/r/1278832 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [08:28:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2189: after reimage to trixie [08:29:17] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:29:56] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: sync [08:29:59] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [08:31:39] (03CR) 10Dpogorzelski: [C:03+2] ml-serve: fix gpu partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/1279232 (owner: 10Dpogorzelski) [08:34:19] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.94 ms [08:36:07] (03CR) 10Marostegui: [C:03+2] Revert "db1175,db2194: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279231 (owner: 10Marostegui) [08:36:43] (03PS1) 10Muehlenhoff: configmaster: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279237 (https://phabricator.wikimedia.org/T424676) [08:37:04] (03CR) 10Elukey: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1279235 :P" [puppet] - 10https://gerrit.wikimedia.org/r/1279237 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff) [08:37:32] !log urbanecm@deploy1003 mwscript-k8s job started: 
extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki Wikimedia_Apps/Team/Android/TriviaGame 'Wikimedia Apps/Team/Android/Which' came 'first? Game' 'Martin Urbanec (WMF)' '--reason=per [[:phab:T423845]]' # T423845 [08:37:37] T423845: Request to move translatable page: Wikimedia Apps/Team/Android/TriviaGame - https://phabricator.wikimedia.org/T423845 [08:37:46] (03PS1) 10Kevin Bazira: inference-services: allow LLM isvcs to work on ml-serve1014 and ml-serve1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279238 (https://phabricator.wikimedia.org/T418350) [08:38:03] !log urbanecm@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki Wikimedia_Apps/Team/Android/TriviaGame 'Wikimedia Apps/Team/Android/"Which came first?" Game' 'Martin Urbanec (WMF)' '--reason=per [[:phab:T423845]]' # T423845 [08:38:53] !log urbanecm@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki Wikimedia_Apps/Team/Android/TriviaGame 'Wikimedia Apps/Team/Android/"Which came first?" 
Game' 'Martin Urbanec (WMF)' '--reason=per [[:phab:T423845]]' # T423845 [08:39:34] FIRING: [16x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:39:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CheckUser] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1278380 (https://phabricator.wikimedia.org/T420517) (owner: 10STran) [08:40:43] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:40:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1175.eqiad.wmnet with OS trixie [08:41:17] (03PS6) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [08:41:33] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: rename backup-restore process [puppet] - 10https://gerrit.wikimedia.org/r/1279229 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto) [08:42:16] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS trixie [08:42:39] FIRING: DiskSpace: Disk space cloudelastic1010:9100:/srv 8.062% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudelastic1010 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:42:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279235 (https://phabricator.wikimedia.org/T424676) (owner: 10Elukey) [08:43:27] (03CR) 10Muehlenhoff: "All great minds think alike :) +1d yours, gonna abandon mine" [puppet] - 10https://gerrit.wikimedia.org/r/1279237 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff) 
[08:43:34] (03Abandoned) 10Muehlenhoff: configmaster: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279237 (https://phabricator.wikimedia.org/T424676) (owner: 10Muehlenhoff) [08:45:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1175: after reimage to trixie [08:45:15] (03CR) 10Dpogorzelski: [C:03+1] inference-services: allow LLM isvcs to work on ml-serve1014 and ml-serve1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279238 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [08:46:04] (03CR) 10Kevin Bazira: [C:03+2] inference-services: allow LLM isvcs to work on ml-serve1014 and ml-serve1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279238 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [08:48:25] (03Merged) 10jenkins-bot: inference-services: allow LLM isvcs to work on ml-serve1014 and ml-serve1015 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279238 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [08:48:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2194.codfw.wmnet with OS trixie [08:51:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:51:50] !log dpogorzelski@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[08:53:31] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [08:54:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2194: after reimage to trixie [08:56:09] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 229.04 ms [08:56:47] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [08:56:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T419961)', diff saved to https://phabricator.wikimedia.org/P91854 and previous config saved to /var/cache/conftool/dbconfig/20260429-085654-fceratto.json [08:59:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, ... [08:59:51] IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F0%2F1%3A2 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [09:00:13] jmm@cumin2002 reimage (PID 197991) is awaiting input [09:01:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5005.eqsin.wmnet with OS bookworm [09:01:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11869976 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5005.eqsin.wmnet with OS bookworm [09:02:31] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:04:15] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 13Patch-For-Review: Migrate Data Persistence Envoy TLS 
proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11869984 (10MoritzMuehlenhoff) [09:05:32] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11869990 (10MoritzMuehlenhoff) [09:05:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T419961)', diff saved to https://phabricator.wikimedia.org/P91857 and previous config saved to /var/cache/conftool/dbconfig/20260429-090534-fceratto.json [09:06:07] (03CR) 10Arnaudb: [C:03+2] envoy: configure listener buffer and fast open queue length [puppet] - 10https://gerrit.wikimedia.org/r/1277503 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [09:07:40] (03PS2) 10Jelto: sre.gitlab.upgrade: add downtime for failing gitlab-backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) [09:07:40] (03CR) 10Jelto: "I used some of the code from I3a0cc2c0ce747af5b31cdccdb6ad60d290bb2305" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto) [09:07:49] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1278610 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [09:09:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [09:10:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1233: after reimage to trixie [09:10:49] (03PS1) 10Gkyziridis: ml-services: Deploy rr-multilingual latest 
model version on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279244 (https://phabricator.wikimedia.org/T415892) [09:11:55] jmm@cumin2002 reimage (PID 197991) is awaiting input [09:13:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2189: after reimage to trixie [09:15:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P91862 and previous config saved to /var/cache/conftool/dbconfig/20260429-091542-fceratto.json [09:15:52] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1279230 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon) [09:16:42] 06SRE, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11870035 (10tappof) [09:17:39] FIRING: [2x] DiskSpace: Disk space cloudelastic1010:9100:/srv 9.07% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudelastic1010 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:17:51] (03CR) 10Elukey: [C:03+2] role::config_master: move to pki intermediate discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279235 (https://phabricator.wikimedia.org/T424676) (owner: 10Elukey) [09:17:52] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 229.29 ms [09:19:06] (03PS1) 10Marostegui: db1229,db2175: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279245 (https://phabricator.wikimedia.org/T424615) [09:19:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2175.codfw.wmnet with reason: Reimage to Trixie [09:19:51] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2175: Reimage to Trixie [09:20:03] (03CR) 10Marostegui: [C:03+2] 
db1229,db2175: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279245 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui) [09:20:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1229.eqiad.wmnet with reason: Reimage to Trixie [09:20:16] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1229: Reimage to Trixie [09:20:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2175: Reimage to Trixie [09:21:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1229: Reimage to Trixie [09:21:54] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2175.codfw.wmnet with OS trixie [09:22:10] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS trixie [09:22:23] (03CR) 10Arnaudb: "this will be an improvement for the upgrade process, thanks! I think I spotted a small issue, let me know if that does not make sense" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto) [09:23:33] (03PS3) 10Jelto: sre.gitlab.upgrade: add downtime for failing gitlab-backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) [09:24:34] FIRING: [15x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:25:41] !log ayounsi@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: primary network link stable, no task ID specified] [09:25:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P91866 and previous config saved to /var/cache/conftool/dbconfig/20260429-092551-fceratto.json [09:25:59] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.admin 
(exit_code=0) DNS admin: pool ulsfo [reason: primary network link stable, no task ID specified] [09:27:48] (03CR) 10Jelto: sre.gitlab.upgrade: add downtime for failing gitlab-backup-restore.service (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto) [09:28:25] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279247 (https://phabricator.wikimedia.org/T424624) [09:28:26] (03PS1) 10Elukey: Update Yarn, Analytics Webserver, Eventschemas and Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1279246 (https://phabricator.wikimedia.org/T424672) [09:28:32] (03CR) 10Arnaudb: [C:03+1] "thanks for the change and the quick fix, lgtm!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto) [09:28:44] 06SRE, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11870080 (10tappof) [09:30:14] (03PS1) 10Marostegui: Revert "db1229,db2175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279249 [09:30:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1175: after reimage to trixie [09:30:57] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1279246 (https://phabricator.wikimedia.org/T424672) (owner: 10Elukey) [09:31:13] (03CR) 10Jelto: [C:03+2] sre.gitlab.upgrade: add downtime for failing gitlab-backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto) [09:31:37] (03CR) 10Elukey: [C:03+2] Update Yarn, Analytics Webserver, Eventschemas and Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1279246 (https://phabricator.wikimedia.org/T424672) (owner: 10Elukey) [09:32:39] RESOLVED: [2x] DiskSpace: Disk space cloudelastic1010:9100:/srv 9.095% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudelastic1010 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [09:33:47] (03PS1) 10Marostegui: db1166,db2190: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279250 (https://phabricator.wikimedia.org/T424792) [09:34:01] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: add downtime for failing gitlab-backup-restore.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1277234 (https://phabricator.wikimedia.org/T424239) (owner: 10Jelto) [09:34:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1166.eqiad.wmnet with reason: Reimage to Trixie [09:34:16] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1166: Reimage to Trixie [09:34:20] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:34:26] (03CR) 10Marostegui: [C:03+2] Revert "db1229,db2175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279249 (owner: 10Marostegui) [09:34:34] FIRING: [15x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:34:35] (03CR) 10Marostegui: Revert "db1229,db2175: Disable 
notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279249 (owner: 10Marostegui) [09:34:43] (03CR) 10Marostegui: [C:03+2] db1166,db2190: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279250 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui) [09:34:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1166: Reimage to Trixie [09:35:56] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1166.eqiad.wmnet with OS trixie [09:35:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T419961)', diff saved to https://phabricator.wikimedia.org/P91869 and previous config saved to /var/cache/conftool/dbconfig/20260429-093557-fceratto.json [09:36:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [09:36:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T419961)', diff saved to https://phabricator.wikimedia.org/P91870 and previous config saved to /var/cache/conftool/dbconfig/20260429-093624-fceratto.json [09:37:04] (03PS1) 10Tiziano Fogli: prom5003: assign prometheus::pop role [puppet] - 10https://gerrit.wikimedia.org/r/1279251 (https://phabricator.wikimedia.org/T424024) [09:37:06] (03PS1) 10Tiziano Fogli: prom5003: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1279252 (https://phabricator.wikimedia.org/T424024) [09:37:08] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11870110 (10SLyngshede-WMF) Depooling command: ` $ ssh cumin1003.eqiad.wmnet $ sudo cookbook sre.dns.admin depool ulsfo ` [09:37:08] (03PS1) 10Tiziano Fogli: prometheus::pop: enable rsyncd on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1279253 (https://phabricator.wikimedia.org/T424024) [09:37:10] (03PS1) 10Tiziano Fogli: prometheus/eqsin: 
remove 5002, add 5003 [puppet] - 10https://gerrit.wikimedia.org/r/1279254 (https://phabricator.wikimedia.org/T424024) [09:37:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1229.eqiad.wmnet with reason: host reimage [09:39:08] (03PS1) 10Tiziano Fogli: prom5003: clean up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1279255 (https://phabricator.wikimedia.org/T424024) [09:39:23] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 0%, RTA = 562.65 ms [09:39:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2194: after reimage to trixie [09:40:31] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Test noop upgrade on the replica [09:40:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:41:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2190.codfw.wmnet with reason: Reimage to Trixie [09:41:21] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2190: Reimage to Trixie [09:41:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2190: Reimage to Trixie [09:42:09] (03PS1) 10Tiziano Fogli: prometheus/eqsin: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1279256 (https://phabricator.wikimedia.org/T424024) [09:42:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1229.eqiad.wmnet with reason: host reimage [09:43:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T419961)', diff 
saved to https://phabricator.wikimedia.org/P91873 and previous config saved to /var/cache/conftool/dbconfig/20260429-094333-fceratto.json [09:44:06] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Test noop upgrade on the replica [09:44:28] (03PS1) 10Marostegui: Revert "db1166,db2190: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279257 [09:44:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2190.codfw.wmnet with reason: Reimage to Trixie [09:44:34] FIRING: [14x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:44:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2190: Reimage to Trixie [09:44:46] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool db2190: Reimage to Trixie [09:45:52] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2190.codfw.wmnet with OS trixie [09:46:18] (03PS1) 10Arnaudb: jenkins: add log monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1278362 (https://phabricator.wikimedia.org/T421827) [09:46:18] (03CR) 10Arnaudb: [C:03+2] "self merging that change, I've tested the monitoring script in my homedir on contint1002 with no issue" [puppet] - 10https://gerrit.wikimedia.org/r/1278362 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [09:51:22] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1166.eqiad.wmnet with reason: host reimage [09:51:51] (03PS1) 10Volans: cloud management: add RO Netbox for Spicerack [puppet] - 10https://gerrit.wikimedia.org/r/1279258 [09:52:03] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans) [09:52:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:53:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P91874 and previous config saved to /var/cache/conftool/dbconfig/20260429-095341-fceratto.json [09:53:54] (03PS1) 10Btullis: Update the PKI intermediate for the cephosd clusters [puppet] - 10https://gerrit.wikimedia.org/r/1279260 (https://phabricator.wikimedia.org/T424672) [09:53:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:54:30] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:54:32] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2175.codfw.wmnet with OS trixie [09:55:26] (03PS2) 10Volans: cloud management: add RO Netbox for Spicerack [puppet] - 10https://gerrit.wikimedia.org/r/1279258 [09:55:30] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans) [09:55:37] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:55:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:56:00] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279260 (https://phabricator.wikimedia.org/T424672) (owner: 10Btullis) [09:56:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:57:15] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 37.17 ms [09:57:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: host reimage [09:57:55] (03CR) 10Volans: "PCC seems happy:" [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans) [09:58:10] FIRING: [19x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:58:18] Emperor: XioNoX ^^ My bad, I refreshed a dashboard for a test and launched heavyweight queries. [09:58:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279260 (https://phabricator.wikimedia.org/T424672) (owner: 10Btullis) [09:58:59] (03Abandoned) 10Arnaudb: gerrit: disable connection reuse on the httpd → jetty layer [puppet] - 10https://gerrit.wikimedia.org/r/1269479 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [09:59:14] !incidents [09:59:14] 7882 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [09:59:15] 7881 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [09:59:15] 7880 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [09:59:15] 7879 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqord:9804 Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372} xe-0/1/3 gnmi eqiad) [09:59:15] 7877 (RESOLVED) kafka-jumbo1013/Kafka Broker Server (paged) [09:59:46] !ack 7882 [09:59:46] 7882 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 
eqiad) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1000) [10:00:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2175.codfw.wmnet with OS trixie [10:00:16] (03CR) 10Btullis: [C:03+2] Update the PKI intermediate for the cephosd clusters [puppet] - 10https://gerrit.wikimedia.org/r/1279260 (https://phabricator.wikimedia.org/T424672) (owner: 10Btullis) [10:00:57] tappof: thanks for letting us know. You expect it to self-resolve, or will something need kicking? [10:01:08] Emperor: XioNoX It should recover soon. [10:01:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:02:23] (03CR) 10Filippo Giunchedi: [C:03+1] cloud management: add RO Netbox for Spicerack [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans) [10:03:10] FIRING: [18x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:03:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P91875 and previous config saved to /var/cache/conftool/dbconfig/20260429-100349-fceratto.json [10:04:23] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2190.codfw.wmnet with reason: host reimage [10:04:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1229.eqiad.wmnet with OS trixie [10:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery 
expires in 4d 3h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [10:05:37] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:07:14] (03PS1) 10Marostegui: db1229: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279262 [10:07:51] (03CR) 10Marostegui: [C:03+2] db1229: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279262 (owner: 10Marostegui) [10:07:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1229: after reimage to trixie [10:08:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279251 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [10:08:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:08:37] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279252 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [10:08:51] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279253 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [10:09:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:09:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2190.codfw.wmnet with reason: host reimage [10:12:03] !log disable puppet in apus/codfw rgws for TLS key rollover T424674 [10:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:08] T424674: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674 [10:12:31] (03CR) 10MVernon: [C:03+2] 
role::cephadm::rgw: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279230 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon) [10:12:56] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti5005.eqsin.wmnet with OS bookworm [10:13:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11870286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5005.eqsin.wmnet with OS bookworm executed with errors... [10:13:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5005.eqsin.wmnet with OS bookworm [10:13:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11870287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5005.eqsin.wmnet with OS bookworm [10:13:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T419961)', diff saved to https://phabricator.wikimedia.org/P91877 and previous config saved to /var/cache/conftool/dbconfig/20260429-101358-fceratto.json [10:14:10] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [10:14:18] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [10:14:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T419961)', diff saved to https://phabricator.wikimedia.org/P91878 and previous config saved to /var/cache/conftool/dbconfig/20260429-101426-fceratto.json [10:15:33] !log disable puppet in apus/eqiad rgws for TLS key rollover T424674 [10:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:17:11] (03CR) 10Jforrester: "Do we want to name these following MSB (so wikifunctions-evaluator-python/etc.)?" [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [10:17:26] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2175.codfw.wmnet with reason: host reimage [10:19:30] (03CR) 10Marostegui: [C:03+2] Revert "db1166,db2190: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279257 (owner: 10Marostegui) [10:20:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1166.eqiad.wmnet with OS trixie [10:20:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279255 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [10:20:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279254 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [10:21:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T419961)', diff saved to https://phabricator.wikimedia.org/P91879 and previous config saved to /var/cache/conftool/dbconfig/20260429-102142-fceratto.json [10:21:48] FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:22:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [10:23:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11870307 (10BTullis) Thanks all. 
I have now marked those two devices as active in netbox and I have told the Wikidata Platform team that t... [10:23:36] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1039.eqiad.wmnet [10:23:37] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1039.eqiad.wmnet [10:24:34] FIRING: [11x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:24:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1166: after reimage to trixie [10:25:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2175.codfw.wmnet with reason: host reimage [10:27:56] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11870321 (10MoritzMuehlenhoff) [10:29:59] !log installing Envoy upgrades on chartmuseum* T410975 T419637 [10:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:04] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [10:30:05] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [10:31:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P91882 and previous config saved to /var/cache/conftool/dbconfig/20260429-103150-fceratto.json [10:31:57] 06SRE, 10SRE-swift-storage, 10Ceph, 06ServiceOps new, and 2 others: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11870345 (10Blake) 05In progress→03Resolved The service has been excluded from the switchover, and... 
[10:32:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2190.codfw.wmnet with OS trixie [10:32:38] !log installing Envoy upgrades on webperf* T410975 T419637 [10:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:42] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11870353 (10MatthewVernon) [10:34:34] FIRING: [11x] CertAlmostExpired: Certificate for service apus:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:37:23] (03PS1) 10MVernon: role::thanos::frontend: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279265 (https://phabricator.wikimedia.org/T424674) [10:39:10] (03PS1) 10Marostegui: db2175: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279266 [10:41:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P91885 and previous config saved to /var/cache/conftool/dbconfig/20260429-104158-fceratto.json [10:42:47] (03CR) 10Tiziano Fogli: [C:03+2] prom5003: assign prometheus::pop role [puppet] - 10https://gerrit.wikimedia.org/r/1279251 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [10:42:58] (03CR) 10Tiziano Fogli: [C:03+2] prom5003: setup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1279252 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [10:43:19] (03CR) 10Tiziano Fogli: [C:03+2] prometheus::pop: enable rsyncd on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1279253 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [10:43:53] (03CR) 10Marostegui: [C:03+2] db2175: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279266 (owner: 10Marostegui) [10:45:00] !log jmm@cumin2002 
START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5005.eqsin.wmnet with reason: host reimage [10:45:07] (03PS1) 10Jelto: etherpad: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279268 (https://phabricator.wikimedia.org/T420993) [10:46:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:47:43] (03PS1) 10STran: Enable staggered rollout for IRS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 [10:48:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2175.codfw.wmnet with OS trixie [10:48:28] (03PS2) 10STran: Enable staggered rollout for IRS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 (https://phabricator.wikimedia.org/T424075) [10:49:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2175: After reimage [10:49:12] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool db2175: After reimage [10:49:23] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2175: After reimage [10:50:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5005.eqsin.wmnet with reason: host reimage [10:50:38] (03CR) 10Mszwarc: [C:03+1] Enable staggered rollout for IRS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [10:50:42] 06SRE, 10Observability-Metrics, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11870386 (10tappof) [10:52:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db1194 (T419961)', diff saved to https://phabricator.wikimedia.org/P91887 and previous config saved to /var/cache/conftool/dbconfig/20260429-105206-fceratto.json [10:52:27] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [10:52:28] (03PS2) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [10:52:28] (03CR) 10Federico Ceratto: "Flagging CR as ready for an initial review, but we still want to test it as discussed." [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [10:52:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T419961)', diff saved to https://phabricator.wikimedia.org/P91888 and previous config saved to /var/cache/conftool/dbconfig/20260429-105234-fceratto.json [10:53:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1229: after reimage to trixie [10:54:02] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1039.eqiad.wmnet [10:54:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1039.eqiad.wmnet [10:54:54] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1039.eqiad.wmnet [10:55:00] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1039.eqiad.wmnet [10:55:56] (03CR) 10Arnaudb: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1279268 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto) [10:57:25] FIRING: [2x] CertAlmostExpired: Certificate for service phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:57:55] !incidents [10:57:56] 7883 (UNACKED) [2x] CertAlmostExpired sre (phab1004:443 probes/custom eqiad) [10:57:56] 7882 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [10:57:56] 7881 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [10:57:56] 7880 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [10:57:56] 7879 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqord:9804 Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372} xe-0/1/3 gnmi eqiad) [10:57:57] 7877 (RESOLVED) kafka-jumbo1013/Kafka Broker Server (paged) [10:58:00] !ack [10:58:00] 7883 (ACKED) [2x] CertAlmostExpired sre (phab1004:443 probes/custom eqiad) [10:58:11] (03CR) 10Jelto: [C:03+2] etherpad: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279268 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto) [10:58:16] elukey: are on-call about to get p.aged about a lot of certs? [10:59:34] though phab1004 isn't in the link I get from the alert [11:00:04] mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1100). 
[11:00:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T419961)', diff saved to https://phabricator.wikimedia.org/P91891 and previous config saved to /var/cache/conftool/dbconfig/20260429-110005-fceratto.json [11:00:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2190: After reimage [11:06:11] (03PS1) 10Hnowlan: grafana: use discovery2026 intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1279271 (https://phabricator.wikimedia.org/T424673) [11:07:11] (03Abandoned) 10Marostegui: Revert "db1229,db2175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279249 (owner: 10Marostegui) [11:08:19] (03PS1) 10MVernon: role::swift::proxy: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279272 (https://phabricator.wikimedia.org/T424674) [11:10:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1166: after reimage to trixie [11:10:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P91895 and previous config saved to /var/cache/conftool/dbconfig/20260429-111013-fceratto.json [11:11:17] !log installing libpng1.6 security updates [11:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:31] (03PS1) 10Jelto: aphlict: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279273 (https://phabricator.wikimedia.org/T420993) [11:11:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5005.eqsin.wmnet with OS bookworm [11:11:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11870507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5005.eqsin.wmnet with OS bookworm completed: - 
ganeti5... [11:12:02] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279273 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto) [11:12:32] (03PS9) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) [11:13:10] (03CR) 10Jelto: [C:03+2] aphlict: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279273 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto) [11:16:00] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279271 (https://phabricator.wikimedia.org/T424673) (owner: 10Hnowlan) [11:17:49] (03CR) 10Marostegui: sre.mysql.global-read-only Set all sections as RO/RW (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:18:11] (03CR) 10Marostegui: "@Ladsgroup@gmail.com can you also check this please, to make sure nothing MW side would explode." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [11:20:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P91898 and previous config saved to /var/cache/conftool/dbconfig/20260429-112021-fceratto.json [11:23:04] (03PS1) 10Jelto: phabricator: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279274 (https://phabricator.wikimedia.org/T420993) [11:23:39] (03PS1) 10Muehlenhoff: Add ganeti5005 to the routed Ganeti cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1279275 (https://phabricator.wikimedia.org/T421863) [11:27:06] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1279274 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto) [11:28:08] (03CR) 10Jelto: [C:03+2] phabricator: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279274 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto) [11:28:53] (03PS1) 10Brouberol: Restore kerberos API authentication by explicitly setting an empty public role [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279277 (https://phabricator.wikimedia.org/T424761) [11:30:24] (03PS2) 10Brouberol: Restore kerberos API authentication by explicitly setting an empty public role [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279277 (https://phabricator.wikimedia.org/T424761) [11:30:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T419961)', diff saved to https://phabricator.wikimedia.org/P91899 and previous config saved to /var/cache/conftool/dbconfig/20260429-113029-fceratto.json [11:30:51] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [11:30:57] (03PS3) 10Brouberol: Restore kerberos API 
authentication by explicitly setting an empty public role [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279277 (https://phabricator.wikimedia.org/T424761) [11:31:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T419961)', diff saved to https://phabricator.wikimedia.org/P91901 and previous config saved to /var/cache/conftool/dbconfig/20260429-113105-fceratto.json [11:31:25] (03PS1) 10STran: Support staggered rollout via Test Kitchen [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279279 (https://phabricator.wikimedia.org/T424220) [11:31:39] (03PS1) 10STran: Update IRS instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279280 (https://phabricator.wikimedia.org/T424075) [11:32:36] (03PS1) 10Novem Linguae: purge_securepoll: don't exclude private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) [11:32:51] (03CR) 10Btullis: [C:03+1] "Fantastic! Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279277 (https://phabricator.wikimedia.org/T424761) (owner: 10Brouberol) [11:34:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2175: After reimage [11:35:12] (03CR) 10Mszwarc: [C:03+1] Support staggered rollout via Test Kitchen [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279279 (https://phabricator.wikimedia.org/T424220) (owner: 10STran) [11:35:20] (03CR) 10Mszwarc: [C:03+1] Update IRS instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279280 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [11:35:26] (03CR) 10Brouberol: [C:03+2] Restore kerberos API authentication by explicitly setting an empty public role [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279277 (https://phabricator.wikimedia.org/T424761) (owner: 10Brouberol) [11:35:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279279 (https://phabricator.wikimedia.org/T424220) (owner: 10STran) [11:35:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279280 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [11:35:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [11:37:25] RESOLVED: [2x] CertAlmostExpired: Certificate for service 
phab1004:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:38:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T419961)', diff saved to https://phabricator.wikimedia.org/P91903 and previous config saved to /var/cache/conftool/dbconfig/20260429-113813-fceratto.json [11:38:23] (03PS1) 10Jelto: peopleweb: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279282 (https://phabricator.wikimedia.org/T420993) [11:39:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:39:26] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279282 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto) [11:39:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:40:19] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279272 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon) [11:40:55] (03CR) 10Jelto: [C:03+2] peopleweb: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279282 (https://phabricator.wikimedia.org/T420993) (owner: 10Jelto) [11:41:26] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279265 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon) [11:41:55] (03CR) 10Dreamy Jazz: [C:03+1] purge_securepoll: don't exclude private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) (owner: 10Novem Linguae) [11:42:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [11:42:42] (03CR) 10Dpogorzelski: "We don't need to change 
custom_deploy.d/istio/ml-serve/config.yaml, this config is no longer used" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [11:42:56] FIRING: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [11:43:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [11:44:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [11:45:37] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279285 [11:46:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2190: After reimage [11:46:20] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1279256 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [11:46:21] (03PS10) 10Bartosz Wójtowicz: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) [11:46:43] (03CR) 10Btullis: [C:03+2] Configure dse-k8s-worker nodes for ipip encapsulation [puppet] - 10https://gerrit.wikimedia.org/r/1278519 (https://phabricator.wikimedia.org/T420437) (owner: 10Btullis) [11:46:46] (03CR) 10Ayounsi: [C:03+1] Add ganeti5005 to the routed Ganeti cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1279275 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [11:46:50] (03CR) 10Dpogorzelski: [C:03+1] Fix gRPC Gateway protocol to allow TLS 
termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [11:47:15] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [11:47:28] (03PS2) 10Novem Linguae: purge_securepoll: don't exclude private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) [11:47:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [11:47:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:47:56] RESOLVED: [2x] ProbeDown: Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:48:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P91905 and previous config saved to /var/cache/conftool/dbconfig/20260429-114821-fceratto.json [11:48:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:51:19] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-fr-tech: apply [11:51:48] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-fr-tech: apply [11:51:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [11:52:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [11:52:33] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply 
[11:53:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [11:53:16] (03CR) 10Bartosz Wójtowicz: [C:03+2] Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [11:53:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [11:54:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [11:54:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [11:54:34] !log TLS key rollover for thanos-fe T424674 [11:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:39] T424674: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674 [11:54:49] (03CR) 10MVernon: [C:03+2] role::thanos::frontend: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279265 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon) [11:55:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [11:55:32] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) (owner: 10Cathal Mooney) [11:55:56] (03PS1) 10Jelto: doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279287 (https://phabricator.wikimedia.org/T420993) [11:56:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [11:57:00] (03PS2) 10Jelto: doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279287 (https://phabricator.wikimedia.org/T424669) [11:57:07] !log 
brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [11:57:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [11:57:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [11:57:51] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279287 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto) [11:58:10] (03CR) 10Arnaudb: [C:03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1279287 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto) [11:58:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P91906 and previous config saved to /var/cache/conftool/dbconfig/20260429-115829-fceratto.json [12:00:23] (03PS1) 10Marostegui: db1223,db2177: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279289 (https://phabricator.wikimedia.org/T424792) [12:00:33] (03Merged) 10jenkins-bot: Fix gRPC Gateway protocol to allow TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277436 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [12:00:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1223.eqiad.wmnet with reason: Reimage to Trixie [12:00:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1223: Reimage to Trixie [12:00:56] (03CR) 10Cathal Mooney: [C:03+1] "LGTM but I'm not really sure I get why this is beneficial? 
Seems fine but I think I'm missing that bit, maybe in future we start setting " [puppet] - 10https://gerrit.wikimedia.org/r/1278390 (https://phabricator.wikimedia.org/T416360) (owner: 10Ayounsi) [12:00:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2177.codfw.wmnet with reason: Reimage to Trixie [12:00:58] (03CR) 10Marostegui: [C:03+2] db1223,db2177: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279289 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui) [12:01:02] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2177: Reimage to Trixie [12:01:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1223: Reimage to Trixie [12:01:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2177: Reimage to Trixie [12:01:31] (03CR) 10Cathal Mooney: [C:03+2] Add pint ignore rules for CoreRouterInterfaceDropPercent [alerts] - 10https://gerrit.wikimedia.org/r/1277472 (owner: 10Cathal Mooney) [12:01:34] (03CR) 10Jelto: [C:03+2] doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279287 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto) [12:02:57] (03PS1) 10Marostegui: db1197: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279290 (https://phabricator.wikimedia.org/T424615) [12:03:04] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1223.eqiad.wmnet with OS trixie [12:03:12] (03Merged) 10jenkins-bot: Add pint ignore rules for CoreRouterInterfaceDropPercent [alerts] - 10https://gerrit.wikimedia.org/r/1277472 (owner: 10Cathal Mooney) [12:03:16] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2177.codfw.wmnet with OS trixie [12:03:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1197.eqiad.wmnet with reason: Reimage to Trixie 
[12:03:52] (03CR) 10Marostegui: [C:03+2] db1197: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279290 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui) [12:03:55] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1197: Reimage to Trixie [12:04:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2204.codfw.wmnet with reason: Reimage to Trixie [12:04:09] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2204: Reimage to Trixie [12:04:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2204: Reimage to Trixie [12:04:30] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti5005 to the routed Ganeti cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1279275 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [12:04:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1197: Reimage to Trixie [12:04:34] FIRING: [9x] CertAlmostExpired: Certificate for service grafana:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:05:18] jouncebot: nowandnext [12:05:18] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [12:05:18] In 0 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1300) [12:05:43] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1197.eqiad.wmnet with OS trixie [12:05:58] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2204.codfw.wmnet with OS trixie [12:06:23] (03PS1) 10Marostegui: db2204: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279292 (https://phabricator.wikimedia.org/T424615) [12:08:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T419961)', diff saved to https://phabricator.wikimedia.org/P91911 and 
previous config saved to /var/cache/conftool/dbconfig/20260429-120837-fceratto.json [12:09:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [12:09:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T419961)', diff saved to https://phabricator.wikimedia.org/P91912 and previous config saved to /var/cache/conftool/dbconfig/20260429-120907-fceratto.json [12:09:55] (03PS1) 10Marostegui: Revert "db1223,db2177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279298 [12:11:39] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:12:53] 06SRE, 10Observability-Metrics, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11870872 (10tappof) [12:13:02] (03CR) 10Elukey: [C:03+1] cloud management: add RO Netbox for Spicerack [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans) [12:14:14] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:14:20] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:14:35] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:14:51] (03CR) 10Elukey: "Hey James, fine for me, I have already added the configs in k8s for the current naming scheme, but I can change them. Lemme know!" 
[dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [12:16:00] Emperor: sorry just seen your ping now, I think there are few remaining systems with almost expired certs, they shouldn't be paging in theory [12:16:05] did you see otherwise? [12:16:28] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:16:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T419961)', diff saved to https://phabricator.wikimedia.org/P91913 and previous config saved to /var/cache/conftool/dbconfig/20260429-121633-fceratto.json [12:17:54] elukey: yeah, we got paged about phab1004 earlier (hence my question) [12:18:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1223.eqiad.wmnet with reason: host reimage [12:18:38] (03CR) 10Volans: [C:03+2] cloud management: add RO Netbox for Spicerack [puppet] - 10https://gerrit.wikimedia.org/r/1279258 (owner: 10Volans) [12:19:20] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:19:51] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1197.eqiad.wmnet with reason: host reimage [12:20:53] (03PS1) 10Elukey: admin_ng: add extra TLS SANs for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279315 (https://phabricator.wikimedia.org/T424193) [12:21:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet [12:21:16] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[12:21:23] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2177.codfw.wmnet with reason: host reimage [12:21:38] !log TLS key rollover for ms-fe T424674 [12:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:42] T424674: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674 [12:22:02] (03CR) 10MVernon: [C:03+2] role::swift::proxy: use discovery2026 intermediate for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279272 (https://phabricator.wikimedia.org/T424674) (owner: 10MVernon) [12:22:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2204.codfw.wmnet with reason: host reimage [12:22:53] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:23:37] Emperor: not sure why it happened, the CertAlmostExpired definition in the alerts repo doesn't have a page severity option afaics [12:23:51] (03PS1) 10Jelto: doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279317 (https://phabricator.wikimedia.org/T424669) [12:24:15] (03CR) 10Elukey: [C:03+1] doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279317 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto) [12:24:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1223.eqiad.wmnet with reason: host reimage [12:24:56] (03CR) 10Jelto: "The old patch was in the wrong file I89a48749795b414dc51d3e6ff16b3c9d51b488a8. This should be the correct file." 
[puppet] - 10https://gerrit.wikimedia.org/r/1279317 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto) [12:25:30] (03CR) 10Jelto: [C:03+2] doc: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279317 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto) [12:25:34] elukey: IHNI either, maybe something the service owners set up for that service? [12:26:10] I am wondering if it is just for services in service.yaml that can page [12:26:42] we have a dedicated blackbox check for Phab with a pag.ing severity. Maybe this triggered the pag.ing alert [12:26:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P91914 and previous config saved to /var/cache/conftool/dbconfig/20260429-122641-fceratto.json [12:28:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: host reimage [12:28:49] elukey: https://portal.victorops.com/ui/wikimedia/incident/7883/details has the details [12:29:11] yeah see what jelto wrote above --^ [12:29:27] (03PS1) 10Elukey: role::chartmuseum: move to pki discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279320 (https://phabricator.wikimedia.org/T424671) [12:29:53] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[12:30:33] 06SRE, 10Observability-Metrics, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11870959 (10tappof) [12:31:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet [12:31:02] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279320 (https://phabricator.wikimedia.org/T424671) (owner: 10Elukey) [12:32:21] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1197.eqiad.wmnet with reason: host reimage [12:33:10] (03CR) 10Elukey: [C:03+2] role::chartmuseum: move to pki discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279320 (https://phabricator.wikimedia.org/T424671) (owner: 10Elukey) [12:34:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5005.eqsin.wmnet to cluster eqsin02 and group 01 [12:34:44] (03PS1) 10Jelto: releases: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279322 (https://phabricator.wikimedia.org/T424669) [12:35:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5005.eqsin.wmnet to cluster eqsin02 and group 01 [12:36:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2204.codfw.wmnet with reason: host reimage [12:36:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P91915 and previous config saved to /var/cache/conftool/dbconfig/20260429-123648-fceratto.json [12:36:58] (03PS2) 10Elukey: admin_ng: add extra TLS SANs for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279315 (https://phabricator.wikimedia.org/T424193) [12:37:40] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/eqsin: remove 5002, add 5003 [puppet] - 
10https://gerrit.wikimedia.org/r/1279254 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [12:38:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719#11871050 (10Jclark-ctr) @jmeybohm can you update site.pp. it only has servers upto wikik... [12:38:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): 2 devices deleted from netbox that where active - https://phabricator.wikimedia.org/T424019#11871061 (10Jclark-ctr) 05Open→03Resolved [12:39:20] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:39:34] FIRING: [7x] CertAlmostExpired: Certificate for service grafana:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:40:25] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11871065 (10Jclark-ctr) a:03Jclark-ctr @JMeybohm this server is out of warranty. i could swap with a spare from decom server bu... [12:40:55] !log migrate prometheus5002 to prometheus5003 T424024 [12:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:59] T424024: Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024 [12:41:49] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279322 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto) [12:42:30] (03CR) 10AikoChou: "Thanks for working on this!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279244 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:42:43] (03CR) 10AikoChou: [C:03+1] ml-services: Deploy rr-multilingual latest model version on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279244 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:44:20] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:45:14] (03PS1) 10Elukey: role::grafana: migrate to new pki intermediate discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279323 (https://phabricator.wikimedia.org/T424673) [12:46:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1223.eqiad.wmnet with OS trixie [12:46:49] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy rr-multilingual latest model version on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279244 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:46:56] (03CR) 10Muehlenhoff: "There's already https://gerrit.wikimedia.org/r/c/operations/puppet/+/1279271" [puppet] - 10https://gerrit.wikimedia.org/r/1279323 (https://phabricator.wikimedia.org/T424673) (owner: 10Elukey) [12:46:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T419961)', diff saved to https://phabricator.wikimedia.org/P91916 and previous config saved to /var/cache/conftool/dbconfig/20260429-124656-fceratto.json [12:47:17] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance [12:47:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1236 (T419961)', diff saved to https://phabricator.wikimedia.org/P91917 and previous config saved to /var/cache/conftool/dbconfig/20260429-124725-fceratto.json [12:47:53] (03CR) 10Jelto: [C:03+2] releases: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279322 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto) [12:47:54] (03Abandoned) 10Elukey: role::grafana: migrate to new pki intermediate discovery2026 [puppet] - 10https://gerrit.wikimedia.org/r/1279323 (https://phabricator.wikimedia.org/T424673) (owner: 10Elukey) [12:48:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11871082 (10MoritzMuehlenhoff) [12:48:24] (03CR) 10Elukey: [C:03+2] grafana: use discovery2026 intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/1279271 (https://phabricator.wikimedia.org/T424673) (owner: 10Hnowlan) [12:49:00] (03Merged) 10jenkins-bot: ml-services: Deploy rr-multilingual latest model version on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279244 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:49:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11871084 (10MoritzMuehlenhoff) [12:49:32] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11871087 (10MoritzMuehlenhoff) [12:50:10] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11871088 (10MoritzMuehlenhoff) [12:50:20] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11871089 (10JMeybohm) >>! In T424797#11871065, @Jclark-ctr wrote: > @JMeybohm this server is out of warranty. i could swap with a... 
[12:50:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2177.codfw.wmnet with OS trixie [12:51:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1223: after reimage to trixie [12:53:29] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/eqsin: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1279256 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [12:53:32] (03CR) 10Marostegui: [C:03+2] Revert "db1223,db2177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279298 (owner: 10Marostegui) [12:53:55] !log tappof@dns1004 START - running authdns-update [12:54:02] (03CR) 10Marostegui: [C:03+2] db2204: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279292 (https://phabricator.wikimedia.org/T424615) (owner: 10Marostegui) [12:54:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1197.eqiad.wmnet with OS trixie [12:54:34] FIRING: [5x] CertAlmostExpired: Certificate for service grafana:443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:54:43] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279285 (owner: 10Muehlenhoff) [12:54:45] (03PS1) 10Bartosz Wójtowicz: Add 50051 to istio ingressgateway ports for ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279327 (https://phabricator.wikimedia.org/T424049) [12:55:21] (03CR) 10Dpogorzelski: [C:03+1] Add 50051 to istio ingressgateway ports for ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279327 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [12:55:32] !log tappof@dns1004 END - running authdns-update [12:55:37] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:56:02] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [12:56:07] (03PS1) 10Jelto: jenkins: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279329 (https://phabricator.wikimedia.org/T424669) [12:56:46] (03CR) 10Arnaudb: [C:03+1] "looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1279329 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto) [12:56:47] (03PS1) 10Bartosz Dziewoński: CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs [extensions/CentralAuth] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1279330 [12:56:57] (03CR) 10Jelto: [C:03+2] jenkins: Switch to discovery2026 intermediate for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1279329 (https://phabricator.wikimedia.org/T424669) (owner: 10Jelto) [12:57:03] (03PS1) 10Bartosz Dziewoński: CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs [extensions/CentralAuth] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279331 [12:57:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1279330 (owner: 10Bartosz Dziewoński) [12:57:14] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11871111 (10MoritzMuehlenhoff) [12:57:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279331 (owner: 10Bartosz Dziewoński) [12:57:27] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2204.codfw.wmnet with OS trixie [12:57:36] 
jouncebot: next [12:57:36] In 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1300) [12:57:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1197: after reimage to trixie [12:57:45] !log urbanecm@deploy1003 mwscript-k8s job started: GrowthExperiments:reassignMentees --wiki=enwiki --mentor=GrayStorm --performer=GrayStorm --as-job # T418194 [12:57:49] T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194 [12:57:56] (03PS3) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [12:58:11] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [12:58:27] hi folks, i added some small patches to the window, i hope you can fit them in (i don't have deployment access). they are safe to deploy together with other changes. [12:58:32] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2177: after reimage to trixie [12:58:46] (03CR) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [12:59:11] MatmaRex the window is quite busy but I can try. Are you able to test them? 
[12:59:18] yeah [12:59:19] FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:59:37] (03PS1) 10Cathal Mooney: QoS: Map packets marked with DSCP CS1 into low-prirority class [homer/public] - 10https://gerrit.wikimedia.org/r/1279334 (https://phabricator.wikimedia.org/T424640) [12:59:53] (03CR) 10Federico Ceratto: "I added ask_confirmation and more detailed log messages and phabricator updatate. Can I add x1 and x3 as the s* sections or with a differe" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [12:59:59] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2204: after reimage to trixie [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1300). [13:00:05] codenamenoreste, stephanebisson, Tran, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:12] o/ [13:00:22] i'm here [13:00:33] codenamenoreste can you do your patch? 
[13:00:55] (03PS4) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [13:00:58] Or I can help [13:01:51] (03PS5) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [13:02:32] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [13:02:32] codenamenoreste are you able/willing to deploy your own patch or do you want someone else to do it? [13:02:53] (03PS1) 10STran: Instrument link clicks on success pages per spec [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279335 (https://phabricator.wikimedia.org/T424075) [13:02:54] (03CR) 10Bartosz Wójtowicz: [C:03+2] Add 50051 to istio ingressgateway ports for ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279327 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [13:03:35] (03CR) 10Mszwarc: [C:03+1] Instrument link clicks on success pages per spec [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279335 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:03:50] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [13:03:57] (03PS1) 10JMeybohm: Add wikikube-worker13[73-82] to site.pp and preseed [puppet] - 10https://gerrit.wikimedia.org/r/1279336 (https://phabricator.wikimedia.org/T423719) [13:04:20] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:04:34] FIRING: [4x] CertAlmostExpired: Certificate for service grafana:443 is about to expire - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:04:50] I'll start with my patch in the meantime [13:06:10] 06SRE, 10DNS, 06Traffic: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Anthropic - https://phabricator.wikimedia.org/T424785#11871180 (10ssingh) a:03CDobbins [13:06:52] I can't reach deploy1003.eqiad.wmnet. Is there another server I should use? [13:08:43] (03CR) 10Kamila Součková: [C:03+1] Add wikikube-worker13[73-82] to site.pp and preseed [puppet] - 10https://gerrit.wikimedia.org/r/1279336 (https://phabricator.wikimedia.org/T423719) (owner: 10JMeybohm) [13:09:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [13:09:18] (03CR) 10JMeybohm: [C:03+2] Add wikikube-worker13[73-82] to site.pp and preseed [puppet] - 10https://gerrit.wikimedia.org/r/1279336 (https://phabricator.wikimedia.org/T423719) (owner: 10JMeybohm) [13:10:14] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [13:10:25] (03PS1) 10Cathal Mooney: Network QoS: adjust configuration to mark low-priority traffic as CS1 [puppet] - 10https://gerrit.wikimedia.org/r/1279339 (https://phabricator.wikimedia.org/T424640) [13:10:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719#11871190 (10JMeybohm) >>! In T423719#11871050, @Jclark-ctr wrote: > @jmeybohm can you update site.pp. it only... [13:11:01] (03Merged) 10jenkins-bot: Add 50051 to istio ingressgateway ports for ml-staging-codfw. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279327 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [13:11:46] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11871199 (10Jclark-ctr) Dimm has been swapped Thank you old dimm ` BankLabelA CacheSizeInformation Not Available CPUAffinity1 Cur... [13:12:01] stephanebisson: Are you still having trouble? [13:12:44] OK, my problem is resolved. [13:12:59] codenamenoreste are you able to deploy your change or do you need help? [13:13:23] i don't think they have deployment access [13:13:37] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:13:44] I was about to say that ^^ [13:13:59] codenamenoreste OK I'm starting with your patch [13:14:02] RECOVERY - Host wikikube-worker1039 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [13:14:17] Sorry for the delay [13:14:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) (owner: 10Codename Noreste) [13:14:41] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11871223 (10SLyngshede-WMF) [13:15:18] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11871225 (10Eevans) [13:15:28] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:15:30] (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) (owner: 10Codename Noreste) [13:15:36] (03Merged) 10jenkins-bot: lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1271215 (https://phabricator.wikimedia.org/T423100) (owner: 10Codename Noreste) [13:15:40] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Multi-bit memory errors on wikikube-worker1039.eqiad.wmnet - https://phabricator.wikimedia.org/T424797#11871228 (10Jclark-ctr) 05Open→03Resolved [13:16:04] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1271215|lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users (T423100)]] [13:16:08] T423100: [lbwiki] Limit ContentTranslation to autoconfirmed and confirmed users - https://phabricator.wikimedia.org/T423100 [13:16:12] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [13:16:48] codenamenoreste will you be able to test your change against the test servers using the WikimediaDebug browser extension? [13:17:57] !log sbisson@deploy1003 sbisson, codenamenoreste: Backport for [[gerrit:1271215|lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users (T423100)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:18:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279335 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:18:13] I'm going to log in to my alt test account to verify the changes on lbwiki [13:18:23] ^ that patch is just going to go with my current stack [13:18:40] codenamenoreste ready for you to test [13:19:26] (03PS1) 10Muehlenhoff: tlsproxy::envoy: Bump default now that services have moved [puppet] - 10https://gerrit.wikimedia.org/r/1279340 (https://phabricator.wikimedia.org/T420993) [13:19:37] (03PS2) 10Cathal Mooney: Network QoS: adjust configuration to mark low-priority traffic as CS1 [puppet] - 10https://gerrit.wikimedia.org/r/1279339 (https://phabricator.wikimedia.org/T424640) [13:19:53] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279339 (https://phabricator.wikimedia.org/T424640) (owner: 10Cathal Mooney) [13:20:55] using incognito and my alternate account, without the change the content translation extension lists article suggestions, but with the patch activated, it doesn't display anything [13:21:10] ^ such suggestions, I meant [13:21:37] !log installing tiff security updates [13:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:59] codenamenoreste there appears to be a problem with the suggestions system at the moment [13:22:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82] - https://phabricator.wikimedia.org/T423719#11871249 (10Jclark-ctr) @jmeybohm. Sorry for the duplicate work. I just finished moving and cabling everythi... 
[13:22:20] But the patch looks good I think we can go ahead with the change [13:22:28] Go ahead :) [13:22:44] !log sbisson@deploy1003 sbisson, codenamenoreste: Continuing with deployment [13:22:52] (03CR) 10Bking: [C:03+2] wcqs: Migrate to new discovery intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1278610 (https://phabricator.wikimedia.org/T420993) (owner: 10Bking) [13:23:51] (03PS3) 10Cathal Mooney: Network QoS: adjust configuration to mark low-priority traffic as CS1 [puppet] - 10https://gerrit.wikimedia.org/r/1279339 (https://phabricator.wikimedia.org/T424640) [13:24:34] (03PS1) 10Muehlenhoff: Add bast5005 [puppet] - 10https://gerrit.wikimedia.org/r/1279343 (https://phabricator.wikimedia.org/T421863) [13:26:21] (03CR) 10Majavah: zookeeper: allow overriding the zookeeper host ID (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [13:26:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T419961)', diff saved to https://phabricator.wikimedia.org/P91928 and previous config saved to /var/cache/conftool/dbconfig/20260429-132635-fceratto.json [13:26:37] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1271215|lbwiki: Limit ContentTranslation extension to autoconfirmed and confirmed users (T423100)]] (duration: 10m 33s) [13:26:42] T423100: [lbwiki] Limit ContentTranslation to autoconfirmed and confirmed users - https://phabricator.wikimedia.org/T423100 [13:27:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278584 (https://phabricator.wikimedia.org/T417200) (owner: 10Sbisson) [13:27:59] (03Merged) 10jenkins-bot: testwiki: Article Guidance experiment config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278584 (https://phabricator.wikimedia.org/T417200) (owner: 10Sbisson) [13:28:26] !log sbisson@deploy1003 Started 
scap sync-world: Backport for [[gerrit:1278584|testwiki: Article Guidance experiment config (T417200)]] [13:28:30] T417200: Deploy Article Guidance extension to production (testwiki) - https://phabricator.wikimedia.org/T417200 [13:29:20] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:30:16] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1278584|testwiki: Article Guidance experiment config (T417200)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:31:33] (03PS5) 10Andrew Bogott: zookeeper: allow overriding the zookeeper host ID [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) [13:31:34] (03PS3) 10Andrew Bogott: Designate: use zookeeper as the tooz backend, everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1278528 (https://phabricator.wikimedia.org/T422646) [13:31:40] (03CR) 10Muehlenhoff: [C:03+2] alertmanager: add frack networks to iptables allow on 9093 [puppet] - 10https://gerrit.wikimedia.org/r/1269672 (https://phabricator.wikimedia.org/T422888) (owner: 10Dwisehaupt) [13:31:52] (03PS1) 10Tiziano Fogli: Revert "prometheus::pop: enable rsyncd on eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1279345 [13:32:52] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS trixie [13:33:04] (03CR) 10Andrew Bogott: zookeeper: allow overriding the zookeeper host ID (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1278524 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [13:33:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to 
wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11871286 (10Jclark-ctr) [13:33:29] !log sbisson@deploy1003 sbisson: Continuing with deployment [13:34:12] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:34:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:31] Tran will you do your changes or do you want me to? [13:34:34] RESOLVED: [2x] CertAlmostExpired: Certificate for service wcqs:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wcqs:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:34:35] I can do it [13:35:27] FIRING: CertAlmostExpired: Certificate for service wcqs:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wcqs:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:35:35] (03CR) 10Muehlenhoff: [C:03+1] Revert "prometheus::pop: enable rsyncd on eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1279345 (owner: 10Tiziano Fogli) [13:35:43] (03CR) 10Tiziano Fogli: [C:03+2] Revert "prometheus::pop: enable rsyncd on eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/1279345 (owner: 10Tiziano Fogli) [13:36:11] (03PS2) 10Tiziano Fogli: prom5003: clean up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1279255 (https://phabricator.wikimedia.org/T424024) [13:36:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1223: after reimage to trixie [13:36:40] stephanebisson I have one more patch to deploy, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1274928 [13:36:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P91933 and previous config saved to 
/var/cache/conftool/dbconfig/20260429-133643-fceratto.json [13:36:58] (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274928 (https://phabricator.wikimedia.org/T423461) (owner: 10Codename Noreste) [13:37:17] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278584|testwiki: Article Guidance experiment config (T417200)]] (duration: 08m 51s) [13:37:22] T417200: Deploy Article Guidance extension to production (testwiki) - https://phabricator.wikimedia.org/T417200 [13:37:52] Tran over to you [13:38:11] codenamenoreste if there is time at the end of the window [13:38:21] (03CR) 10Tiziano Fogli: [C:03+2] prom5003: clean up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1279255 (https://phabricator.wikimedia.org/T424024) (owner: 10Tiziano Fogli) [13:38:36] it's 8:38 a.m. where I live right now, so we might still have time [13:38:48] !log jclark@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:38:49] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11871308 (10MatthewVernon) [13:39:10] starting [13:39:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:39:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1278380 (https://phabricator.wikimedia.org/T420517) (owner: 10STran) [13:39:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279279 (https://phabricator.wikimedia.org/T424220) 
(owner: 10STran) [13:39:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279280 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:39:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279335 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:39:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:39:52] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Migrate Data Persistence Envoy TLS proxy services to the 2026 discovery intermediate - https://phabricator.wikimedia.org/T424674#11871310 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Rune in the description probably should be more like `open... [13:40:14] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[13:40:28] RESOLVED: [2x] CertAlmostExpired: Certificate for service wcqs:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wcqs:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:40:48] (03Merged) 10jenkins-bot: Enable staggered rollout for IRS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279269 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:41:31] (03Merged) 10jenkins-bot: Update action parameter for bulk blocking instrumented events [extensions/CheckUser] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1278380 (https://phabricator.wikimedia.org/T420517) (owner: 10STran) [13:41:33] (03Merged) 10jenkins-bot: Support staggered rollout via Test Kitchen [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279279 (https://phabricator.wikimedia.org/T424220) (owner: 10STran) [13:42:05] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:42:12] (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1) [13:43:02] (03Merged) 10jenkins-bot: Update IRS instrumentation [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279280 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:43:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1197: after reimage to trixie [13:43:17] (03Merged) 10jenkins-bot: Instrument link clicks on success pages per spec [extensions/ReportIncident] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279335 (https://phabricator.wikimedia.org/T424075) (owner: 10STran) [13:43:48] !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1278380|Update action parameter for bulk blocking instrumented events (T420517)]], [[gerrit:1279279|Support staggered rollout via Test Kitchen (T424220)]], [[gerrit:1279280|Update IRS instrumentation (T424075)]], [[gerrit:1279335|Instrument link clicks on success pages per spec (T424075)]], [[gerrit:1279269|Enable staggered rollout for IRS on testwiki (T [13:43:48] 424075)]] [13:43:58] T420517: Instrument bulk blocking of connected temporary accounts - https://phabricator.wikimedia.org/T420517 [13:43:58] T424220: IRS should support full deployment and experiment rollout percentages - https://phabricator.wikimedia.org/T424220 [13:43:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2177: after reimage to trixie [13:43:59] T424075: Update instrumentation MVP for enwiki 5% rollout - https://phabricator.wikimedia.org/T424075 [13:44:03] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1279346 (https://phabricator.wikimedia.org/T424848) [13:44:05] (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 
(https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1) [13:44:20] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:45:10] (03PS1) 10Elukey: role::crm: update postfix's cfssl pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1279347 (https://phabricator.wikimedia.org/T420993) [13:45:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2204: after reimage to trixie [13:45:40] !log stran@deploy1003 stran: Backport for [[gerrit:1278380|Update action parameter for bulk blocking instrumented events (T420517)]], [[gerrit:1279279|Support staggered rollout via Test Kitchen (T424220)]], [[gerrit:1279280|Update IRS instrumentation (T424075)]], [[gerrit:1279335|Instrument link clicks on success pages per spec (T424075)]], [[gerrit:1279269|Enable staggered rollout for IRS on testwiki (T424075)]] synced t [13:45:40] o the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:46:03] (03CR) 10Codename Noreste: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1) [13:46:16] testing now [13:46:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P91937 and previous config saved to /var/cache/conftool/dbconfig/20260429-134651-fceratto.json [13:47:01] (03PS1) 10Marostegui: db1157,db2156: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279351 (https://phabricator.wikimedia.org/T424792) [13:47:15] so, I still have a patch to check for ukwiki which is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1274928 [13:47:23] tests look good, continuing [13:47:26] !log stran@deploy1003 stran: Continuing with deployment [13:47:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1157.eqiad.wmnet with reason: Reimage to Trixie [13:47:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2156.codfw.wmnet with reason: Reimage to Trixie [13:47:51] (03CR) 10Marostegui: [C:03+2] db1157,db2156: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1279351 (https://phabricator.wikimedia.org/T424792) (owner: 10Marostegui) [13:47:54] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1157: Reimage to Trixie [13:47:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2156: Reimage to Trixie [13:48:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2156: Reimage to Trixie [13:48:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1157: Reimage to Trixie [13:48:38] 06SRE, 10Observability-Metrics, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - 
https://phabricator.wikimedia.org/T424024#11871399 (10tappof) [13:49:54] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2156.codfw.wmnet with OS trixie [13:50:10] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1157.eqiad.wmnet with OS trixie [13:51:14] !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1278380|Update action parameter for bulk blocking instrumented events (T420517)]], [[gerrit:1279279|Support staggered rollout via Test Kitchen (T424220)]], [[gerrit:1279280|Update IRS instrumentation (T424075)]], [[gerrit:1279335|Instrument link clicks on success pages per spec (T424075)]], [[gerrit:1279269|Enable staggered rollout for IRS on testwiki ( [13:51:14] T424075)]] (duration: 07m 26s) [13:51:28] T420517: Instrument bulk blocking of connected temporary accounts - https://phabricator.wikimedia.org/T420517 [13:51:29] T424220: IRS should support full deployment and experiment rollout percentages - https://phabricator.wikimedia.org/T424220 [13:51:29] T424075: Update instrumentation MVP for enwiki 5% rollout - https://phabricator.wikimedia.org/T424075 [13:51:33] done. I think MatmaRex is next? [13:52:01] i don't have deployment access, can anyone else ship the changes? [13:52:14] yeah I'm still in spiderpig. Can you test? 
[13:52:16] MatmaRex I can do it [13:52:22] oh sure, feel free [13:52:22] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage [13:52:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1279330 (owner: 10Bartosz Dziewoński) [13:52:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279331 (owner: 10Bartosz Dziewoński) [13:52:52] one more reminder, I still have one more patch to deploy for ukwiki [13:52:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1279347 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:53:45] !log tappof@cumin1003 START - Cookbook sre.hosts.decommission for hosts prometheus5002.eqsin.wmnet [13:54:25] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:54:42] (03PS1) 10Elukey: pki: add the discovery2026 intermediate in cloud-pki [puppet] - 10https://gerrit.wikimedia.org/r/1279356 (https://phabricator.wikimedia.org/T424549) [13:55:13] (03CR) 10Elukey: [C:03+2] role::crm: update postfix's cfssl pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1279347 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [13:55:23] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279247 (https://phabricator.wikimedia.org/T424624) (owner: 10JavierMonton) [13:55:39] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:55:51] (03Merged) 10jenkins-bot: CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs [extensions/CentralAuth] (wmf/1.46.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1279330 (owner: 10Bartosz Dziewoński) [13:55:52] (03Merged) 10jenkins-bot: CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs [extensions/CentralAuth] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279331 (owner: 10Bartosz Dziewoński) [13:56:03] (03PS1) 10Marostegui: Revert "db1157,db2156: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279359 [13:56:25] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1279330|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]], [[gerrit:1279331|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]] [13:56:45] !log jclark@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:57:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T419961)', diff saved to https://phabricator.wikimedia.org/P91940 and previous config saved to /var/cache/conftool/dbconfig/20260429-135659-fceratto.json [13:57:04] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:57:13] (03PS1) 10Bartosz Wójtowicz: ml-services: Use gRPC port for staging outlink-topic-model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279360 (https://phabricator.wikimedia.org/T424049) [13:58:14] !log sbisson@deploy1003 matmarex, sbisson: Backport for [[gerrit:1279330|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]], [[gerrit:1279331|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:58:36] !log tappof@cumin1003 START - Cookbook sre.dns.netbox [13:58:40] MatmaRex can you test? [13:58:46] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-04-14-215402 to 2026-04-21-184122 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279362 (https://phabricator.wikimedia.org/T402956) [13:58:57] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-04-15-195941 to 2026-04-29-001940 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279363 (https://phabricator.wikimedia.org/T400517) [13:58:59] 06SRE-OnFire, 10SRE-swift-storage, 07Sustainability (Incident Followup): Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913#11871485 (10hnowlan) [13:59:01] yep, looking [13:59:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage [13:59:56] (03CR) 10Dpogorzelski: [C:03+1] ml-services: Use gRPC port for staging outlink-topic-model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279360 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1400) [14:00:24] We're wrapping config deployment [14:00:34] *wrapping up [14:00:37] stephanebisson: thanks, looks good [14:00:42] !log sbisson@deploy1003 matmarex, sbisson: Continuing with deployment [14:01:20] did we finish codenamenoreste's deployments? 
I saw some message about it earlier [14:01:51] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1279367 [14:01:57] ¯\_(ツ)_/¯ [14:02:40] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade evaluators from 2026-04-14-215402 to 2026-04-21-184122 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279362 (https://phabricator.wikimedia.org/T402956) (owner: 10Jforrester) [14:03:05] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Use gRPC port for staging outlink-topic-model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279360 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [14:03:18] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: host reimage [14:03:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:03:34] !log tappof@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1003" [14:04:09] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [14:04:20] FIRING: [3x] JobUnavailable: Reduced availability for job envoy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:04:34] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279330|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]], [[gerrit:1279331|CentralAuthTokenSessionProvider: Add security context to "centralauthtoken is invalid" logs]] 
(duration: 08m 08s) [14:04:50] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-04-14-215402 to 2026-04-21-184122 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279362 (https://phabricator.wikimedia.org/T402956) (owner: 10Jforrester) [14:04:51] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11871539 (10FCeratto-WMF) @ayounsi an amount of data is exposed by https://zarcillo.wikimedia.org/apidocs#/default/get_sections_data_api_v0_sections_get but we can create a simp... [14:05:14] (03Merged) 10jenkins-bot: ml-services: Use gRPC port for staging outlink-topic-model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279360 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [14:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 3d 23h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [14:05:30] (03CR) 10Elukey: [C:03+2] pki: add the discovery2026 intermediate in cloud-pki [puppet] - 10https://gerrit.wikimedia.org/r/1279356 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey) [14:05:40] (03CR) 10Elukey: pki: add the discovery2026 intermediate in cloud-pki [puppet] - 10https://gerrit.wikimedia.org/r/1279356 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey) [14:06:11] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[14:06:16] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:06:40] tappof@cumin1003 decommission (PID 2577680) is awaiting input [14:06:44] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.5.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1279367 (owner: 10Elukey) [14:06:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:07:45] (03PS1) 10Elukey: Upstream release v12.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1279371 [14:08:01] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.5.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1279371 (owner: 10Elukey) [14:08:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: host reimage [14:09:03] (03CR) 10JHathaway: [C:03+1] role::crm: update postfix's cfssl pki intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1279347 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [14:09:10] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2156.codfw.wmnet with reason: host reimage [14:09:20] RESOLVED: [3x] JobUnavailable: Reduced availability for job envoy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1279356 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey) [14:11:56] (03CR) 10Elukey: [C:03+2] pki: add the discovery2026 intermediate in cloud-pki [puppet] - 10https://gerrit.wikimedia.org/r/1279356 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey) [14:13:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2156.codfw.wmnet with reason: 
host reimage [14:14:01] (03CR) 10Jforrester: "Let's keep these ones this way around, and the new (replacement, Rust-based) ones can be "better named"?" [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [14:15:10] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Wed 27 May 2026 01:53:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [14:16:34] (03PS1) 10MVernon: swift: remove 2 drained nodes from rings for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1279372 (https://phabricator.wikimedia.org/T354872) [14:16:57] !log uploaded spicerack_12.5.0 to apt.wikimedia.org bookworm-wikimedia [14:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:43] (03CR) 10Elukey: "sure!" [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [14:18:09] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:18:41] (03PS1) 10Gkyziridis: ml-services: Use concurrency knative metric for rr-multilingual model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279373 (https://phabricator.wikimedia.org/T415892) [14:18:43] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1375 [14:18:51] (03PS1) 10Bartosz Wójtowicz: Enable Knative HTTP/2 auto-detection on ml-staging-codfw. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279375 (https://phabricator.wikimedia.org/T424049) [14:19:24] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:19:32] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:19:56] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1375 [14:19:57] (03CR) 10Dpogorzelski: [C:03+1] Enable Knative HTTP/2 auto-detection on ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279375 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [14:20:22] James_F: o/ I haven't deployed the new mesh/ingress changes yet to prod, they are relatively safe to push forward but I am missing https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1279315. Ping me if you deploy to prod so we can check together [14:20:58] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [14:21:05] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:21:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2152 (T419961)', diff saved to https://phabricator.wikimedia.org/P91941 and previous config saved to /var/cache/conftool/dbconfig/20260429-142105-fceratto.json [14:21:10] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:21:16] elukey: Ack. This is our weekly deploy window now. 
[14:21:22] !log tappof@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1003" [14:21:22] !log tappof@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:21:24] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus5002.eqsin.wmnet [14:21:38] 06SRE, 10Observability-Metrics, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q4): Migrate prometheus5002 to prometheus5003 - https://phabricator.wikimedia.org/T424024#11871640 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by tappof@cumin1003 for hosts: `prometheus5002.eqsin.wmnet... [14:21:46] !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:22:25] (03CR) 10Gkyziridis: [C:03+2] "Merging this HotFix on staging. Tested on experimental." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279373 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [14:22:30] elukey: Do we need to stop deploying before the admin_ng bit is merged? [14:22:41] (03PS2) 10MVernon: swift: remove 2 drained nodes from rings, set for new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1279372 (https://phabricator.wikimedia.org/T354872) [14:23:02] (03CR) 10Bartosz Wójtowicz: [C:03+2] Enable Knative HTTP/2 auto-detection on ml-staging-codfw. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279375 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [14:23:07] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1007.eqiad.wmnet with OS trixie [14:23:18] !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:24:23] (03Merged) 10jenkins-bot: ml-services: Use concurrency knative metric for rr-multilingual model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279373 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [14:24:44] !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:25:06] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.3.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279377 (https://phabricator.wikimedia.org/T419511) [14:25:26] !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:25:53] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:25:54] James_F: in theory no, the new ingress stuff will just sit there on the side [14:25:57] Ack. [14:26:03] So far looks good. 
[14:26:22] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-04-15-195941 to 2026-04-29-001940 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279363 (https://phabricator.wikimedia.org/T400517) (owner: 10Jforrester) [14:26:43] (03PS1) 10Elukey: sre.hosts: fix ipmi() calls after spicerack 12.5.0 [cookbooks] - 10https://gerrit.wikimedia.org/r/1279379 (https://phabricator.wikimedia.org/T418929) [14:27:12] (03CR) 10Elukey: "Related change: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1271631" [cookbooks] - 10https://gerrit.wikimedia.org/r/1279379 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [14:27:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11871688 (10Jclark-ctr) netbox has been updated , network ports configured. Pending ru... [14:28:47] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11871716 (10A_smart_kitten) Prompted by {T424511}, I'm probably gonna try and work a bit from (subsets of) [[https://codesearch.wmclo... 
[14:29:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T419961)', diff saved to https://phabricator.wikimedia.org/P91942 and previous config saved to /var/cache/conftool/dbconfig/20260429-142916-fceratto.json [14:29:33] (03CR) 10Marostegui: [C:03+2] Revert "db1157,db2156: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1279359 (owner: 10Marostegui) [14:29:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1157.eqiad.wmnet with OS trixie [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1430) [14:30:40] (03Merged) 10jenkins-bot: Enable Knative HTTP/2 auto-detection on ml-staging-codfw. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279375 (https://phabricator.wikimedia.org/T424049) (owner: 10Bartosz Wójtowicz) [14:30:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T424654#11871742 (10Jclark-ctr) @BTullis @RKemper Parts have Arrived 2x drives. for replacement of Physical Disk 0:1:4 Physical Disk 0:1:5 Please let me kno... [14:32:08] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:33:43] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11871791 (10MoritzMuehlenhoff) [14:33:56] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:34:26] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-04-15-195941 to 2026-04-29-001940 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279363 (https://phabricator.wikimedia.org/T400517) (owner: 10Jforrester) [14:34:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1157: after reimage to trixie [14:34:42] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:35:20] !log installing zsh updates from Trixie point release [14:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:44] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:36:08] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:36:29] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:36:33] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:37:01] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:37:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2156.codfw.wmnet with OS trixie [14:37:07] !log bking@cloudelastic1010 run smartctl against all physical disks T424852 [14:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:20] T424852: Investigate performance issues in cloudelastic - https://phabricator.wikimedia.org/T424852 [14:37:20] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:37:49] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:37:54] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11871842 (10MoritzMuehlenhoff) [14:38:33] (03CR) 10Phuedx: [C:03+1] Test Kitchen UI: Deploy v1.3.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279377 (https://phabricator.wikimedia.org/T419511) (owner: 10Santiago Faci) [14:39:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P91944 and previous config saved to /var/cache/conftool/dbconfig/20260429-143924-fceratto.json [14:40:10] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1007. [14:40:12] !log mstyles@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [14:40:20] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:40:29] !log mstyles@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:40:36] !log bking@cumin2002 conftool action : set/pooled/yes; selector: dc=eqiad,cluster=cloudelastic,name=cloudelastic1007. 
[14:40:43] (03CR) 10Jforrester: [C:03+1] admin_ng: add extra TLS SANs for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279315 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [14:40:44] !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:40:54] !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:40:56] (03CR) 10Jforrester: [C:03+1] wmnet: add new CNAMEs for wikifunctions evaluators [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [14:41:05] !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:41:11] !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:41:22] !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:41:29] !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:41:33] !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:41:37] !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:41:53] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1007.eqiad.wmnet [14:42:07] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [14:42:30] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [14:43:05] (03PS1) 10Gehel: wdqs: remove duplicate entry in allow list [puppet] - 10https://gerrit.wikimedia.org/r/1279383 (https://phabricator.wikimedia.org/T417573) [14:43:38] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2156: after reimage to trixie [14:43:44] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.3.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279377 (https://phabricator.wikimedia.org/T419511) (owner: 10Santiago Faci) [14:44:01] (03CR) 10Ayounsi: 
[C:03+1] Add bast5005 [puppet] - 10https://gerrit.wikimedia.org/r/1279343 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [14:44:02] (03CR) 10Elukey: [C:03+2] admin_ng: add extra TLS SANs for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279315 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [14:44:28] (03CR) 10Btullis: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1279383 (https://phabricator.wikimedia.org/T417573) (owner: 10Gehel) [14:44:37] (03PS2) 10Jforrester: wikifunctions: Double the number of evaluators from 2 to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1271942 (https://phabricator.wikimedia.org/T419933) [14:45:39] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.3.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279377 (https://phabricator.wikimedia.org/T419511) (owner: 10Santiago Faci) [14:46:05] (03CR) 10Gehel: [C:03+2] wdqs: remove duplicate entry in allow list [puppet] - 10https://gerrit.wikimedia.org/r/1279383 (https://phabricator.wikimedia.org/T417573) (owner: 10Gehel) [14:46:40] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:46:43] !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [14:46:51] !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [14:47:02] !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [14:47:18] !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:47:43] (03CR) 10SBassett: [C:03+2] miscweb: updated image for security landing page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278570 (https://phabricator.wikimedia.org/T423940) (owner: 10Mstyles) [14:48:16] James_F: ingress works nice in staging now! [14:48:26] Excellent. [14:48:40] `curl https://wikifunctions-javascript-evaluator.k8s-staging.discovery.wmnet:30443/_info -i` for example [14:49:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P91947 and previous config saved to /var/cache/conftool/dbconfig/20260429-144932-fceratto.json [14:50:05] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'. [14:50:10] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [14:50:14] (03Merged) 10jenkins-bot: miscweb: updated image for security landing page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1278570 (https://phabricator.wikimedia.org/T423940) (owner: 10Mstyles) [14:50:26] (03CR) 10Nikerabbit: [C:03+1] cxserver: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277294 (https://phabricator.wikimedia.org/T423002) (owner: 10KartikMistry) [14:50:43] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [14:50:46] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. [14:50:58] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. [14:51:01] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [14:51:07] James_F: synced also in prod, I'll wait for your deployments to test ingress in there too [14:51:17] (03PS1) 10Gkyziridis: changeprop: Configure RevertRisk multilingual model on changeprop. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279385 (https://phabricator.wikimedia.org/T415892) [14:51:30] elukey: We're deployed in staging and prod for the week; want me to re-deploy? [14:52:06] !log mstyles@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [14:52:29] !log mstyles@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:52:33] !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:52:53] !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:52:58] !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:53:20] !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:53:24] !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:53:32] !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:53:41] !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:53:45] !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:54:53] (03CR) 10Elukey: [C:03+2] wmnet: add new CNAMEs for wikifunctions evaluators [dns] - 10https://gerrit.wikimedia.org/r/1277099 (https://phabricator.wikimedia.org/T424193) (owner: 10Elukey) [14:55:11] !log elukey@dns1004 START - running authdns-update [14:56:49] !log elukey@dns1004 END - running authdns-update [14:59:17] James_F: oh nice perfect! 
[14:59:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T419961)', diff saved to https://phabricator.wikimedia.org/P91950 and previous config saved to /var/cache/conftool/dbconfig/20260429-145940-fceratto.json [15:00:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [15:00:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2154 (T419961)', diff saved to https://phabricator.wikimedia.org/P91951 and previous config saved to /var/cache/conftool/dbconfig/20260429-150010-fceratto.json [15:00:37] elukey: If this means we can use a shorter string than 'https://function-evaluator-python-evaluator-tls-service.wikifunctions.svc.cluster.local:4970/1/v1/evaluate/' in values-main-orchestrator.yaml I'll be delighted, but having the auditing of the traffic is enough. :-) [15:00:56] !log elukey@dns1004 START - running authdns-update [15:02:26] !log elukey@dns1004 END - running authdns-update [15:04:38] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279247 (https://phabricator.wikimedia.org/T424624) (owner: 10JavierMonton) [15:05:32] !log eevans@cumin1003 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cassandra-dev2001.codfw.wmnet: Renew puppet certificate - eevans@cumin1003 [15:06:37] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279247 (https://phabricator.wikimedia.org/T424624) (owner: 10JavierMonton) [15:07:11] !log eevans@cumin1003 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cassandra-dev2001.codfw.wmnet: Renew puppet certificate - eevans@cumin1003 [15:07:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T419961)', diff saved to https://phabricator.wikimedia.org/P91953 and previous config saved 
to /var/cache/conftool/dbconfig/20260429-150719-fceratto.json [15:08:04] (03CR) 10Muehlenhoff: [C:03+2] Add bast5005 [puppet] - 10https://gerrit.wikimedia.org/r/1279343 (https://phabricator.wikimedia.org/T421863) (owner: 10Muehlenhoff) [15:09:36] !log elukey@cumin1003 START - Cookbook sre.dns.wipe-cache wikifunctions-javascript-evaluator.discovery.wmnet on all recursors [15:09:40] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikifunctions-javascript-evaluator.discovery.wmnet on all recursors [15:09:49] !log elukey@cumin1003 START - Cookbook sre.dns.wipe-cache wikifunctions-python-evaluator.discovery.wmnet on all recursors [15:09:53] !log elukey@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikifunctions-python-evaluator.discovery.wmnet on all recursors [15:10:07] (03PS2) 10CDanis: mwscript-k8s: add --output-file flag [puppet] - 10https://gerrit.wikimedia.org/r/1273905 [15:10:08] (03PS3) 10CDanis: deployment_server: add kubectl wait-job plugin [puppet] - 10https://gerrit.wikimedia.org/r/1273926 [15:10:08] (03PS10) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [15:11:21] James_F: at the moment it becomes wikifunctions-javascript-evaluator.discovery.wmnet:30443, but we won't call that directly, only its mesh equivalent (so once configured, localhost:port). Even shorter :D [15:11:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:47] !log eevans@cumin1003 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cassandra-dev2001.codfw.wmnet: Renew puppet certificate - eevans@cumin1003 [15:11:57] elukey: Excellent! 
[15:12:28] (03CR) 10CI reject: [V:04-1] mwscript-k8s: add --output-file flag [puppet] - 10https://gerrit.wikimedia.org/r/1273905 (owner: 10CDanis) [15:12:38] (03Abandoned) 10Daniel Kinzler: rest gateways: EXPERIMENT: set rate limit by referer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276404 (owner: 10Daniel Kinzler) [15:12:48] (03PS1) 10Gkyziridis: ml-services: Deploy the latest version of rr-multilingual model server on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892) [15:12:55] James_F: one question for you - I'll configure envoy (the mesh sidecar on the orchestrator pod) to be able to call the evaluators, but I'll need some details, like the max timeout allowed, etc. [15:13:04] think about it and lemme know :) [15:13:13] even tomorrow [15:13:26] !log eevans@cumin1003 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for cassandra-dev2001.codfw.wmnet: Renew puppet certificate - eevans@cumin1003 [15:13:27] The orchestrator->evaluator network timeout is currently configured at 10s. Is that sufficient for you? [15:13:32] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1279390 (https://phabricator.wikimedia.org/T424864) [15:13:42] (03Abandoned) 10Daniel Kinzler: rest_gateway: Rename the user_class descriptor key to ratelimit_class. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203786 (https://phabricator.wikimedia.org/T409155) (owner: 10Daniel Kinzler) [15:15:31] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:15:45] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:16:06] (03CR) 10Gkyziridis: "Should I also add more CPU/memory to revertrisk-multilingual-pre-save?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [15:16:19] (03PS1) 10JMeybohm: site.pp: Fix names of repurposed tools-k8s hosts [puppet] - 10https://gerrit.wikimedia.org/r/1279391 (https://phabricator.wikimedia.org/T423719) [15:16:53] (03CR) 10Eevans: [C:03+1] swift: remove 2 drained nodes from rings, set for new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1279372 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [15:17:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P91955 and previous config saved to /var/cache/conftool/dbconfig/20260429-151727-fceratto.json [15:18:52] (03CR) 10JMeybohm: [C:03+2] site.pp: Fix names of repurposed tools-k8s hosts [puppet] - 10https://gerrit.wikimedia.org/r/1279391 (https://phabricator.wikimedia.org/T423719) (owner: 10JMeybohm) [15:19:21] (03CR) 10AikoChou: changeprop: Configure RevertRisk multilingual model on changeprop. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279385 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [15:20:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1157: after reimage to trixie [15:21:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:22:40] (03CR) 10MVernon: [C:03+2] swift: remove 2 drained nodes from rings, set for new-style storage [puppet] - 10https://gerrit.wikimedia.org/r/1279372 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [15:24:51] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2066 [15:27:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P91957 and previous config saved to /var/cache/conftool/dbconfig/20260429-152735-fceratto.json [15:28:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11872131 (10JMeybohm) Thanks for noticing. I've fixed site.pp [15:29:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2156: after reimage to trixie [15:32:58] (03CR) 10AikoChou: "No, the *-pre-save is a separate isvc with a different endpoint. page_change events won’t go to it, so there’s no need to change anything." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [15:33:10] (03PS1) 10CDobbins: wikimedia.org: Add TXT verification for Claude [dns] - 10https://gerrit.wikimedia.org/r/1279402 (https://phabricator.wikimedia.org/T424785) [15:37:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T419961)', diff saved to https://phabricator.wikimedia.org/P91960 and previous config saved to /var/cache/conftool/dbconfig/20260429-153743-fceratto.json [15:38:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [15:38:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2161 (T419961)', diff saved to https://phabricator.wikimedia.org/P91961 and previous config saved to /var/cache/conftool/dbconfig/20260429-153814-fceratto.json [15:40:40] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be2067 [15:45:19] (03CR) 10RLazarus: "Oh, yep, thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1278792 (owner: 10RLazarus) [15:45:23] (03Abandoned) 10RLazarus: interfaces: Update playbook link [alerts] - 10https://gerrit.wikimedia.org/r/1278792 (owner: 10RLazarus) [15:45:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T419961)', diff saved to https://phabricator.wikimedia.org/P91962 and previous config saved to /var/cache/conftool/dbconfig/20260429-154525-fceratto.json [15:48:03] (03PS2) 10Elukey: _cookbook: fix parallel test failures with pytest-xdist (-n auto) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) [15:48:09] (03CR) 10Elukey: _cookbook: fix parallel test failures with pytest-xdist (-n auto) (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1270380 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [15:49:50] (03PS1) 10Atsuko: deployment_server: define more opensearch configs [puppet] - 10https://gerrit.wikimedia.org/r/1279410 (https://phabricator.wikimedia.org/T424248) [15:50:48] (03CR) 10Nikerabbit: [C:03+1] Don't load general modules as style modules [extensions/Translate] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279079 (https://phabricator.wikimedia.org/T424618) (owner: 10Abijeet Patro) [15:51:00] (03PS2) 10Atsuko: deployment_server: define more opensearch configs [puppet] - 10https://gerrit.wikimedia.org/r/1279410 (https://phabricator.wikimedia.org/T424248) [15:52:07] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279410 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [15:52:38] (03CR) 10Bking: [C:03+1] deployment_server: define more opensearch configs [puppet] - 10https://gerrit.wikimedia.org/r/1279410 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [15:52:44] (03CR) 10Atsuko: [C:03+2] deployment_server: define more opensearch configs [puppet] - 10https://gerrit.wikimedia.org/r/1279410 
(https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [15:55:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P91963 and previous config saved to /var/cache/conftool/dbconfig/20260429-155533-fceratto.json [15:56:10] hi Emperor, there are unapplied changes on puppet, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1279372 [15:56:24] can I apply it? [15:58:49] (03PS1) 10Elukey: wmcs: add the pki discovery2026 intermediate public cert [puppet] - 10https://gerrit.wikimedia.org/r/1279413 (https://phabricator.wikimedia.org/T424549) [16:01:02] (03CR) 10JHathaway: [C:03+1] sre.hosts: fix ipmi() calls after spicerack 12.5.0 [cookbooks] - 10https://gerrit.wikimedia.org/r/1279379 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [16:05:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P91964 and previous config saved to /var/cache/conftool/dbconfig/20260429-160541-fceratto.json [16:09:20] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:42] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11872316 (10Ahoelzl) [16:13:17] mvernon@cumin2002 convert-disks (PID 462469) is awaiting input [16:13:54] (03CR) 10Elukey: [C:03+2] wmcs: add the pki discovery2026 intermediate public cert [puppet] - 10https://gerrit.wikimedia.org/r/1279413 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey) [16:15:19] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2066 
[16:15:50] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T419961)', diff saved to https://phabricator.wikimedia.org/P91965 and previous config saved to /var/cache/conftool/dbconfig/20260429-161549-fceratto.json [16:15:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2066.codfw.wmnet with OS bullseye [16:16:08] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11872347 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2066.codfw.wm... [16:16:11] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [16:16:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2066 [16:16:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2163 (T419961)', diff saved to https://phabricator.wikimedia.org/P91966 and previous config saved to /var/cache/conftool/dbconfig/20260429-161619-fceratto.json [16:16:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:16:25] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [16:16:31] (03PS1) 10Elukey: wmcs/cloud: add the discovery2026 pki intermediate config [puppet] - 10https://gerrit.wikimedia.org/r/1279417 (https://phabricator.wikimedia.org/T424549) [16:17:13] (03CR) 10Elukey: [C:03+2] wmcs/cloud: add the discovery2026 pki intermediate config [puppet] - 10https://gerrit.wikimedia.org/r/1279417 (https://phabricator.wikimedia.org/T424549) (owner: 10Elukey) [16:19:24] (03PS4) 10Tiziano Fogli: rsyslog: forward 
thanos-query-frontend logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1275799 (https://phabricator.wikimedia.org/T423986) [16:19:24] (03PS3) 10Tiziano Fogli: logstash/filter: increase sockets-timeout for unit tests [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) [16:19:24] (03PS7) 10Tiziano Fogli: logstash: add thanos-query-frontend filter [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) [16:20:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11872383 (10elukey) Next steps: - Deploy the new spicerack release and https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1279379 - Add a workaro... [16:21:35] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2066 - mvernon@cumin2002" [16:21:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2066 - mvernon@cumin2002" [16:21:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:21:41] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2066.codfw.wmnet 209.0.192.10.in-addr.arpa 9.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:21:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2066.codfw.wmnet 209.0.192.10.in-addr.arpa 9.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:21:46] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2066 [16:21:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2066 
[16:21:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2066 [16:23:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T419961)', diff saved to https://phabricator.wikimedia.org/P91967 and previous config saved to /var/cache/conftool/dbconfig/20260429-162337-fceratto.json [16:28:35] (03CR) 10VadymTS1: [C:03+1] enwikiversity: Add some user rights to the curator user group on English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1) [16:28:49] mvernon@cumin2002 convert-disks (PID 473740) is awaiting input [16:29:19] (03PS1) 10Atsuko: dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248) [16:33:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P91968 and previous config saved to /var/cache/conftool/dbconfig/20260429-163345-fceratto.json [16:34:20] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:20] (03CR) 10CI reject: [V:04-1] dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [16:36:52] (03Abandoned) 10Andrew Bogott: Openstack: use debian.net repo rather than the wmf-hosted repo [puppet] - 10https://gerrit.wikimedia.org/r/1272837 (https://phabricator.wikimedia.org/T423598) (owner: 10Andrew Bogott) [16:38:06] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be2067 [16:38:31] !log mvernon@cumin2002 START - Cookbook 
sre.hosts.reimage for host ms-be2067.codfw.wmnet with OS bullseye [16:38:41] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11872467 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye [16:38:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2067 [16:39:11] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [16:40:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage [16:43:07] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2067 - mvernon@cumin2002" [16:43:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2067 - mvernon@cumin2002" [16:43:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:43:14] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2067.codfw.wmnet 160.16.192.10.in-addr.arpa 0.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:43:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2067.codfw.wmnet 160.16.192.10.in-addr.arpa 0.6.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:43:18] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2067 [16:43:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2067 [16:43:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2067 [16:43:54] !log 
fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P91969 and previous config saved to /var/cache/conftool/dbconfig/20260429-164353-fceratto.json [16:44:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2066.codfw.wmnet with reason: host reimage [16:45:32] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11872473 (10MatthewVernon) [16:51:16] (03PS2) 10Gkyziridis: changeprop: Configure RevertRisk multilingual model on changeprop. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279385 (https://phabricator.wikimedia.org/T415892) [16:52:04] (03CR) 10Novem Linguae: "Hmm. The dblist securepollglobal contains officewiki but doesn't contain arbcom_zhwiki. Maybe I should revert to PS1." [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) (owner: 10Novem Linguae) [16:52:09] (03PS3) 10Gkyziridis: changeprop: Configure RevertRisk multilingual model on changeprop. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1279385 (https://phabricator.wikimedia.org/T415892) [16:54:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T419961)', diff saved to https://phabricator.wikimedia.org/P91970 and previous config saved to /var/cache/conftool/dbconfig/20260429-165401-fceratto.json [16:54:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [16:54:25] (03PS1) 10AKhatun: stream: move mw-page-html-feature-counts-change-enrich to v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279429 (https://phabricator.wikimedia.org/T424624) [16:54:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2164 (T419961)', diff saved to https://phabricator.wikimedia.org/P91971 and previous config saved to /var/cache/conftool/dbconfig/20260429-165431-fceratto.json [16:54:45] (03PS1) 10VadymTS1: enwikiversity: Enable the abuse filter block action on English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) [16:58:03] (03CR) 10Dragoniez: [C:03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1) [16:58:22] (03PS2) 10Gkyziridis: ml-services: Deploy the latest version of rr-multilingual model server on prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279388 (https://phabricator.wikimedia.org/T415892) [16:59:12] (03CR) 10Codename Noreste: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1) [17:00:05] jasmine_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1700). 
[17:01:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage [17:01:58] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T419961)', diff saved to https://phabricator.wikimedia.org/P91972 and previous config saved to /var/cache/conftool/dbconfig/20260429-170157-fceratto.json [17:03:28] (03CR) 10VadymTS1: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1) [17:03:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2066.codfw.wmnet with OS bullseye [17:03:46] (03CR) 10VadymTS1: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1) [17:04:00] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11872566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2066.codfw.wmnet with OS bullseye compl... 
[17:08:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2067.codfw.wmnet with reason: host reimage [17:09:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [17:12:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P91973 and previous config saved to /var/cache/conftool/dbconfig/20260429-171205-fceratto.json [17:13:24] (03PS1) 10Sbisson: Load TestKitchen earlier [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279431 (https://phabricator.wikimedia.org/T424876) [17:14:30] (03PS1) 10Andrew Bogott: magnum: include helm package for magnum-cluster-api driver [puppet] - 10https://gerrit.wikimedia.org/r/1279432 [17:14:30] (03PS1) 10Andrew Bogott: magnum-cluster-api: update versions for worker cluster [puppet] - 10https://gerrit.wikimedia.org/r/1279433 [17:15:15] (03CR) 10Andrew Bogott: [C:03+2] magnum: include helm package for magnum-cluster-api driver [puppet] - 10https://gerrit.wikimedia.org/r/1279432 (owner: 10Andrew Bogott) [17:16:35] (03CR) 10Esanders: [C:03+1] Enable mobile editor abandonment survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277569 (https://phabricator.wikimedia.org/T423923) (owner: 10DLynch) [17:16:58] (03CR) 10Andrew Bogott: [C:03+2] magnum-cluster-api: update versions for worker cluster [puppet] - 10https://gerrit.wikimedia.org/r/1279433 (owner: 10Andrew Bogott) [17:18:32] (03CR) 10ArielGlenn: [C:03+1] "This is a small behaviour change which we should probably watch once this is live. 
Anyways, good catch between you and Bartosz!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 (owner: 10Daniel Kinzler) [17:19:34] (03CR) 10Phuedx: [C:03+1] Load TestKitchen earlier [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279431 (https://phabricator.wikimedia.org/T424876) (owner: 10Sbisson) [17:22:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P91975 and previous config saved to /var/cache/conftool/dbconfig/20260429-172214-fceratto.json [17:27:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2067.codfw.wmnet with OS bullseye [17:27:20] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11872731 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2067.codfw.wmnet with OS bullseye compl... 
[17:30:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279431 (https://phabricator.wikimedia.org/T424876) (owner: 10Sbisson) [17:32:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T419961)', diff saved to https://phabricator.wikimedia.org/P91976 and previous config saved to /var/cache/conftool/dbconfig/20260429-173222-fceratto.json [17:32:45] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [17:32:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T419961)', diff saved to https://phabricator.wikimedia.org/P91977 and previous config saved to /var/cache/conftool/dbconfig/20260429-173253-fceratto.json [17:36:09] (03CR) 10Bartosz Dziewoński: [C:03+1] rest gateway: remove redundant bearerPayload case [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277703 (owner: 10Daniel Kinzler) [17:40:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T419961)', diff saved to https://phabricator.wikimedia.org/P91978 and previous config saved to /var/cache/conftool/dbconfig/20260429-174016-fceratto.json [17:50:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P91979 and previous config saved to /var/cache/conftool/dbconfig/20260429-175024-fceratto.json [17:58:25] (03PS1) 10CDanis: turnilo: webrequest: add ja4h sub-component dimensions [puppet] - 10https://gerrit.wikimedia.org/r/1279439 [18:00:04] jeena and dduvall: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T1800). Please do the needful. 
[18:00:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P91980 and previous config saved to /var/cache/conftool/dbconfig/20260429-180032-fceratto.json [18:03:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 3d 19h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [18:07:45] 10ops-eqiad, 06SRE, 06DC-Ops: verify cables - https://phabricator.wikimedia.org/T424601#11872843 (10VRiley-WMF) 05Open→03Resolved https://netbox.wikimedia.org/dcim/cables/4533/ This cable is connected; the other two should have cables for them now. 
I did have to add a dummy console for the test server [18:09:22] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279442 (https://phabricator.wikimedia.org/T423877) [18:09:25] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279442 (https://phabricator.wikimedia.org/T423877) (owner: 10TrainBranchBot) [18:10:21] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279442 (https://phabricator.wikimedia.org/T423877) (owner: 10TrainBranchBot) [18:10:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T419961)', diff saved to https://phabricator.wikimedia.org/P91981 and previous config saved to /var/cache/conftool/dbconfig/20260429-181041-fceratto.json [18:10:57] (03CR) 10Ottomata: [C:03+1] stream: move mw-page-html-feature-counts-change-enrich to v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279429 (https://phabricator.wikimedia.org/T424624) (owner: 10AKhatun) [18:11:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [18:11:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2167 (T419961)', diff saved to https://phabricator.wikimedia.org/P91982 and previous config saved to /var/cache/conftool/dbconfig/20260429-181111-fceratto.json [18:14:37] (03PS1) 10Andrew Bogott: cluster-api worker: use latest kubeadm, set up k8s env before using helm [puppet] - 10https://gerrit.wikimedia.org/r/1279445 [18:15:45] (03PS2) 10Andrew Bogott: cluster-api worker: use latest kubeadm, set up k8s env before using helm [puppet] - 10https://gerrit.wikimedia.org/r/1279445 [18:15:59] (03CR) 10RLazarus: turnilo: webrequest: add ja4h sub-component dimensions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1279439 (owner: 10CDanis) [18:16:03] !log 
jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.26 refs T423877 [18:16:08] T423877: 1.46.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T423877 [18:16:50] (03CR) 10Andrew Bogott: [C:03+2] cluster-api worker: use latest kubeadm, set up k8s env before using helm [puppet] - 10https://gerrit.wikimedia.org/r/1279445 (owner: 10Andrew Bogott) [18:18:24] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:18:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T419961)', diff saved to https://phabricator.wikimedia.org/P91983 and previous config saved to /var/cache/conftool/dbconfig/20260429-181829-fceratto.json [18:22:40] (03PS1) 10CDanis: turnilo: webrequest: add ja4h sub-component dimensions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279447 [18:28:10] RESOLVED: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:28:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P91984 and previous config saved to /var/cache/conftool/dbconfig/20260429-182837-fceratto.json [18:30:58] (03CR) 10Brouberol: "My bad, this should have been deleted from the puppet repo. 
The canonical configuration now lives in https://gerrit.wikimedia.org/r/plugin" [puppet] - 10https://gerrit.wikimedia.org/r/1279439 (owner: 10CDanis) [18:38:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P91985 and previous config saved to /var/cache/conftool/dbconfig/20260429-183845-fceratto.json [18:39:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277569 (https://phabricator.wikimedia.org/T423923) (owner: 10DLynch) [18:42:09] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:42:24] (03PS1) 10Medelius: Abandon editor survey: UI updates [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279448 (https://phabricator.wikimedia.org/T422931) [18:42:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279448 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius) [18:44:09] (03CR) 10AKhatun: alerts: mw-page-html-feature-counts-change-enrich (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) (owner: 10AKhatun) [18:48:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T419961)', diff saved to 
https://phabricator.wikimedia.org/P91986 and previous config saved to /var/cache/conftool/dbconfig/20260429-184854-fceratto.json [18:49:18] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [18:49:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2181 (T419961)', diff saved to https://phabricator.wikimedia.org/P91987 and previous config saved to /var/cache/conftool/dbconfig/20260429-184925-fceratto.json [18:49:37] (03CR) 10Dragoniez: [C:03+1] "Just so you know, you need to backport this because this repo isn’t deployed in the weekly deployment train. See https://wikitech.wikimedi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1) [18:49:51] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11873039 (10wiki_willy) Hey @elukey - do you have the Supermicro case number for this one? 
Thanks, Willy [18:50:06] (03PS2) 10Anzx: enwikiquote: enable UseSandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279404 (https://phabricator.wikimedia.org/T424863) [18:50:13] (03PS2) 10Anzx: arbcom_enwiki: update logo, icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279422 (https://phabricator.wikimedia.org/T424555) [18:50:18] (03PS3) 10Anzx: cswiki: lift IP cap for edithathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279406 (https://phabricator.wikimedia.org/T424843) [18:50:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279406 (https://phabricator.wikimedia.org/T424843) (owner: 10Anzx) [18:50:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279422 (https://phabricator.wikimedia.org/T424555) (owner: 10Anzx) [18:51:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279404 (https://phabricator.wikimedia.org/T424863) (owner: 10Anzx) [18:54:02] (03PS1) 10C. Scott Ananian: Deploy Parsoid Read Views to 6 language converter wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279450 (https://phabricator.wikimedia.org/T423785) [18:56:24] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11873042 (10Nux) I went through this [[ https://global-search.toolforge.org/?q=%5C%2Fthumb%5C%2F%28%5B%5E%5C%2F%5D%2B%3F%5C%2F%29%7B3... 
[18:56:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T419961)', diff saved to https://phabricator.wikimedia.org/P91988 and previous config saved to /var/cache/conftool/dbconfig/20260429-185634-fceratto.json [18:56:35] (03PS1) 10C. Scott Ananian: Deploy Parsoid Read Views to 12 small wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279451 (https://phabricator.wikimedia.org/T424590) [18:57:29] (03Abandoned) 10C. Scott Ananian: Deploy Parsoid Read Views to 12 small wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279451 (https://phabricator.wikimedia.org/T424590) (owner: 10C. Scott Ananian) [18:57:53] (03CR) 10C. Scott Ananian: [C:03+1] Deploy PRV to 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) (owner: 10Arlolra) [18:57:54] (03PS2) 10AKhatun: alerts: mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) [18:58:26] (03PS3) 10AKhatun: alerts: mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) [18:58:38] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873046 (10Papaul) [18:59:08] (03PS3) 10Arlolra: Deploy PRV to 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) [19:00:46] (03PS2) 10C. 
Scott Ananian: Deploy Parsoid Read Views to 6 language converter wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279450 (https://phabricator.wikimedia.org/T423785) [19:02:13] (03CR) 10RLazarus: [C:03+1] turnilo: webrequest: add ja4h sub-component dimensions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279447 (owner: 10CDanis) [19:06:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P91989 and previous config saved to /var/cache/conftool/dbconfig/20260429-190641-fceratto.json [19:07:22] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be209[7,8] - https://phabricator.wikimedia.org/T424892 (10RobH) 03NEW [19:07:38] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be209[7,8] - https://phabricator.wikimedia.org/T424892#11873092 (10RobH) [19:09:01] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be209[7,8] - https://phabricator.wikimedia.org/T424892#11873096 (10RobH) a:03MatthewVernon Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and... [19:11:29] (03PS1) 10C. Scott Ananian: Enable Parsoid Read Views for 20% of enwiki mobile web traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279452 (https://phabricator.wikimedia.org/T424880) [19:11:31] (03PS1) 10C. Scott Ananian: Increase Parsoid Read Views to 60% of enwiki mobile web traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279453 (https://phabricator.wikimedia.org/T424880) [19:11:33] (03PS1) 10C. 
Scott Ananian: Increase Parsoid Read Views to 100% of enwiki mobile web traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279454 (https://phabricator.wikimedia.org/T424880) [19:12:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) (owner: 10Arlolra) [19:13:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279450 (https://phabricator.wikimedia.org/T423785) (owner: 10C. Scott Ananian) [19:13:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279452 (https://phabricator.wikimedia.org/T424880) (owner: 10C. Scott Ananian) [19:15:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be1098, ms-be1099, ms-be1100 - https://phabricator.wikimedia.org/T424895 (10RobH) 03NEW [19:15:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be1098, ms-be1099, ms-be1100 - https://phabricator.wikimedia.org/T424895#11873155 (10RobH) [19:16:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install ms-be1098, ms-be1099, ms-be1100 - https://phabricator.wikimedia.org/T424895#11873159 (10RobH) a:03MatthewVernon Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/D... 
[19:16:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P91990 and previous config saved to /var/cache/conftool/dbconfig/20260429-191650-fceratto.json [19:17:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1) [19:20:27] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:22:05] (03CR) 10Ottomata: [C:03+1] alerts: mw-page-html-feature-counts-change-enrich [alerts] - 10https://gerrit.wikimedia.org/r/1278559 (https://phabricator.wikimedia.org/T424224) (owner: 10AKhatun) [19:24:35] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns infor for asw1-23-ulsfo - pt1979@cumin2002" [19:25:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns infor for asw1-23-ulsfo - pt1979@cumin2002" [19:25:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:25:17] (03CR) 10CDanis: turnilo: webrequest: add ja4h sub-component dimensions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279447 (owner: 10CDanis) [19:26:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T419961)', diff saved to https://phabricator.wikimedia.org/P91992 and previous config saved to /var/cache/conftool/dbconfig/20260429-192658-fceratto.json [19:27:19] (03Abandoned) 10CDanis: turnilo: webrequest: add ja4h sub-component dimensions [puppet] - 10https://gerrit.wikimedia.org/r/1279439 (owner: 10CDanis) [19:27:21] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance [19:27:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2195 (T419961)', diff saved to https://phabricator.wikimedia.org/P91993 and previous config saved to /var/cache/conftool/dbconfig/20260429-192729-fceratto.json [19:28:34] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873200 (10Papaul) [19:28:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1) [19:29:21] (03PS2) 10Atsuko: dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248) [19:30:57] (03PS1) 10Dzahn: zuul: remove zuul-nodepool config, user, stop service [puppet] - 10https://gerrit.wikimedia.org/r/1279461 (https://phabricator.wikimedia.org/T424879) [19:32:52] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1279461/8490/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1279461 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [19:34:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T419961)', diff saved to https://phabricator.wikimedia.org/P91994 and previous config saved to /var/cache/conftool/dbconfig/20260429-193431-fceratto.json [19:38:39] (03PS1) 10CDanis: haproxy: webrequest: capture ratelimiting headers [puppet] - 10https://gerrit.wikimedia.org/r/1279465 (https://phabricator.wikimedia.org/T419736) [19:38:48] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279465 (https://phabricator.wikimedia.org/T419736) (owner: 10CDanis) [19:38:49] 10ops-ulsfo, 06SRE, 
06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873230 (10Papaul) [19:40:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 101F616EMO) [19:41:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:41:40] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873235 (10Papaul) [19:44:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P91995 and previous config saved to /var/cache/conftool/dbconfig/20260429-194439-fceratto.json [19:44:59] (03PS1) 10Dzahn: zuul: create profile for new zuul-builder replacing nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) [19:45:27] (03PS2) 10Dzahn: zuul: create profile for new zuul-builder replacing nodepool [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) [19:46:03] (03CR) 10Bking: [C:03+1] dse-k8s: adding more opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279423 (https://phabricator.wikimedia.org/T424248) (owner: 10Atsuko) [19:46:11] (03CR) 10VadymTS1: [C:03+1] ukwiki: Remove the patroller user group and adjust various user rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1274928 (https://phabricator.wikimedia.org/T423461) (owner: 
10Codename Noreste) [19:46:50] (03CR) 10Dzahn: [C:04-1] "missing the config template, equivalent to modules/profile/templates/zuul/nodepool.conf.erb and nodepool.yaml.erb" [puppet] - 10https://gerrit.wikimedia.org/r/1279470 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [19:47:53] (03PS2) 10Dzahn: zuul: remove zuul-nodepool config, user, stop service [puppet] - 10https://gerrit.wikimedia.org/r/1279461 (https://phabricator.wikimedia.org/T424879) [19:48:33] (03CR) 10Dzahn: "should we just do this now before even upgrading? should it wait until after builder is installed? does it matter?" [puppet] - 10https://gerrit.wikimedia.org/r/1279461 (https://phabricator.wikimedia.org/T424879) (owner: 10Dzahn) [19:54:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P91996 and previous config saved to /var/cache/conftool/dbconfig/20260429-195447-fceratto.json [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T2000). [20:00:05] phuedx, cmede, anzx, cscott, VadymTS1, and ZhaoFJx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] o/ [20:00:09] o/ [20:00:10] o/ [20:00:11] o/ [20:00:33] (03PS1) 10CDanis: base::kernel: ban algif_aead [puppet] - 10https://gerrit.wikimedia.org/r/1279473 [20:01:51] (03PS1) 10Andrew Bogott: setup_capi.sh.erb: don't manually install certmanager [puppet] - 10https://gerrit.wikimedia.org/r/1279474 [20:02:36] (03CR) 10Andrew Bogott: [C:03+2] setup_capi.sh.erb: don't manually install certmanager [puppet] - 10https://gerrit.wikimedia.org/r/1279474 (owner: 10Andrew Bogott) [20:03:45] I can help with deployments. 
[20:04:14] anzx: Can your changes go out all at once? [20:04:20] ok [20:04:34] rephrasing: Is it safe for yours to go out all at once [20:04:53] (03CR) 10RLazarus: [C:03+1] base::kernel: ban algif_aead [puppet] - 10https://gerrit.wikimedia.org/r/1279473 (owner: 10CDanis) [20:04:55] sure, no problem if it sync at once [20:04:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T419961)', diff saved to https://phabricator.wikimedia.org/P91997 and previous config saved to /var/cache/conftool/dbconfig/20260429-200455-fceratto.json [20:05:03] (03CR) 10CDanis: [C:03+2] base::kernel: ban algif_aead [puppet] - 10https://gerrit.wikimedia.org/r/1279473 (owner: 10CDanis) [20:05:21] (03CR) 10CDanis: [V:03+1 C:03+2] "100.0% (2431/2431) of nodes failed to execute command #1: 'lsmod | grep algif'" [puppet] - 10https://gerrit.wikimedia.org/r/1279473 (owner: 10CDanis) [20:05:25] Alright [20:05:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279406 (https://phabricator.wikimedia.org/T424843) (owner: 10Anzx) [20:05:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279422 (https://phabricator.wikimedia.org/T424555) (owner: 10Anzx) [20:05:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279404 (https://phabricator.wikimedia.org/T424863) (owner: 10Anzx) [20:08:54] (03PS4) 10Tiziano Fogli: logstash/filter: increase sockets-timeout for unit tests [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) [20:09:55] (03Merged) 10jenkins-bot: cswiki: lift IP cap for edithathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279406 (https://phabricator.wikimedia.org/T424843) (owner: 10Anzx) [20:09:59] (03Merged) 10jenkins-bot: 
arbcom_enwiki: update logo, icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279422 (https://phabricator.wikimedia.org/T424555) (owner: 10Anzx) [20:10:02] (03Merged) 10jenkins-bot: enwikiquote: enable UseSandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279404 (https://phabricator.wikimedia.org/T424863) (owner: 10Anzx) [20:10:32] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1279406|cswiki: lift IP cap for edithathon (T424843)]], [[gerrit:1279422|arbcom_enwiki: update logo, icon (T424555)]], [[gerrit:1279404|enwikiquote: enable UseSandboxLink (T424863)]] [20:10:39] T424843: Lift IP cap on 2026-05-14 for an editathon - cs.wikipedia - https://phabricator.wikimedia.org/T424843 [20:10:40] T424555: Requesting logo change for arbcom-en.wikipedia.org - https://phabricator.wikimedia.org/T424555 [20:10:40] T424863: Enable the SandboxLink extension on English Wikiquote - https://phabricator.wikimedia.org/T424863 [20:12:25] !log dancy@deploy1003 anzx, dancy: Backport for [[gerrit:1279406|cswiki: lift IP cap for edithathon (T424843)]], [[gerrit:1279422|arbcom_enwiki: update logo, icon (T424555)]], [[gerrit:1279404|enwikiquote: enable UseSandboxLink (T424863)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:12:30] checking [20:13:11] (03PS1) 10Phuedx: JS SDK: Remove compat deprecation warnings [extensions/TestKitchen] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279476 [20:13:28] dancy: looks good, ok to sync [20:13:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/TestKitchen] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279476 (owner: 10Phuedx) [20:13:37] OK [20:13:40] !log dancy@deploy1003 anzx, dancy: Continuing with deployment [20:13:59] ZhaoFJx: You'll be next [20:14:12] ominous [20:14:16] haha [20:14:21] lol [20:14:25] Deployment of doom [20:14:40] 😨 [20:15:26] dancy thanks [20:17:31] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279406|cswiki: lift IP cap for edithathon (T424843)]], [[gerrit:1279422|arbcom_enwiki: update logo, icon (T424555)]], [[gerrit:1279404|enwikiquote: enable UseSandboxLink (T424863)]] (duration: 06m 59s) [20:17:41] T424843: Lift IP cap on 2026-05-14 for an editathon - cs.wikipedia - https://phabricator.wikimedia.org/T424843 [20:17:41] T424555: Requesting logo change for arbcom-en.wikipedia.org - https://phabricator.wikimedia.org/T424555 [20:17:42] T424863: Enable the SandboxLink extension on English Wikiquote - https://phabricator.wikimedia.org/T424863 [20:17:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 101F616EMO) [20:18:36] (03PS5) 10Cwhite: logstash CI: increase sockets-timeout for e2e testing [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [20:19:34] (03Merged) 10jenkins-bot: arbcom_zhwiki: Enable SecurePoll without PII rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 
(https://phabricator.wikimedia.org/T419309) (owner: 101F616EMO) [20:19:56] (03PS1) 10VadymTS1: nlwiki: Modify autoconfirmed requirements for nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279477 (https://phabricator.wikimedia.org/T424898) [20:19:58] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1265959|arbcom_zhwiki: Enable SecurePoll without PII rights (T419309)]] [20:20:02] T419309: Enable SecurePoll extension on arbcom_zh - https://phabricator.wikimedia.org/T419309 [20:20:03] dancy: thanks for deploying, please run above to purge logos https://www.irccloud.com/pastebin/VvNfoWen/ [20:20:21] ok, stand by [20:20:49] Done. [20:20:59] thank you [20:21:21] (03CR) 10Cwhite: [C:03+2] logstash CI: increase sockets-timeout for e2e testing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1278501 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [20:21:51] !log dancy@deploy1003 1f616emo, dancy: Backport for [[gerrit:1265959|arbcom_zhwiki: Enable SecurePoll without PII rights (T419309)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:22:05] checking [20:23:18] dancy checked! [20:23:21] works great [20:23:25] OK. Moving on [20:23:29] !log dancy@deploy1003 1f616emo, dancy: Continuing with deployment [20:23:53] (03CR) 10Ahmon Dancy: [C:03+2] Abandon editor survey: UI updates [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279448 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius) [20:24:15] cmede: You're next in line. 
[20:24:21] thank you, less ominous [20:25:23] (03Merged) 10jenkins-bot: Abandon editor survey: UI updates [extensions/MobileFrontend] (wmf/1.46.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1279448 (https://phabricator.wikimedia.org/T422931) (owner: 10Medelius) [20:27:10] lol [20:27:14] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265959|arbcom_zhwiki: Enable SecurePoll without PII rights (T419309)]] (duration: 07m 16s) [20:27:19] T419309: Enable SecurePoll extension on arbcom_zh - https://phabricator.wikimedia.org/T419309 [20:27:23] cmede: OK for your two changes to go out together? [20:27:28] yep! [20:28:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277569 (https://phabricator.wikimedia.org/T423923) (owner: 10DLynch) [20:28:17] works great without mwdebug on [20:28:21] dancy thanks a lot [20:28:33] ZhaoFJx: You're welcome [20:29:00] o/ [20:29:06] i'm late, sorry. lost track of time. 
[20:29:26] (03PS1) 10Dzahn: admin: extend expiry_date for sarmbruster by 1 month [puppet] - 10https://gerrit.wikimedia.org/r/1279482 (https://phabricator.wikimedia.org/T424402) [20:29:29] No problem [20:32:23] (03Merged) 10jenkins-bot: Enable mobile editor abandonment survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277569 (https://phabricator.wikimedia.org/T423923) (owner: 10DLynch) [20:32:30] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Extend wmde/nda LDAP access for Sarmbruster - https://phabricator.wikimedia.org/T424402#11873384 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [20:32:50] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1277569|Enable mobile editor abandonment survey on enwiki (T423923)]], [[gerrit:1279448|Abandon editor survey: UI updates (T422931)]] [20:32:58] T423923: Deploy config change to start "Exit the editor" survey (v1.0) - https://phabricator.wikimedia.org/T423923 [20:32:58] T422931: Implement the "Exit the editor" survey - https://phabricator.wikimedia.org/T422931 [20:34:48] !log dancy@deploy1003 dancy, caro, kemayo: Backport for [[gerrit:1277569|Enable mobile editor abandonment survey on enwiki (T423923)]], [[gerrit:1279448|Abandon editor survey: UI updates (T422931)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:34:57] checking~ [20:35:58] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873424 (10Papaul) [20:38:29] looks good [20:40:48] !log dancy@deploy1003 dancy, caro, kemayo: Continuing with deployment [20:44:33] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277569|Enable mobile editor abandonment survey on enwiki (T423923)]], [[gerrit:1279448|Abandon editor survey: UI updates (T422931)]] (duration: 11m 43s) [20:44:39] T423923: Deploy config change to start "Exit the editor" survey (v1.0) - https://phabricator.wikimedia.org/T423923 [20:44:39] T422931: Implement the "Exit the editor" survey - https://phabricator.wikimedia.org/T422931 [20:45:07] phuedx: Do you want to handle your own deployment? [20:45:58] thank you! [20:46:08] cmede: You got it [20:46:22] dancy: Can do [20:46:26] VadymTS1: Are you lurking? [20:46:39] No [20:46:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279431 (https://phabricator.wikimedia.org/T424876) (owner: 10Sbisson) [20:47:22] VadymTS1: We can process your changes after phuedx is done. [20:47:51] Ok [20:48:20] (03Merged) 10jenkins-bot: Load TestKitchen earlier [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279431 (https://phabricator.wikimedia.org/T424876) (owner: 10Sbisson) [20:48:43] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1279431|Load TestKitchen earlier (T424876)]] [20:48:48] T424876: TestKitchen and other extensions loading order may influence group assignments - https://phabricator.wikimedia.org/T424876 [20:50:37] !log phuedx@deploy1003 phuedx, sbisson: Backport for [[gerrit:1279431|Load TestKitchen earlier (T424876)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:50:46] Checking [20:53:34] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873477 (10Papaul) [20:55:46] Quick browse of enwiki, dewiki, wikidata. Things appear to be working correctly and the logs look clean [20:56:26] !log phuedx@deploy1003 phuedx, sbisson: Continuing with deployment [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T2100) [21:00:16] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279431|Load TestKitchen earlier (T424876)]] (duration: 11m 33s) [21:00:21] T424876: TestKitchen and other extensions loading order may influence group assignments - https://phabricator.wikimedia.org/T424876 [21:00:35] dancy: Back to you [21:00:44] Thanks. VadymTS1 ready? [21:00:47] yes [21:01:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1) [21:01:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1) [21:04:10] (03Merged) 10jenkins-bot: enwikiversity: Enable the abuse filter block action on English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279430 (https://phabricator.wikimedia.org/T424053) (owner: 10VadymTS1) [21:04:13] (03Merged) 10jenkins-bot: enwikiversity: Add some user rights to the curator user group on English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1278363 (https://phabricator.wikimedia.org/T424445) (owner: 10VadymTS1) [21:04:40] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1279430|enwikiversity: Enable the abuse filter block action on English Wikiversity 
(T424053)]], [[gerrit:1278363|enwikiversity: Add some user rights to the curator user group on English Wikiversity (T424445)]] [21:04:46] T424053: Enable the abuse filter block action on English Wikiversity - https://phabricator.wikimedia.org/T424053 [21:04:47] T424445: Add some user rights to the curator user group on English Wikiversity - https://phabricator.wikimedia.org/T424445 [21:06:32] !log dancy@deploy1003 vadymts1, dancy: Backport for [[gerrit:1279430|enwikiversity: Enable the abuse filter block action on English Wikiversity (T424053)]], [[gerrit:1278363|enwikiversity: Add some user rights to the curator user group on English Wikiversity (T424445)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:08:59] VadymTS1: Are you running checks? [21:09:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [21:09:15] please wait a minute [21:09:19] ok [21:10:17] (03PS1) 10Aleksandar Mastilovic: Add x_trusted_request and x_wmf_ratelimit_class to webrequest live streams [puppet] - 10https://gerrit.wikimedia.org/r/1279489 (https://phabricator.wikimedia.org/T419736) [21:10:41] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [21:10:53] (03CR) 10CI reject: [V:04-1] Add x_trusted_request and x_wmf_ratelimit_class to webrequest live streams [puppet] - 10https://gerrit.wikimedia.org/r/1279489 (https://phabricator.wikimedia.org/T419736) (owner: 10Aleksandar Mastilovic) [21:13:44] I can't start the test because I recently switched to a Mac and I can't use mwdebug here, can you help me? 
[21:13:45] (03PS1) 10Bking: cumin: install gnutls-bin package [puppet] - 10https://gerrit.wikimedia.org/r/1279491 (https://phabricator.wikimedia.org/T424672) [21:13:50] (03PS2) 10Aleksandar Mastilovic: Add x_trusted_request and x_wmf_ratelimit_class to webrequest live streams [puppet] - 10https://gerrit.wikimedia.org/r/1279489 (https://phabricator.wikimedia.org/T419736) [21:14:06] VadymTS1: Sure. Let me know what you need me to do [21:16:06] I activated WikimediaDebug here: https://wikitech.wikimedia.org/wiki/Special:WikimediaDebug and idk what to do next [21:16:28] This is my first try to do this [21:17:58] Just to make sure I understand, are you saying you got the debug extension working? [21:19:09] yes I activated the Wikimedia debug cookie at this site [21:20:31] I was guided by this: https://wikitech.wikimedia.org/wiki/WikimediaDebug [21:20:43] Ok good. So what you do next is visit a URL that is affected by your changes, enable the extension, and select k8s-mwdebug in the pulldown (it's probably set that way), then reload the page. [21:20:55] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 117.42 ms [21:21:01] And verify that whatever effects you expected your changes to have are actually happening. [21:23:06] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1279465 (https://phabricator.wikimedia.org/T419736) (owner: 10CDanis) [21:28:04] My promise, I need to download the Chrome browser, Safari doesn't have this extension [21:28:25] dancy: (go ahead with ZhaoFJx ahead of me once this deploy completes, I need to be away from keyboard for a few minutes) [21:28:51] cscott: OK. I'll let you know when we're unstuck [21:30:21] Is anyone around who can continue to help VadymTS1? I need to get out of here. [21:30:31] If not, I will roll back and revert the two changes. [21:32:31] yes I can [21:32:46] oh good. Thanks Jeena. The deployment is still active in SpiderPig. [21:33:22] you're welcome! 
VadymTS1 let me know when to continue [21:34:02] ok, I have now downloaded Chrome and turned on the debugger extension button [21:34:18] 👍 [21:41:59] VadymTS1: Do you need any help? Or is it still downloading? [21:42:24] I don't know how to check the rights [21:42:39] *edits [21:43:37] What exactly do I need to do to confirm these changes? [21:44:11] let me see if I can find out [21:47:48] VadymTS1: can you go to Special:AbuseFilter and see if the block action is enabled? [21:48:04] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873578 (10Papaul) @ssingh important note: The public subnet mask for servers in rack 103.02.22 will be changing for /28 to /27 so will will have to manually... [21:50:14] Oh, I guess that only shows up if someone is blocked? I'm not sure [21:50:18] still looking [21:54:48] VadymTS1: if you have the correct rights, I think there should be a block user option on the Special:AbuseFilter page under actions [21:55:57] I don't have rights on Wikiversity; also, I have to look at the group rights on the special pages (for curator) [21:58:08] okay, let me try to check [21:58:14] Yes, all is correct [21:58:35] The curator group has the new rights [21:59:34] (03CR) 10AKhatun: [C:03+2] stream: move mw-page-html-feature-counts-change-enrich to v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279429 (https://phabricator.wikimedia.org/T424624) (owner: 10AKhatun) [22:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260429T2200) [22:01:17] VadymTS1: so you were able to check the curator rights? What about the abuse filter? 
unfortunately it doesn't look like I have permissions [22:01:37] (03Merged) 10jenkins-bot: stream: move mw-page-html-feature-counts-change-enrich to v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1279429 (https://phabricator.wikimedia.org/T424624) (owner: 10AKhatun) [22:02:58] in my opinion everything works and has appeared [22:03:07] okay thanks I will proceed [22:03:15] !log dancy@deploy1003 vadymts1, dancy: Continuing with deployment [22:03:25] FIRING: [16x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:05:20] FIRING: [3x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in 3d 15h 49m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [22:07:06] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279430|enwikiversity: Enable the abuse filter block action on English Wikiversity (T424053)]], [[gerrit:1278363|enwikiversity: Add some user rights to the curator user group on English Wikiversity (T424445)]] (duration: 62m 26s) [22:07:12] T424053: Enable the abuse filter block action on English Wikiversity - https://phabricator.wikimedia.org/T424053 [22:07:13] T424445: Add some user rights to the curator user group on English Wikiversity - https://phabricator.wikimedia.org/T424445 [22:07:52] cscott: ready for you [22:08:48] VadymTS1: all deployed, thanks for your patience [22:09:22] jeena: Thank you for the help, absolutely [22:09:50] yw! [22:10:47] Ok, I think I'm next? Let me check that everything else in the queue is merged now. [22:10:59] yeah I think you're the last one! [22:11:04] I can spiderpig my own patch, so I think you're off the hook jeena. [22:11:37] okay thanks! 
[22:14:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) (owner: 10Arlolra) [22:14:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279450 (https://phabricator.wikimedia.org/T423785) (owner: 10C. Scott Ananian) [22:18:21] (03Merged) 10jenkins-bot: Deploy PRV to 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1277770 (https://phabricator.wikimedia.org/T424590) (owner: 10Arlolra) [22:18:24] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to 6 language converter wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279450 (https://phabricator.wikimedia.org/T423785) (owner: 10C. Scott Ananian) [22:18:48] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1277770|Deploy PRV to 12 wikis (T424590)]], [[gerrit:1279450|Deploy Parsoid Read Views to 6 language converter wikis (T423785)]] [22:18:54] T424590: Parsoid Read Views to deploy ~2026-04-30 - https://phabricator.wikimedia.org/T424590 [22:18:55] T423785: Parsoid Read Views to deploy ~2026-04-20 (Language Converter wikis) - https://phabricator.wikimedia.org/T423785 [22:20:42] !log cscott@deploy1003 arlolra, cscott: Backport for [[gerrit:1277770|Deploy PRV to 12 wikis (T424590)]], [[gerrit:1279450|Deploy Parsoid Read Views to 6 language converter wikis (T423785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:21:26] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873638 (10Papaul) @RobH Remote hands instructions are ready @ https://docs.google.com/document/d/1EW6hxHCQjXPy1PXQWluwOTnCl_AHddI34iOYHdJuvek/edit?tab=t.0 Pl... 
[22:35:04] !log cscott@deploy1003 arlolra, cscott: Continuing with deployment [22:39:51] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1277770|Deploy PRV to 12 wikis (T424590)]], [[gerrit:1279450|Deploy Parsoid Read Views to 6 language converter wikis (T423785)]] (duration: 21m 03s) [22:39:57] T424590: Parsoid Read Views to deploy ~2026-04-30 - https://phabricator.wikimedia.org/T424590 [22:39:58] T423785: Parsoid Read Views to deploy ~2026-04-20 (Language Converter wikis) - https://phabricator.wikimedia.org/T423785 [22:41:20] ok, one last patch to go (whew!) [22:41:28] this is the exciting one [22:42:24] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/python-evaluator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:43:11] 06SRE, 10LDAP-Access-Requests: Requesting logstash-access LDAP group access for HakanIST - https://phabricator.wikimedia.org/T424812#11873714 (10KFrancis) The NDA has been sent for signatures. I'll confirm when it's complete. Thanks! [22:45:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279452 (https://phabricator.wikimedia.org/T424880) (owner: 10C. Scott Ananian) [22:46:27] cscott: Congrats! [22:46:34] Jeena: Thanks again! [22:46:57] np! [22:50:01] (03Merged) 10jenkins-bot: Enable Parsoid Read Views for 20% of enwiki mobile web traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1279452 (https://phabricator.wikimedia.org/T424880) (owner: 10C. 
Scott Ananian) [22:50:27] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1279452|Enable Parsoid Read Views for 20% of enwiki mobile web traffic (T424880)]] [22:50:32] T424880: Parsoid Read Views to deploy 2026-04-29-2026-04-30 (enwiki mobile web) - https://phabricator.wikimedia.org/T424880 [22:52:22] !log cscott@deploy1003 cscott: Backport for [[gerrit:1279452|Enable Parsoid Read Views for 20% of enwiki mobile web traffic (T424880)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:59:54] !log cscott@deploy1003 cscott: Continuing with deployment [23:04:30] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1279452|Enable Parsoid Read Views for 20% of enwiki mobile web traffic (T424880)]] (duration: 14m 03s) [23:04:35] T424880: Parsoid Read Views to deploy 2026-04-29-2026-04-30 (enwiki mobile web) - https://phabricator.wikimedia.org/T424880 [23:11:42] (03PS1) 10Dduvall: zuul: Upgrade to Zuul 14.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/1279500 (https://phabricator.wikimedia.org/T424879) [23:13:31] (03PS2) 10Dduvall: zuul: Upgrade to Zuul 14.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/1279500 (https://phabricator.wikimedia.org/T424879) [23:15:42] (03CR) 10ArielGlenn: "Generally seems ok, a few questions left inline" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1272765 (https://phabricator.wikimedia.org/T413448) (owner: 10Daniel Kinzler) [23:16:26] i'm done, and Parsoid Read Views is live on enwiki mobile web now (yay) for 20% of pages. 
[23:18:09] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:18:24] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:18:27] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:18:29] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:18:31] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-ctrl1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:18:33] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host tools-k8s-worker1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:18:38] (03PS1) 10Papaul: Add BGP peering from asw1-23 to core routers and mr1 [homer/public] - 10https://gerrit.wikimedia.org/r/1279501 (https://phabricator.wikimedia.org/T408892) [23:19:49] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:19:49] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:19:50] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-ctrl1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:19:52] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host tools-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:19:58] (03CR) 10CI reject: [V:04-1] Add BGP peering from asw1-23 to core routers and mr1 [homer/public] - 10https://gerrit.wikimedia.org/r/1279501 
(https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [23:24:11] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [23:25:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:27:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host tools-k8s-worker1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:27:38] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1375.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:28:01] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker1375 to eqiad - jclark@cumin1003" [23:28:04] (03PS2) 10Papaul: Add BGP peering from asw1-23 to core routers and mr1 [homer/public] - 10https://gerrit.wikimedia.org/r/1279501 (https://phabricator.wikimedia.org/T408892) [23:28:06] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker1375 to eqiad - jclark@cumin1003" [23:28:06] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:28:12] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1376.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:28:22] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1377.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:28:28] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1375.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:28:44] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1377.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:29:00] !log 
jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:29:10] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:29:34] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1379.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:30:03] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1380.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:30:24] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1380.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:31:36] (03CR) 10Papaul: [C:03+2] Add BGP peering from asw1-23 to core routers and mr1 [homer/public] - 10https://gerrit.wikimedia.org/r/1279501 (https://phabricator.wikimedia.org/T408892) (owner: 10Papaul) [23:33:04] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1375.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:33:32] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:34:15] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1377.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:34:29] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1380.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:35:49] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1378.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:36:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1376.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:36:17] !log jclark@cumin1003 
START - Cookbook sre.hosts.provision for host wikikube-worker1381.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:36:20] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1377.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:36:39] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1382.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:36:40] (03CR) 10Cwhite: [C:04-1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1275800 (https://phabricator.wikimedia.org/T423986) (owner: 10Tiziano Fogli) [23:37:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1379.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:38:02] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1384.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:39:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1375.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:39:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1279502 [23:39:58] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1279502 (owner: 10TrainBranchBot) [23:41:13] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1380.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:41:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:42:44] !log jclark@cumin1003 START - 
Cookbook sre.hosts.reimage for host wikikube-worker1375.eqiad.wmnet with OS trixie [23:42:46] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1376.eqiad.wmnet with OS trixie [23:42:49] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873863 (10Papaul) [23:42:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar... [23:42:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar... [23:43:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1381.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:43:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1382.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:43:31] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1379.eqiad.wmnet with OS trixie [23:43:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar... 
[23:44:18] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1381.eqiad.wmnet with OS trixie [23:44:20] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1380.eqiad.wmnet with OS trixie [23:44:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873868 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar... [23:44:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873869 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar... [23:44:41] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1384.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [23:45:23] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1382.eqiad.wmnet with OS trixie [23:45:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13{75-84} - https://phabricator.wikimedia.org/T423719#11873871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclar... 
[23:51:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1279502 (owner: 10TrainBranchBot) [23:53:14] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11873875 (10Papaul) [23:54:43] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1376.eqiad.wmnet with reason: host reimage [23:54:50] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1375.eqiad.wmnet with reason: host reimage [23:55:28] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1379.eqiad.wmnet with reason: host reimage [23:56:04] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1381.eqiad.wmnet with reason: host reimage [23:56:08] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1380.eqiad.wmnet with reason: host reimage [23:57:13] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1382.eqiad.wmnet with reason: host reimage [23:58:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1376.eqiad.wmnet with reason: host reimage