[00:00:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:01:36] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:01:38] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:03:22] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp5024.* [00:16:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:27] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS trixie [00:30:59] !log jasmine@cumin2002 START - Cookbook sre.hosts.move-vlan for host kafka-main1009 [00:34:02] jasmine@cumin2002 reimage (PID 1891761) is awaiting input [00:34:08] PROBLEM - Host db1262 #page is DOWN: PING CRITICAL - Packet loss = 100% [00:34:21] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp5020.* [00:34:28] (03PS1) 10Jasmine: hieradata/common.yaml: add new IPs for kafka-main1009 following vlan migration [puppet] - 10https://gerrit.wikimedia.org/r/1300281 (https://phabricator.wikimedia.org/T427088) [00:35:47] (03CR) 10Jasmine: [C:03+2] kafka-main1009: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1285477 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [00:35:55] (03CR) 10Jasmine: [C:03+2] 
hieradata/common.yaml: add new IPs for kafka-main1009 following vlan migration [puppet] - 10https://gerrit.wikimedia.org/r/1300281 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [00:36:21] !log jasmine@cumin2002 START - Cookbook sre.dns.netbox [00:39:38] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:39:51] !log cdanis@cumin1003 dbctl commit (dc=all): 'depool db1262', diff saved to https://phabricator.wikimedia.org/P94032 and previous config saved to /var/cache/conftool/dbconfig/20260611-003950-cdanis.json [00:40:54] !log jasmine@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-main1009 - jasmine@cumin2002" [00:40:59] !log jasmine@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kafka-main1009 - jasmine@cumin2002" [00:41:00] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:41:00] !log jasmine@cumin2002 START - Cookbook sre.dns.wipe-cache kafka-main1009.eqiad.wmnet 37.48.64.10.in-addr.arpa 7.3.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [00:41:04] !log jasmine@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kafka-main1009.eqiad.wmnet 37.48.64.10.in-addr.arpa 7.3.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [00:41:05] !log jasmine@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-main1009 [00:41:31] !log jasmine@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-main1009 [00:41:31] !log 
jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kafka-main1009 [00:49:30] FIRING: Processor usage over 85%: Alert for device ssw1-d1-eqiad.mgmt.eqiad.wmnet - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [00:49:41] 10ops-eqiad, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12007625 (10colewhite) ` SeqNumber = 574 Message ID = CPU0000 Category = System AgentID = iDRAC Severity = Information Timestamp = 2026-06-11 00:34:15 Message... [00:49:45] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [00:50:44] FIRING: CirrusStreamingUpdaterFlinkNoRegisteredTask: cirrus-streaming-updater job in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTas [00:51:00] !log jasmine@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [00:51:08] !log jasmine@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [00:51:16] !log jasmine@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [00:51:24] !log jasmine@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [00:51:32] !log jasmine@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [00:51:38] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... 
[00:51:39] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... [00:51:39] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [00:51:40] !log jasmine@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [00:51:48] !log jasmine@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [00:51:56] !log jasmine@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [00:52:04] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [00:52:11] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [00:52:18] !log jasmine@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [00:52:26] !log jasmine@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [00:52:34] !log jasmine@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [00:52:42] !log jasmine@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [00:52:51] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [00:52:59] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[00:53:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [00:53:06] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [00:53:14] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [00:53:21] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [00:53:28] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [00:53:35] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [00:53:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [00:54:27] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[00:54:30] RESOLVED: Processor usage over 85%: Device ssw1-d1-eqiad.mgmt.eqiad.wmnet recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [00:55:28] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827#12007635 (10RLazarus) [00:56:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [00:58:43] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage [00:58:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [01:01:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [01:02:30] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage [01:02:44] FIRING: [3x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [01:02:45] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [01:04:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827#12007655 (10RLazarus) @EChukwukere-WMF: - You might have already seen this, but just to be sure -- [[ https://wikitech.wikimedia.org/wiki/Test_Kitchen/GrowthBook_user_g... [01:05:11] !log jasmine@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [01:05:33] !log jasmine@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [01:05:41] !log jasmine@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [01:06:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Bliviero - https://phabricator.wikimedia.org/T428815#12007657 (10RLazarus) [01:06:22] !log jasmine@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [01:06:31] !log jasmine@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [01:06:59] !log jasmine@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [01:07:07] !log jasmine@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [01:07:55] !log jasmine@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [01:08:03] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [01:08:32] !log jasmine@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [01:08:39] !log jasmine@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [01:09:29] !log jasmine@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[01:09:36] !log jasmine@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [01:10:26] !log jasmine@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [01:10:35] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [01:10:48] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1300282 [01:10:48] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1300282 (owner: TrainBranchBot) [01:11:01] !log jasmine@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [01:11:09] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [01:11:12] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Bliviero - https://phabricator.wikimedia.org/T428815#12007660 (RLazarus) Thanks @BLiviero-WMF! If you only need access to private data in Turnilo for now, we'll add you to analytics-privatedata-users, but we won't use your SSH... [01:11:58] !log jasmine@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [01:12:04] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [01:12:34] !log jasmine@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [01:12:41] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [01:12:49] !log jasmine@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[01:13:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [01:18:43] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1009.eqiad.wmnet with OS trixie [01:18:55] !log jasmine@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [01:19:16] !log jasmine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [01:23:16] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1300282 (owner: 10TrainBranchBot) [01:23:43] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2046.* [01:32:27] (03PS1) 10Jasmine: kafka-main: clean up host level overrides for kafka-main jdk 21 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1300287 (https://phabricator.wikimedia.org/T427088) [01:42:33] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:00:51] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:33] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:35] !log mwpresync@deploy1003 Finished 
scap build-images: Publishing wmf/next image (duration: 06m 44s) [02:10:15] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#12007957 (10Colinstu) Will anything need to be done manually on existing pages experiencing this issue? Or once the source code issue is resolved, suddenly all of broken ones will gen... [02:13:31] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [02:22:45] RESOLVED: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:32:45] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [02:32:50] CirrusSearch consumer-search@codfw is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=codfw&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:34:38] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:37:45] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [02:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [03:15:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:17:27] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [03:20:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:12:54] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [04:16:40] FIRING: SystemdUnitFailed: 
wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:17:45] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [04:42:45] RESOLVED: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [04:48:08] (03PS1) 10VadymTS1: Add alias namespace for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300301 (https://phabricator.wikimedia.org/T428619) [04:49:59] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [04:50:45] FIRING: CirrusStreamingUpdaterFlinkNoRegisteredTask: cirrus-streaming-updater job in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTas [04:51:39] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... 
[04:51:39] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... [04:51:39] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [04:55:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12008127 (10Marostegui) Thank you @CDanis and @colewhite - we will take it from here with #ops-eqiad team. [04:57:45] FIRING: [2x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-search@codfw is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [05:00:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300301 (https://phabricator.wikimedia.org/T428619) (owner: 10VadymTS1) [05:01:04] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#12008134 (10Marostegui) >>! In T427535#12004846, @VRiley-WMF wrote: > Hey @Marostegui, as it turns out, I am not able to find a compatible processor for this unit. Should we commence with the removal of t... 
[05:01:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [05:02:45] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [05:02:45] FIRING: [3x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [05:05:01] (PS1) Marostegui: db1262: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1300302 [05:06:15] (CR) Marostegui: [C:+2] db1262: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1300302 (owner: Marostegui) [05:14:31] (CR) Marostegui: [C:+1] "This is okay, but we also have to add them to the proxies. I will do that anyway. Let me know when you want me to deploy this." 
[puppet] - https://gerrit.wikimedia.org/r/1300156 (https://phabricator.wikimedia.org/T423727) (owner: AOkoth) [05:16:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [05:16:46] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2045: Upgrading es2045.codfw.wmnet [05:17:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2045: Upgrading es2045.codfw.wmnet [05:17:51] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2045.codfw.wmnet with OS trixie [05:18:09] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade gitlab [05:27:51] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade gitlab [05:28:45] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade gitlab [05:37:54] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade gitlab [05:42:33] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:55:15] ops-eqiad, SRE, DBA, DC-Ops: db1237 not rebooting - https://phabricator.wikimedia.org/T428542#12008180 (Marostegui) Thanks John - these things can take weeks to repeat, but I will re-open if that is the case. 
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T0600) [06:00:04] marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T0600). [06:06:09] federico3: are you doing a switchover today? [06:07:37] (PS1) Marostegui: es2042: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1300561 [06:11:06] (CR) Marostegui: [C:+2] es2042: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1300561 (owner: Marostegui) [06:19:53] (CR) Effie Mouzeli: [C:+2] mediawiki_common: update IP for rdb1014 [deployment-charts] - https://gerrit.wikimedia.org/r/1300114 (https://phabricator.wikimedia.org/T421711) (owner: Effie Mouzeli) [06:22:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s1 T426083 [06:22:17] T426083: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T426083 [06:22:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db1184 with weight 0 T426083', diff saved to https://phabricator.wikimedia.org/P94035 and previous config saved to /var/cache/conftool/dbconfig/20260611-062224-fceratto.json [06:24:21] (Merged) jenkins-bot: mediawiki_common: update IP for rdb1014 [deployment-charts] - https://gerrit.wikimedia.org/r/1300114 (https://phabricator.wikimedia.org/T421711) (owner: Effie Mouzeli) [06:25:04] (CR) Federico Ceratto: [C:+2] mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1286405 (https://phabricator.wikimedia.org/T426083) (owner: Gerrit maintenance bot) [06:29:35] !log Starting s1 eqiad failover from db1163 to db1184 - T426083 [06:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:39] T426083: 
Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T426083 [06:29:49] !log fceratto@cumin1003 START - Cookbook sre.mysql.global-read-only [06:30:06] !log fceratto@cumin1003 Dbctl change: Setting sections s1 as read-only for T426083: 'Maintenance until 06:15 UTC' [06:30:13] !log fceratto@cumin1003 MariaDB change: Setting sections s1 as read-only for T426083: 'Maintenance until 06:15 UTC' [06:30:18] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.global-read-only (exit_code=0) [06:31:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T426083', diff saved to https://phabricator.wikimedia.org/P94037 and previous config saved to /var/cache/conftool/dbconfig/20260611-063100-fceratto.json [06:31:58] !log fceratto@cumin1003 START - Cookbook sre.mysql.global-read-only [06:32:01] !log fceratto@cumin1003 MariaDB change: Setting sections s1 as read-write for T426083: 'Maintenance until 06:15 UTC' [06:32:08] !log fceratto@cumin1003 Dbctl change: Setting sections s1 as read-write for T426083: 'Maintenance until 06:15 UTC' [06:32:12] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.global-read-only (exit_code=0) [06:32:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T426083', diff saved to https://phabricator.wikimedia.org/P94039 and previous config saved to /var/cache/conftool/dbconfig/20260611-063251-fceratto.json [06:33:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db1184 to s1 primary and set section read-write T426083', diff saved to https://phabricator.wikimedia.org/P94040 and previous config saved to /var/cache/conftool/dbconfig/20260611-063323-fceratto.json [06:33:34] !log fceratto@cumin1003 START - Cookbook sre.mysql.global-read-only [06:33:40] !log fceratto@cumin1003 MariaDB change: Setting sections s1 as read-write for T426083: 'Maintenance until 06:15 UTC' [06:33:49] !log fceratto@cumin1003 END (PASS) - Cookbook 
sre.mysql.global-read-only (exit_code=0) [06:34:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:34:38] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:38:09] (03PS2) 10Federico Ceratto: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286406 (https://phabricator.wikimedia.org/T426083) (owner: 10Gerrit maintenance bot) [06:38:22] (03CR) 10Federico Ceratto: [C:03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286406 (https://phabricator.wikimedia.org/T426083) (owner: 10Gerrit maintenance bot) [06:39:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:40:28] !log fceratto@dns1005 START - running authdns-update [06:42:06] !log fceratto@dns1005 END - running authdns-update [06:43:20] !log fceratto@cumin1003 dbctl 
commit (dc=all): 'Depool db1163 T426083', diff saved to https://phabricator.wikimedia.org/P94041 and previous config saved to /var/cache/conftool/dbconfig/20260611-064319-fceratto.json [06:43:25] T426083: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T426083 [06:44:06] 06SRE, 10observability, 06SRE Observability: Alerts showing "AlertLintProblem" - https://phabricator.wikimedia.org/T427469#12008316 (10Marostegui) Thanks for clarifying this. This is definitely very weird, we've not had any changes to: `instance=~"(db|an-redacteddb|clouddb)[12].*"}.` as in, no new hostname... [06:44:24] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1163: Repooling [06:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [06:45:25] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [06:45:28] marostegui@cumin1003 major-upgrade (PID 3099244) is awaiting input [06:50:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es2042', diff saved to https://phabricator.wikimedia.org/P94043 and previous config saved to /var/cache/conftool/dbconfig/20260611-065027-marostegui.json [06:50:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool es2042', diff saved to https://phabricator.wikimedia.org/P94044 and previous config saved to /var/cache/conftool/dbconfig/20260611-065049-marostegui.json [06:51:27] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2045.codfw.wmnet with OS trixie [06:52:25] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are 
FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [06:52:43] (03PS1) 10Arnaudb: gitlab: bump version to 18.11 [puppet] - 10https://gerrit.wikimedia.org/r/1300570 (https://phabricator.wikimedia.org/T428842) [06:53:13] 06SRE, 06DBA, 06Infrastructure-Foundations: Reimage failure when partitioning and keeping /srv - https://phabricator.wikimedia.org/T428852 (10Marostegui) 03NEW [06:53:21] (03CR) 10Jelto: [C:03+1] "lgtm thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1300570 (https://phabricator.wikimedia.org/T428842) (owner: 10Arnaudb) [06:53:57] (03CR) 10Arnaudb: [C:03+2] gitlab: bump version to 18.11 [puppet] - 10https://gerrit.wikimedia.org/r/1300570 (https://phabricator.wikimedia.org/T428842) (owner: 10Arnaudb) [06:56:02] 06SRE, 06DBA, 06Infrastructure-Foundations: Reimage failure when partitioning and keeping /srv - https://phabricator.wikimedia.org/T428852#12008372 (10Marostegui) During the boot I've also seen: ` net0: 04:32:01:db:3c:10 using undionly on 0000:4b:00.0 (Ethernet) [open] [Link:up, TX:0 TXE:1 RX:0 RXE:0] .... [07:00:05] Amir1, urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T0700). [07:00:05] codders: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:09] o/ [07:02:36] morning awight ! [07:02:41] I'll go ahead and deploy my patch [07:03:02] g*dspeed [07:03:08] (03PS1) 10Kevin Bazira: ml: add vLLM 0.22 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1300573 (https://phabricator.wikimedia.org/T428577) [07:04:07] (03CR) 10Muehlenhoff: [C:03+2] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1300132 (owner: 10Muehlenhoff) [07:04:21] hmm. 
has dependencies not present on the live branch [07:05:15] (03CR) 10Effie Mouzeli: [C:03+1] kafka-main: clean up host level overrides for kafka-main jdk 21 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1300287 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [07:06:36] the dependency warning there is not super relevant - the change that this config depends on *is* merged to master, and beta is running master [07:06:56] awight: If I update the patch to drop the dependency, can you +1 it again? [07:06:59] codders: new messages aren't picked up without the full l10n cache rebuild which generally happens only with the first weekly train deployment [07:07:07] +1 [07:07:42] (03PS3) 10Arthur taylor: WikiProjects links - add statement-based link to project on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299451 (https://phabricator.wikimedia.org/T423144) [07:07:57] looks like it retained its +1 [07:08:14] I'll just wait for the tests and then I'll try again [07:08:40] codders: what is "on beta" about? this seems to be a production wikidatawiki change. [07:08:57] for 'labs' [07:09:11] of course, thanks for the correction. [07:09:31] (03CR) 10Effie Mouzeli: [C:03+1] "Were added to hieradata/role/codfw/kafka/main.yaml in I48d11e17e19252d27f7a47aae983ed67e06137db, so gtg" [puppet] - 10https://gerrit.wikimedia.org/r/1300288 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [07:09:35] okay. 
trying again [07:09:43] kk yes for beta-only deployment go crazy, we don't need to be fancy about the i18n message [07:10:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arthurtaylor@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299451 (https://phabricator.wikimedia.org/T423144) (owner: 10Arthur taylor) [07:11:11] (03Merged) 10jenkins-bot: WikiProjects links - add statement-based link to project on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299451 (https://phabricator.wikimedia.org/T423144) (owner: 10Arthur taylor) [07:12:51] done - thanks! [07:13:09] (03CR) 10Awight: [C:03+1] "PS 4: removing the depends-on is fine because this is a beta-only change so mostly harmless to deploy without the message." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299451 (https://phabricator.wikimedia.org/T423144) (owner: 10Arthur taylor) [07:19:11] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade gitlab [07:20:32] (03CR) 10Jelto: "Build on `build2002` with `--git-dist=trixie` works and produces the helm3 binary." 
[debs/helm3] - 10https://gerrit.wikimedia.org/r/1300145 (https://phabricator.wikimedia.org/T427403) (owner: 10Jelto) [07:22:07] PROBLEM - MariaDB read only s1 on db1163 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.11.14-MariaDB-log, Uptime 19443859s, event_scheduler: True, 5174.47 QPS, connection latency: 0.025082s, query latency: 0.000620s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [07:26:48] (03CR) 10Slyngshede: [C:03+1] Update account meta data for okryva [puppet] - 10https://gerrit.wikimedia.org/r/1300113 (owner: 10Muehlenhoff) [07:29:54] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1163: Repooling [07:30:17] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade gitlab [07:31:08] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:31:09] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade gitlab [07:31:28] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1219: Upgrading db1219.eqiad.wmnet [07:31:38] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:31:59] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2176: Upgrading db2176.codfw.wmnet [07:32:09] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1219: Upgrading db1219.eqiad.wmnet [07:32:20] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2176: Upgrading db2176.codfw.wmnet [07:35:02] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host rdb1013.eqiad.wmnet with OS trixie [07:35:09] cwilliams@cumin1003 major-upgrade (PID 3158135) is awaiting input [07:35:19] (03CR) 10Muehlenhoff: [C:03+2] Update account meta data for okryva [puppet] - 10https://gerrit.wikimedia.org/r/1300113 (owner: 
10Muehlenhoff) [07:35:20] cwilliams@cumin1003 major-upgrade (PID 3158477) is awaiting input [07:41:58] (03CR) 10Dpogorzelski: [C:03+1] ml: add vLLM 0.22 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1300573 (https://phabricator.wikimedia.org/T428577) (owner: 10Kevin Bazira) [07:42:18] (03PS1) 10Effie Mouzeli: ProductionServices.php: switch filebackend.php back to rdb1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300580 (https://phabricator.wikimedia.org/T291916) [07:42:26] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade gitlab [07:42:31] FIRING: [5x] RedisReplicaDown: Redis replica down rdb1014:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [07:43:09] !log imported Jenkins 2.541.3 for thirdparty/ci (Bullseye) and thirdparty/jenkins (Bookworm, Trixie) [07:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:40] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1219.eqiad.wmnet with OS trixie [07:44:43] marostegui@cumin1003 reimage (PID 3128619) is awaiting input [07:45:17] (03CR) 10Dpogorzelski: [C:03+2] ml: add vLLM 0.22 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1300573 (https://phabricator.wikimedia.org/T428577) (owner: 10Kevin Bazira) [07:45:21] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] ml: add vLLM 0.22 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1300573 (https://phabricator.wikimedia.org/T428577) (owner: 10Kevin Bazira) [07:46:03] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2176.codfw.wmnet with OS trixie [07:47:31] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:47:37] !log dcausse@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [07:49:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [07:49:49] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:49:52] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:50:17] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1013.eqiad.wmnet with reason: host reimage [07:50:38] (03Abandoned) 10Marostegui: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1286407 (https://phabricator.wikimedia.org/T426084) (owner: 10Gerrit maintenance bot) [07:50:48] (03Abandoned) 10Marostegui: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1286408 (https://phabricator.wikimedia.org/T426084) (owner: 10Gerrit maintenance bot) [07:54:32] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1013.eqiad.wmnet with reason: host reimage [07:56:48] !log install mariadb 10.11.17 on pc1 T427345 [07:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:53] T427345: Compile and package MariaDB 10.11.17 - https://phabricator.wikimedia.org/T427345 [07:58:44] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1219.eqiad.wmnet with reason: host reimage [07:59:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [08:00:05] dduvall and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T0800). 
[08:03:49] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc1021: Migration to 10.11.17 T427345 [08:03:49] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [08:03:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [08:03:55] T427345: Compile and package MariaDB 10.11.17 - https://phabricator.wikimedia.org/T427345 [08:03:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1021: Migration to 10.11.17 T427345 [08:04:30] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2176.codfw.wmnet with reason: host reimage [08:05:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc1021: Migration to 10.11.17 T427345 [08:05:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1021: Migration to 10.11.17 T427345 [08:05:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin02 and group 01 [08:05:19] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1219.eqiad.wmnet with reason: host reimage [08:05:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2021.codfw.wmnet,pc1021.eqiad.wmnet with reason: upgrade [08:06:46] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:06:50] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:06:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti5004.eqsin.wmnet to cluster eqsin02 and group 01 [08:07:31] RESOLVED: [5x] RedisReplicaDown: Redis replica down rdb1014:16378 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [08:08:55] !log jiji@cumin1003 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host rdb1013.eqiad.wmnet with OS trixie [08:09:34] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2176.codfw.wmnet with reason: host reimage [08:11:21] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:11:24] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:14:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [08:14:50] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [08:14:56] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:15:00] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:15:45] RESOLVED: CirrusStreamingUpdaterFlinkNoRegisteredTask: cirrus-streaming-updater job in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredT [08:16:39] RESOLVED: 
CirrusSearchUpdaterKafkaMessagesInTooLow: ... [08:16:39] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... [08:16:39] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [08:17:00] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:17:01] !log installing PHP 8.2 security updates [08:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:05] RESOLVED: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [08:17:45] RESOLVED: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [08:17:45] RESOLVED: [3x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [08:19:44] 
RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [08:22:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1219.eqiad.wmnet with OS trixie [08:22:18] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-test # T425377 populating ttmserver index on test cluster to estimate time required for the release (dblist: https://phabricator.wikimedia.org/P94051) [08:22:23] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [08:23:09] (03CR) 10Elukey: [C:03+2] redfish: improve add_account with AccountTypes (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [08:23:50] (03CR) 10Jelto: [C:03+1] "lgtm, Could the host_aliases be added for the replicas `gitlab1003` and `gitlab2002` too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1298771 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [08:23:51] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@6200ab1] (releasing): Testing upgrade for T428823 [08:24:55] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@6200ab1] (releasing): Testing upgrade for T428823 (duration: 01m 17s) [08:25:01] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-test # T425377 populating ttmserver index on test cluster to estimate time required for the release (dblist: https://phabricator.wikimedia.org/P94052) [08:25:02] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc1021: Migration to 10.11.17 [08:25:02] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [08:25:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [08:25:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1021: Migration to 10.11.17 [08:27:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2176.codfw.wmnet with OS trixie [08:29:40] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@6200ab1] (releasing): T428823 [08:30:50] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@6200ab1] (releasing): T428823 (duration: 01m 18s) [08:33:57] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-test # T425377 populating ttmserver index on test cluster to estimate time required for the release (dblist: https://phabricator.wikimedia.org/P94053) [08:34:02] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [08:34:28] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool 
pool db1219: Migration of db1219.eqiad.wmnet completed [08:34:43] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver eqiad-test # T425377 populating ttmserver index on test cluster to estimate time required for the release (dblist: https://phabricator.wikimedia.org/P94055) [08:39:53] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2176: Migration of db2176.codfw.wmnet completed [08:40:07] jouncebot: nowandnext [08:40:07] For the next 1 hour(s) and 19 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T0800) [08:40:07] In 1 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1000) [08:40:44] 06SRE, 06DBA, 06Infrastructure-Foundations: Reimage failure when partitioning and keeping /srv - https://phabricator.wikimedia.org/T428852#12008694 (10Marostegui) Some more debugging: ` ~ # cat /etc/fstab devpts /dev/pts devpts defaults 0 0 tmpfs /run tmpfs... [08:43:40] (03PS1) 10Muehlenhoff: ganeti5006: set up custom bgp neighbors for private1-604-eqsin vlan [puppet] - 10https://gerrit.wikimedia.org/r/1300702 (https://phabricator.wikimedia.org/T428229) [08:43:45] 06SRE, 10observability, 06SRE Observability: Alerts showing "AlertLintProblem" - https://phabricator.wikimedia.org/T427469#12008708 (10tappof) I took a brief look at the code, and I don't think (although I may be wrong) that the mysql_slave_status_using_gtid metric is really reliable. The scrape function al... 
[08:44:50] (03CR) 10Ayounsi: [C:03+1] ganeti5006: set up custom bgp neighbors for private1-604-eqsin vlan [puppet] - 10https://gerrit.wikimedia.org/r/1300702 (https://phabricator.wikimedia.org/T428229) (owner: 10Muehlenhoff) [08:54:45] (03PS18) 10Daniel Kinzler: rest gateway: implement cost-based rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) [08:54:56] (03CR) 10Daniel Kinzler: rest gateway: implement cost-based rate limits (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [08:55:20] (03PS3) 10Daniel Kinzler: rest-gateway: cost limits for action=parse (shadow mode) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) [08:57:32] FIRING: SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [08:57:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2045.codfw.wmnet with OS bookworm [08:58:57] 06SRE, 10observability, 06SRE Observability: Alerts showing "AlertLintProblem" - MySQLReplicaNotUsingGTID - https://phabricator.wikimedia.org/T427469#12008788 (10tappof) [08:59:20] (03PS4) 10Daniel Kinzler: rest-gateway: cost limits for action=parse (shadow mode) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) [08:59:25] (03CR) 10Daniel Kinzler: rest-gateway: cost limits for action=parse (shadow mode) (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) (owner: 10Daniel Kinzler) [08:59:30] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.8.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1300706 [09:00:48] (03PS2) 10Slyngshede: C:dumps::web::xmldumps block generic user-agents [puppet] - 
10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) [09:01:22] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Bliviero - https://phabricator.wikimedia.org/T428815#12008789 (10mark) Hey Reuven, thank you - it's not silly, it's a standard procedure. Approved! [09:01:55] (03PS6) 10Arnaudb: gitlab: support extra ssh host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/1298771 (https://phabricator.wikimedia.org/T425441) [09:02:19] (03PS2) 10Arnaudb: gitlab: advertise gitlab-ssh.wikimedia.org in UI clone URLs [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) [09:02:32] FIRING: [6x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:02:48] (03CR) 10Arnaudb: "good idea, done with PS6!" [puppet] - 10https://gerrit.wikimedia.org/r/1298771 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [09:04:05] (03CR) 10Slyngshede: "The map looks much nicer. I tweaked the sequence of the matches a bit, so allow users have a generic user-agent, but with an email or URL." 
[puppet] - 10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) (owner: 10Slyngshede) [09:04:26] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1298771 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [09:06:11] (03PS7) 10Daniel Kinzler: rest gateway: per-policy upfront cost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296598 (https://phabricator.wikimedia.org/T412586) [09:07:32] FIRING: [7x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:08:43] (03PS2) 10Effie Mouzeli: ProductionServices.php: switch filebackend.php back to rdb1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300580 (https://phabricator.wikimedia.org/T291916) [09:09:05] jouncebot: NotASpy [09:09:07] gosh [09:09:11] jouncebot: now [09:09:11] For the next 0 hour(s) and 50 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T0800) [09:09:46] (03PS3) 10Slyngshede: C:dumps::web::xmldumps block generic user-agents [puppet] - 10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) [09:10:13] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.8.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1300706 (owner: 10Elukey) [09:10:51] 06SRE, 06DBA, 06Infrastructure-Foundations: Reimage failure when partitioning and keeping /srv - https://phabricator.wikimedia.org/T428852#12008823 (10Marostegui) A bookworm install also, fails, so I guess the FS is corrupted. 
[09:11:26] !log cumin -x 'A:swift-fe' "disable-puppet 'Disabling puppet for ratelimit deploy - cgoubert'" [09:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:30] ^ Emperor [09:11:42] (03CR) 10Clément Goubert: [C:03+2] tls_terminator: Convert size to kB for rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1300077 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [09:11:56] (03PS1) 10Elukey: Upstream release v12.8.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1300709 [09:12:21] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.8.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1300709 (owner: 10Elukey) [09:12:32] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:12:51] (03PS1) 10Marostegui: installserver: Format es2045 entirely [puppet] - 10https://gerrit.wikimedia.org/r/1300710 (https://phabricator.wikimedia.org/T428852) [09:14:40] (03CR) 10Blake: [C:03+1] ProductionServices.php: switch filebackend.php back to rdb1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300580 (https://phabricator.wikimedia.org/T291916) (owner: 10Effie Mouzeli) [09:15:36] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300077 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [09:15:54] (03CR) 10Marostegui: [C:03+2] installserver: Format es2045 entirely [puppet] - 10https://gerrit.wikimedia.org/r/1300710 (https://phabricator.wikimedia.org/T428852) (owner: 10Marostegui) [09:19:31] (03PS1) 10Clément Goubert: tls_terminator: Add missing hits_addend stanza [puppet] - 10https://gerrit.wikimedia.org/r/1300712 (https://phabricator.wikimedia.org/T414440) [09:19:57] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1219: Migration of db1219.eqiad.wmnet completed [09:19:58] !log 
cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [09:21:06] (03PS1) 10Blake: mediawiki::web::vhost: Use utf-8 for text/plain and text/html. [puppet] - 10https://gerrit.wikimedia.org/r/1300713 (https://phabricator.wikimedia.org/T428772) [09:21:55] (03PS1) 10Clément Goubert: ratelimit-media: Limits in kB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300714 (https://phabricator.wikimedia.org/T414440) [09:21:57] (03CR) 10Clément Goubert: [C:03+2] tls_terminator: Add missing hits_addend stanza [puppet] - 10https://gerrit.wikimedia.org/r/1300712 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [09:24:38] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:25:22] (03CR) 10Hnowlan: [C:03+2] thumbor: emit structured logs from haproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300211 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [09:25:24] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2176: Migration of db2176.codfw.wmnet completed [09:25:25] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [09:26:03] (03PS1) 10Elukey: Revert "setup.py: install setuptools for Python > 3.11" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1300719 [09:26:14] (03CR) 10Elukey: [V:03+2 C:03+2] Revert "setup.py: install setuptools for Python > 3.11" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1300719 (owner: 10Elukey) [09:26:20] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2045.codfw.wmnet with OS bookworm [09:26:27] (03PS1) 10Clément Goubert: Revert "tls_terminator: Add missing hits_addend stanza" [puppet] - 
10https://gerrit.wikimedia.org/r/1300720 [09:26:41] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2045.codfw.wmnet with OS trixie [09:26:53] (03PS1) 10Clément Goubert: Revert "tls_terminator: Convert size to kB for rate limiting" [puppet] - 10https://gerrit.wikimedia.org/r/1300721 [09:27:32] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:27:37] (03Merged) 10jenkins-bot: thumbor: emit structured logs from haproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300211 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [09:29:04] (03CR) 10Clément Goubert: [C:03+2] Revert "tls_terminator: Add missing hits_addend stanza" [puppet] - 10https://gerrit.wikimedia.org/r/1300720 (owner: 10Clément Goubert) [09:29:27] (03CR) 10Clément Goubert: [C:03+2] Revert "tls_terminator: Convert size to kB for rate limiting" [puppet] - 10https://gerrit.wikimedia.org/r/1300721 (owner: 10Clément Goubert) [09:31:03] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1298771 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [09:32:32] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:33:19] (03PS1) 10Aleksandar Mastilovic: Always display Airflow DAG trigger configuration dialog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300723 (https://phabricator.wikimedia.org/T428872) [09:35:42] (03CR) 10Kamila Součková: [C:03+1] rest gateway: implement cost-based rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [09:37:29] (03CR) 10Daniel Kinzler: rest gateway: per-policy upfront cost (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296598 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [09:37:39] 
(03PS8) 10Daniel Kinzler: rest gateway: per-policy upfront cost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296598 (https://phabricator.wikimedia.org/T412586) [09:37:42] (03PS2) 10Aleksandar Mastilovic: Always display Airflow DAG trigger configuration dialog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300723 (https://phabricator.wikimedia.org/T428872) [09:37:50] !log uploaded spicerack_12.8.0 to apt.wikimedia.org bookworm-wikimedia,trixie-wikimedia [09:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:52] (03PS19) 10Daniel Kinzler: rest gateway: implement cost-based rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) [09:38:02] (03PS9) 10Daniel Kinzler: rest gateway: per-policy upfront cost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296598 (https://phabricator.wikimedia.org/T412586) [09:38:11] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: cost limits for action=parse (shadow mode) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) (owner: 10Daniel Kinzler) [09:40:28] jouncebot: now [09:40:28] For the next 0 hour(s) and 19 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T0800) [09:40:40] (03CR) 10Kamila Součková: [C:03+1] rest gateway: per-policy upfront cost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296598 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [09:40:51] we have the next window, but I will do a backport now if that is ok [09:41:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jiji@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300580 (https://phabricator.wikimedia.org/T291916) (owner: 10Effie Mouzeli) [09:42:01] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
es2045.codfw.wmnet with reason: host reimage [09:42:32] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:42:33] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:42:38] (03PS2) 10Blake: mediawiki::web::vhost: Use utf-8 for text/plain and text/html. [puppet] - 10https://gerrit.wikimedia.org/r/1300713 (https://phabricator.wikimedia.org/T428772) [09:42:52] (03Merged) 10jenkins-bot: ProductionServices.php: switch filebackend.php back to rdb1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300580 (https://phabricator.wikimedia.org/T291916) (owner: 10Effie Mouzeli) [09:43:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12008969 (10Jclark-ctr) a:03Jclark-ctr [09:43:17] !log jiji@deploy1003 Started scap sync-world: Backport for [[gerrit:1300580|ProductionServices.php: switch filebackend.php back to rdb1013 (T291916 T419976)]] [09:43:24] T291916: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 [09:43:24] T419976: Upgrade redis_misc hosts to Debian Trixie (Redis 8.0) - https://phabricator.wikimedia.org/T419976 [09:44:02] (03CR) 10Jelto: "As I said in our meeting I'd also set the `profile::gitlab::gitlab_ssh_host` for the replicas. 
It would be nice to set the hostname first " [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [09:45:18] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for DiscussionTools on group 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300727 (https://phabricator.wikimedia.org/T426039) [09:45:37] !log jiji@deploy1003 jiji: Backport for [[gerrit:1300580|ProductionServices.php: switch filebackend.php back to rdb1013 (T291916 T419976)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:45:59] (03PS5) 10Daniel Kinzler: rest-gateway: cost limits for action=parse (shadow mode) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) [09:46:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2045.codfw.wmnet with reason: host reimage [09:46:48] (03PS1) 10Marostegui: Revert "installserver: Format es2045 entirely" [puppet] - 10https://gerrit.wikimedia.org/r/1300728 [09:47:32] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:51:08] (03PS1) 10Volans: wmcs-backups: fix openstack access [puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) [09:51:21] (03PS6) 10Daniel Kinzler: rest-gateway: cost limits for action=parse (shadow mode) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) [09:52:32] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:53:39] (03PS1) 10Gkyziridis: wgRestSandboxSpecs: Add Lift Wing API to documentation wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300731 (https://phabricator.wikimedia.org/T427902) [09:54:37] !log jiji@deploy1003 jiji: 
Continuing with deployment [09:55:56] 06SRE, 06DBA, 06Infrastructure-Foundations: Reimage failure when partitioning and keeping /srv - https://phabricator.wikimedia.org/T428852#12009023 (10Marostegui) [09:57:32] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:58:59] !log jiji@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300580|ProductionServices.php: switch filebackend.php back to rdb1013 (T291916 T419976)]] (duration: 15m 41s) [09:59:05] T291916: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 [09:59:06] T419976: Upgrade redis_misc hosts to Debian Trixie (Redis 8.0) - https://phabricator.wikimedia.org/T419976 [09:59:37] 06SRE, 06DBA, 06Infrastructure-Foundations: Reimage failure when partitioning and keeping /srv - https://phabricator.wikimedia.org/T428852#12009028 (10Marostegui) 05Open→03Resolved a:03Marostegui The reimage formatting `/srv/` went fine, so I guess it was a case of partition corruption. I will recl... 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1000) [10:00:22] (03PS1) 10Marostegui: Revert "es2042: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1300732 [10:01:02] (03PS1) 10JavierMonton: stream: webrequest.page_view_stats.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) [10:01:07] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [10:01:39] (03PS2) 10Volans: wmcs-backups: fix openstack access [puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) [10:01:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es2046', diff saved to https://phabricator.wikimedia.org/P94068 and previous config saved to /var/cache/conftool/dbconfig/20260611-100145-marostegui.json [10:01:54] (03CR) 10Marostegui: [C:03+2] Revert "es2042: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1300732 (owner: 10Marostegui) [10:02:17] (03PS1) 10Dreamy Jazz: HCaptcha: Return 'forceshowcaptcha' error when CAPTCHA forced [extensions/ConfirmEdit] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300734 (https://phabricator.wikimedia.org/T426476) [10:02:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool es2046', diff saved to https://phabricator.wikimedia.org/P94069 and previous config saved to /var/cache/conftool/dbconfig/20260611-100221-marostegui.json [10:02:23] jouncebot: nowandnext [10:02:23] For the next 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1000) [10:02:23] In 1 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1200) [10:02:27] (03CR) 10Brouberol: [C:03+1] Always display Airflow DAG trigger configuration dialog [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1300723 (https://phabricator.wikimedia.org/T428872) (owner: 10Aleksandar Mastilovic) [10:02:29] (03CR) 10Brouberol: [C:03+2] Always display Airflow DAG trigger configuration dialog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300723 (https://phabricator.wikimedia.org/T428872) (owner: 10Aleksandar Mastilovic) [10:02:52] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [10:03:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300734 (https://phabricator.wikimedia.org/T426476) (owner: 10Dreamy Jazz) [10:03:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300727 (https://phabricator.wikimedia.org/T426039) (owner: 10Dreamy Jazz) [10:04:05] (03PS1) 10Marostegui: es2045: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1300735 (https://phabricator.wikimedia.org/T428572) [10:04:40] (03Merged) 10jenkins-bot: hCaptcha: Enable for DiscussionTools on group 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300727 (https://phabricator.wikimedia.org/T426039) (owner: 10Dreamy Jazz) [10:05:03] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [10:05:38] (03CR) 10Marostegui: [C:03+2] es2045: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1300735 (https://phabricator.wikimedia.org/T428572) (owner: 10Marostegui) [10:06:04] (03CR) 10Filippo Giunchedi: [C:03+2] etcd: make etcdctl work out of the box [puppet] - 10https://gerrit.wikimedia.org/r/1299545 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [10:06:29] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [10:07:13] (03CR) 10Volans: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1300730 
(https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [10:07:28] effie, Dreamy_Jazz: is the train deployment done? I'd like to get started on deploying a couple of patches for the rest-gateway chart... Raine doesn't seem to be around. [10:08:05] duesen: I am around in case you need any help, other than that, my backport is done [10:08:16] Yeah, the train is probably in the other timeslot today [10:08:21] I'm currently using scap [10:08:27] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [10:08:55] ok, I'll go ahead and merge the patches then. I'll give you another heads up before i deploy. probably in 10 to 15 minutes. [10:09:01] ack [10:09:02] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [10:09:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2045.codfw.wmnet with OS trixie [10:09:21] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: cost limits for action=parse (shadow mode) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) (owner: 10Daniel Kinzler) [10:09:25] duesen: I am around, I +1'd your stuff :D [10:09:27] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: per-policy upfront cost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296598 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [10:09:33] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: implement cost-based rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [10:10:16] (03PS1) 10Michael Große: fix: correct intake-url and payload type for NCS experiment events [extensions/WikimediaEvents] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300736 (https://phabricator.wikimedia.org/T422295) [10:10:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC afternoon 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300736 (https://phabricator.wikimedia.org/T422295) (owner: 10Michael Große) [10:10:43] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [10:10:51] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:11:11] (03CR) 10Mforns: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) (owner: 10JavierMonton) [10:12:09] (03Merged) 10jenkins-bot: rest gateway: implement cost-based rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228535 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [10:12:14] (03Merged) 10jenkins-bot: rest gateway: per-policy upfront cost [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296598 (https://phabricator.wikimedia.org/T412586) (owner: 10Daniel Kinzler) [10:12:16] (03Merged) 10jenkins-bot: rest-gateway: cost limits for action=parse (shadow mode) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296534 (https://phabricator.wikimedia.org/T405472) (owner: 10Daniel Kinzler) [10:12:32] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [10:12:49] (03CR) 10Majavah: [C:03+1] "I still don't think it's necessary to use this for instance/glance backups, since those happen from the same DC where we can rely on cloud" [puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [10:13:44] (03CR) 10Mforns: [C:03+1] stream: webrequest.page_view_stats.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) (owner: 10JavierMonton) [10:16:12] effie, Raine: deploying now [10:16:31] 🍿 [10:17:32] FIRING: [15x] 
SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [10:18:08] (03CR) 10Volans: "But is the cloud-private availability something necessary/required/by design?" [puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [10:18:28] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:18:59] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:19:34] (03Merged) 10jenkins-bot: HCaptcha: Return 'forceshowcaptcha' error when CAPTCHA forced [extensions/ConfirmEdit] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300734 (https://phabricator.wikimedia.org/T426476) (owner: 10Dreamy Jazz) [10:20:06] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1300734|HCaptcha: Return 'forceshowcaptcha' error when CAPTCHA forced (T426476)]], [[gerrit:1300727|hCaptcha: Enable for DiscussionTools on group 1 wikis (T426039)]] [10:20:12] T426476: DiscussionTools hCaptcha: When user encounters AbuseFilter hCaptcha challenge no indication is shown they need to resubmit their edit - https://phabricator.wikimedia.org/T426476 [10:20:13] T426039: hCaptcha DiscussionTools: Rollout to WMF wikis - https://phabricator.wikimedia.org/T426039 [10:21:05] (03PS1) 10Filippo Giunchedi: etcd: switch etcdctl auth based on 'use_client_certs' [puppet] - 10https://gerrit.wikimedia.org/r/1300738 (https://phabricator.wikimedia.org/T313030) [10:21:23] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300738 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [10:22:16] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1300734|HCaptcha: Return 'forceshowcaptcha' error when CAPTCHA forced (T426476)]], [[gerrit:1300727|hCaptcha: Enable for DiscussionTools on group 1 wikis (T426039)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:22:32] FIRING: [15x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [10:22:43] (03CR) 10JavierMonton: stream: webrequest.page_view_stats.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) (owner: 10JavierMonton) [10:22:48] staging looks good, applying to codfw [10:22:49] (03CR) 10Majavah: [C:03+1] "Yes, it is there to support this exact communication." [puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [10:22:59] (03CR) 10Hashar: "I have manually retriggered the postmerge build by heading to the CI server (`contint.wikimedia.org`) and running:" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1300132 (owner: 10Muehlenhoff) [10:23:01] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:23:34] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:25:27] Still testing mine [10:26:48] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [10:26:49] (03PS3) 10Volans: wmcs-backups: fix openstack access [puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) [10:27:02] (03CR) 10Filippo Giunchedi: [C:03+1] "FWIW I don't feel strongly though I'd rather be consistent and use the proxy regardless if we're communicating with the api" [puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [10:27:19] (03CR) 10Volans: "Ok, limited to cinder backups then." 
[puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [10:27:47] (03CR) 10Filippo Giunchedi: "PCC is failing due to being unable to detect hosts, a manual run is at https://puppet-compiler.wmflabs.org/output/1300738/8706/conf2004.co" [puppet] - 10https://gerrit.wikimedia.org/r/1300738 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [10:30:59] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] "Trivial change not impacting anything, self-merging" [puppet] - 10https://gerrit.wikimedia.org/r/1300738 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [10:31:08] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300734|HCaptcha: Return 'forceshowcaptcha' error when CAPTCHA forced (T426476)]], [[gerrit:1300727|hCaptcha: Enable for DiscussionTools on group 1 wikis (T426039)]] (duration: 11m 01s) [10:31:14] T426476: DiscussionTools hCaptcha: When user encounters AbuseFilter hCaptcha challenge no indication is shown they need to resubmit their edit - https://phabricator.wikimedia.org/T426476 [10:31:15] T426039: hCaptcha DiscussionTools: Rollout to WMF wikis - https://phabricator.wikimedia.org/T426039 [10:32:26] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:33:02] applying to eqiad [10:33:07] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:36:13] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8707/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [10:37:08] (03CR) 10Majavah: [V:03+1 C:03+1] wmcs-backups: fix openstack access [puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [10:38:11] (03CR) 10Filippo Giunchedi: [C:03+1] wmcs-backups: fix openstack access [puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [10:40:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by javiermonton@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) (owner: 10JavierMonton) [10:40:25] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [10:40:29] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [10:43:06] effie, Raine: ok done, all good! [10:43:14] (03CR) 10CI reject: [V:04-1] stream: webrequest.page_view_stats.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) (owner: 10JavierMonton) [10:43:28] duesen: nice \o/ [10:43:33] cheers [10:44:32] (03PS1) 10Muehlenhoff: thumbor: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300744 [10:44:37] (03PS1) 10Filippo Giunchedi: wmcs: switch to production systemd unit alerting [alerts] - 10https://gerrit.wikimedia.org/r/1300745 (https://phabricator.wikimedia.org/T428873) [10:44:55] 10SRE-SLO, 10observability, 06SRE Observability (FY2025/2026-Q1): Add a banner to slo.wikimedia.org explaining rolling vs calendar views - https://phabricator.wikimedia.org/T398313#12009187 (10hnowlan) 05Open→03Declined Pyrra is no longer in use [10:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [10:48:16] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:48:36] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1232: Upgrading db1232.eqiad.wmnet [10:48:56] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1232: Upgrading db1232.eqiad.wmnet [10:49:48] (03PS1) 10Elukey: docker_registry: refactor the nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) [10:52:13] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1232.eqiad.wmnet with OS trixie [10:52:16] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:52:36] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2188: Upgrading db2188.codfw.wmnet [10:52:53] (03PS2) 10Elukey: docker_registry: refactor the nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) [10:52:58] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2188: Upgrading db2188.codfw.wmnet [10:53:32] (03PS3) 10Elukey: docker_registry: refactor the nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) [10:53:44] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300746 (https://phabricator.wikimedia.org/T427175) (owner: 10Elukey) [10:54:28] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300744 (owner: 10Muehlenhoff) [10:55:35] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2188.codfw.wmnet with OS trixie [10:57:32] FIRING: [13x] SLOBudgetBurn: Search update lag is below 95% target in codfw - 
https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [10:58:36] 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q1): Reduce Pyrra's default window from 12w to 4w - https://phabricator.wikimedia.org/T395916#12009230 (10hnowlan) 05Open→03Declined Pyrra is no longer in use [10:58:40] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [10:59:43] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [10:59:59] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:00:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12009237 (10Volans) Should the alert on icinga be acked/downtimed? What about the open incident on splunk? (If not resolved it would alert again tomorrow IIRC) [11:00:31] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host testreduce1002.eqiad.wmnet [11:00:32] FIRING: Processor usage over 85%: Alert for device ssw1-d8-eqiad.mgmt.eqiad.wmnet - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [11:00:43] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:00:49] (03CR) 10Volans: [C:03+2] wmcs-backups: fix openstack access [puppet] - 10https://gerrit.wikimedia.org/r/1300730 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [11:02:29] (03PS1) 10Effie Mouzeli: aliases: rdb1011 has been decommed [puppet] - 10https://gerrit.wikimedia.org/r/1300748 [11:02:33] jouncebot: nowandnext [11:02:33] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [11:02:33] In 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1200) [11:02:42] (03CR) 10Majavah: [C:03+1] icinga: remove toolschecker-based checks [puppet] - 10https://gerrit.wikimedia.org/r/1298742 (https://phabricator.wikimedia.org/T313030) 
(owner: 10Filippo Giunchedi) [11:02:54] (03PS1) 10Dreamy Jazz: HCaptcha: Return 'forceshowcaptcha' error when CAPTCHA forced [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300749 (https://phabricator.wikimedia.org/T426476) [11:02:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1262.eqiad.wmnet with reason: crash [11:03:17] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300750 [11:03:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12009246 (10Marostegui) I just gave it 7 days of downtime and resolved it on splunk. [11:03:35] (03CR) 10Bartosz Dziewoński: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1300713 (https://phabricator.wikimedia.org/T428772) (owner: 10Blake) [11:03:43] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for DiscussionTools on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300751 (https://phabricator.wikimedia.org/T426039) [11:03:44] (03CR) 10Clément Goubert: [C:03+1] wgRestSandboxSpecs: Add Lift Wing API to documentation wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300731 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [11:03:56] (03CR) 10Majavah: [C:03+1] toolforge: remove checker access from k8s::etcd [puppet] - 10https://gerrit.wikimedia.org/r/1299546 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [11:04:06] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [11:04:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300749 (https://phabricator.wikimedia.org/T426476) (owner: 10Dreamy Jazz) [11:04:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap 
backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300751 (https://phabricator.wikimedia.org/T426039) (owner: 10Dreamy Jazz) [11:04:25] (03CR) 10Majavah: [C:03+1] Remove toolschecker role/profile [puppet] - 10https://gerrit.wikimedia.org/r/1299547 (https://phabricator.wikimedia.org/T313030) (owner: 10Filippo Giunchedi) [11:04:28] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1002.eqiad.wmnet [11:04:31] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc2001.codfw.wmnet [11:05:08] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [11:05:32] RESOLVED: Processor usage over 85%: Device ssw1-d8-eqiad.mgmt.eqiad.wmnet recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [11:05:40] (03Merged) 10jenkins-bot: hCaptcha: Enable for DiscussionTools on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300751 (https://phabricator.wikimedia.org/T426039) (owner: 10Dreamy Jazz) [11:06:50] (03PS1) 10Slyngshede: Release version 0.1.17 [software/bitu] - 10https://gerrit.wikimedia.org/r/1300752 [11:06:50] (03PS1) 10Effie Mouzeli: mediawiki-common: remove retired servers from list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300753 [11:07:32] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:07:56] (03PS2) 10Slyngshede: Release version 0.1.17 [software/bitu] - 10https://gerrit.wikimedia.org/r/1300752 [11:07:59] (03PS2) 10Effie Mouzeli: mediawiki-common: remove retired redis servers from list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300753 [11:08:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [11:08:21] (03PS1) 10Clément Goubert: rest-gateway: Restore no-cache for lw-openapi-server [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1300754 (https://phabricator.wikimedia.org/T427902) [11:08:56] (03PS3) 10Slyngshede: Release version 0.1.17 [software/bitu] - 10https://gerrit.wikimedia.org/r/1300752 [11:09:11] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1232.eqiad.wmnet with reason: host reimage [11:09:49] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2001.codfw.wmnet [11:09:50] (03PS3) 10Effie Mouzeli: mediawiki-common: remove retired redis servers from list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300753 [11:11:08] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:11:16] jmm@cumin2002 drain-node (PID 2019480) is awaiting input [11:11:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12009270 (10cmooney) [11:11:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Raise DRBD replication speed for Ganeti clusters - https://phabricator.wikimedia.org/T428878#12009271 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:12:13] ACKNOWLEDGEMENT - SSH on db1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Marostegui https://phabricator.wikimedia.org/T428832 https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:12:14] ACKNOWLEDGEMENT - Host db1262 #page is DOWN: PING CRITICAL - Packet loss = 100% Marostegui https://phabricator.wikimedia.org/T428832 [11:12:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [11:12:32] FIRING: [17x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:12:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [11:13:14] !log hnowlan@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/thumbor: apply [11:13:39] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [11:14:27] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1232.eqiad.wmnet with reason: host reimage [11:14:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300731 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [11:15:06] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2188.codfw.wmnet with reason: host reimage [11:15:12] (03PS1) 10Hnowlan: thumbor: make log format raw in haproxy, remove bad headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300755 (https://phabricator.wikimedia.org/T368180) [11:15:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12009300 (10cmooney) @papaul if you get a few minutes to double-check the above let me know. And specifically on the phys... [11:16:26] (03Merged) 10jenkins-bot: HCaptcha: Return 'forceshowcaptcha' error when CAPTCHA forced [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1300749 (https://phabricator.wikimedia.org/T426476) (owner: 10Dreamy Jazz) [11:16:37] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: rdb101[56] implementation tracking - https://phabricator.wikimedia.org/T418918#12009317 (10jijiki) 05Open→03Resolved [11:16:47] 06SRE, 10Wikimedia-Apache-configuration: Move kr.wikimedia destination to [[m:Wikimedia Korea]] - https://phabricator.wikimedia.org/T428327#12009320 (10revi) 05Open→03Resolved Was merged, the metawiki-side switchover happened, nothing left here. Closing. 
[11:16:58] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1300749|HCaptcha: Return 'forceshowcaptcha' error when CAPTCHA forced (T426476)]], [[gerrit:1300751|hCaptcha: Enable for DiscussionTools on all wikis (T426039)]] [11:17:05] T426476: DiscussionTools hCaptcha: When user encounters AbuseFilter hCaptcha challenge no indication is shown they need to resubmit their edit - https://phabricator.wikimedia.org/T426476 [11:17:05] T426039: hCaptcha DiscussionTools: Rollout to WMF wikis - https://phabricator.wikimedia.org/T426039 [11:17:32] FIRING: [18x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [11:18:58] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2188.codfw.wmnet with reason: host reimage [11:19:08] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1300749|HCaptcha: Return 'forceshowcaptcha' error when CAPTCHA forced (T426476)]], [[gerrit:1300751|hCaptcha: Enable for DiscussionTools on all wikis (T426039)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:19:30] FIRING: Processor usage over 85%: Alert for device lsw1-c5-eqiad.mgmt.eqiad.wmnet - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [11:19:35] Testing [11:19:48] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300756 [11:21:17] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [11:25:00] (03CR) 10Blake: [C:03+1] mediawiki-common: remove retired redis servers from list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300753 (owner: 10Effie Mouzeli) [11:25:36] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300749|HCaptcha: Return 'forceshowcaptcha' error when CAPTCHA forced (T426476)]], [[gerrit:1300751|hCaptcha: Enable for DiscussionTools on all wikis (T426039)]] (duration: 08m 38s) [11:25:43] T426476: DiscussionTools hCaptcha: When user encounters AbuseFilter hCaptcha challenge no indication is shown they need to resubmit their edit - https://phabricator.wikimedia.org/T426476 [11:25:43] T426039: hCaptcha DiscussionTools: Rollout to WMF wikis - https://phabricator.wikimedia.org/T426039 [11:27:32] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [11:28:04] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300756 (owner: 10Muehlenhoff) [11:28:20] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300756 (owner: 10Muehlenhoff) [11:28:49] (03PS1) 10Volans: mwopenstackclients.py: fix proxy_url [puppet] - 10https://gerrit.wikimedia.org/r/1300758 (https://phabricator.wikimedia.org/T428867) [11:29:30] RESOLVED: Processor usage over 85%: Device lsw1-c5-eqiad.mgmt.eqiad.wmnet recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [11:32:07] (03CR) 10Majavah: [C:03+1] 
mwopenstackclients.py: fix proxy_url [puppet] - 10https://gerrit.wikimedia.org/r/1300758 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [11:32:20] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1232.eqiad.wmnet with OS trixie [11:32:41] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [11:32:45] (03CR) 10Volans: [C:03+2] mwopenstackclients.py: fix proxy_url [puppet] - 10https://gerrit.wikimedia.org/r/1300758 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [11:33:35] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [11:34:10] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts rdb2010.codfw.wmnet [11:34:15] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts rdb2008.codfw.wmnet [11:35:01] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts rdb1012.eqiad.wmnet [11:35:50] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2188.codfw.wmnet with OS trixie [11:36:25] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [11:37:43] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [11:37:58] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [11:38:47] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [11:39:25] (03PS1) 10Majavah: Update cr2-codfw cloudsw port [homer/public] - 10https://gerrit.wikimedia.org/r/1300759 (https://phabricator.wikimedia.org/T393552) [11:40:43] (03CR) 10Volans: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1300759 (https://phabricator.wikimedia.org/T393552) (owner: 10Majavah) [11:40:59] PROBLEM - SSH on stat1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:41:03] (03CR) 10Majavah: [C:03+2] Update cr2-codfw cloudsw port [homer/public] - 
10https://gerrit.wikimedia.org/r/1300759 (https://phabricator.wikimedia.org/T393552) (owner: 10Majavah) [11:41:55] RECOVERY - SSH on stat1010 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:42:32] (03Merged) 10jenkins-bot: Update cr2-codfw cloudsw port [homer/public] - 10https://gerrit.wikimedia.org/r/1300759 (https://phabricator.wikimedia.org/T393552) (owner: 10Majavah) [11:43:04] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1232: Migration of db1232.eqiad.wmnet completed [11:43:09] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rdb2010.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [11:43:51] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [11:44:29] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [11:45:01] (03PS1) 10Cathal Mooney: cr2-codfw: Correct invalid interface names for ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1300760 (https://phabricator.wikimedia.org/T393552) [11:45:09] (03CR) 10CI reject: [V:04-1] cr2-codfw: Correct invalid interface names for ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1300760 (https://phabricator.wikimedia.org/T393552) (owner: 10Cathal Mooney) [11:45:59] (03CR) 10Gkyziridis: [C:03+1] rest-gateway: Restore no-cache for lw-openapi-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300754 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [11:46:03] PROBLEM - SSH on stat1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:46:13] jiji@cumin1003 decommission (PID 3300603) is awaiting input [11:46:18] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2188: Migration of db2188.codfw.wmnet completed [11:46:43] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [11:46:44] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rdb2008.codfw.wmnet [11:46:49] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [11:46:53] RECOVERY - SSH on stat1010 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:48:05] PROBLEM - Check unit status of security_group_ssh-from-restricted-bastion_to_project_trove on cloudcontrol1006 is CRITICAL: CRITICAL: Status of the systemd unit security_group_ssh-from-restricted-bastion_to_project_trove https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:48:07] (03PS2) 10Cathal Mooney: cr2-codfw: Correct invalid interface names for ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1300760 (https://phabricator.wikimedia.org/T393552) [11:48:59] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rdb2010.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [11:48:59] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:49:00] (03PS7) 10Arnaudb: gitlab: support extra ssh host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/1298771 (https://phabricator.wikimedia.org/T425441) [11:49:00] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rdb2010.codfw.wmnet [11:49:09] (03PS3) 10Arnaudb: gitlab: advertise gitlab-ssh.wikimedia.org in UI clone URLs [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) [11:49:12] (03CR) 10Cathal Mooney: "Taavi beat me to it with another patch, so just modifying the transport cct interface references in this one." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1300760 (https://phabricator.wikimedia.org/T393552) (owner: 10Cathal Mooney) [11:49:31] FIRING: Processor usage over 85%: Alert for device lsw1-c7-eqiad.mgmt.eqiad.wmnet - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [11:49:46] that is not good :( [11:49:52] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:49:53] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rdb1012.eqiad.wmnet [11:50:35] (03CR) 10Majavah: [C:03+1] cr2-codfw: Correct invalid interface names for ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1300760 (https://phabricator.wikimedia.org/T393552) (owner: 10Cathal Mooney) [11:52:49] (03PS1) 10Effie Mouzeli: site.pp: remove retired redis hosts [puppet] - 10https://gerrit.wikimedia.org/r/1300761 (https://phabricator.wikimedia.org/T428858) [11:53:09] (03PS1) 10Cathal Mooney: Revert "Nokia SR-Linux: get specific component status with gnmic" [puppet] - 10https://gerrit.wikimedia.org/r/1300762 [11:53:19] (03CR) 10Cathal Mooney: [C:03+2] cr2-codfw: Correct invalid interface names for ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1300760 (https://phabricator.wikimedia.org/T393552) (owner: 10Cathal Mooney) [11:53:43] (03CR) 10CI reject: [V:04-1] Revert "Nokia SR-Linux: get specific component status with gnmic" [puppet] - 10https://gerrit.wikimedia.org/r/1300762 (owner: 10Cathal Mooney) [11:54:23] (03PS1) 10Urbanecm: Remove GrowthExperiments extension from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300764 (https://phabricator.wikimedia.org/T428884) [11:54:33] jouncebot: nowandnext [11:54:33] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [11:54:33] In 0 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1200) [11:54:39] (03Merged) 10jenkins-bot: cr2-codfw: 
Correct invalid interface names for ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1300760 (https://phabricator.wikimedia.org/T393552) (owner: 10Cathal Mooney) [11:54:45] (03CR) 10Urbanecm: [C:03+2] Remove GrowthExperiments extension from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300764 (https://phabricator.wikimedia.org/T428884) (owner: 10Urbanecm) [11:55:44] (03CR) 10Cathal Mooney: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1300762 (owner: 10Cathal Mooney) [11:56:03] (03Merged) 10jenkins-bot: Remove GrowthExperiments extension from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300764 (https://phabricator.wikimedia.org/T428884) (owner: 10Urbanecm) [11:56:11] (03PS4) 10Arnaudb: gitlab: advertise gitlab-ssh url on gitlab replicas [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) [11:56:11] (03CR) 10Arnaudb: "Done, that PS has been amended to expose replicas first, we'll be able to test them ahead of the primary instance change in 1300763" [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [11:56:22] (03CR) 10Muehlenhoff: "hieradata/hosts/rdb1012.yaml also needs to be removed" [puppet] - 10https://gerrit.wikimedia.org/r/1300761 (https://phabricator.wikimedia.org/T428858) (owner: 10Effie Mouzeli) [11:56:45] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1300764|Remove GrowthExperiments extension from closed wikis (T428884)]] [11:56:50] T428884: Remove GrowthExperiments extension from closed wikis - https://phabricator.wikimedia.org/T428884 [11:58:05] RECOVERY - Check unit status of security_group_ssh-from-restricted-bastion_to_project_trove on cloudcontrol1006 is OK: OK: Status of the systemd unit security_group_ssh-from-restricted-bastion_to_project_trove https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:58:38] (03CR) 10Ayounsi: [C:03+1] "+1 once 
the commit message is updated to make CI happy" [puppet] - 10https://gerrit.wikimedia.org/r/1300762 (owner: 10Cathal Mooney) [11:58:51] 06SRE, 10homer, 06Infrastructure-Foundations, 10netops: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12009524 (10cmooney) p:05Triage→03Medium Thanks @taavi Yes we can do some validation in Homer to avoid this I think, I'... [11:58:51] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1300764|Remove GrowthExperiments extension from closed wikis (T428884)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:59:09] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1300752 (owner: 10Slyngshede) [11:59:12] !log urbanecm@deploy1003 urbanecm: Continuing with deployment [11:59:13] (03PS1) 10Arnaudb: gitlab: advertise gitlab-ssh url on gitlab primary [puppet] - 10https://gerrit.wikimedia.org/r/1300763 (https://phabricator.wikimedia.org/T425441) [11:59:31] RESOLVED: Processor usage over 85%: Device lsw1-c7-eqiad.mgmt.eqiad.wmnet recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25 [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1200) [12:01:09] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [12:01:21] (03PS2) 10Cathal Mooney: Revert "Nokia SR-Linux: get specific component status with gnmic" [puppet] - 10https://gerrit.wikimedia.org/r/1300762 [12:01:24] (03CR) 10Slyngshede: [C:03+2] Release version 0.1.17 [software/bitu] - 10https://gerrit.wikimedia.org/r/1300752 (owner: 
10Slyngshede) [12:01:35] (03CR) 10Cathal Mooney: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1300762 (owner: 10Cathal Mooney) [12:02:37] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1298781 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [12:03:39] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300764|Remove GrowthExperiments extension from closed wikis (T428884)]] (duration: 06m 53s) [12:03:44] T428884: Remove GrowthExperiments extension from closed wikis - https://phabricator.wikimedia.org/T428884 [12:06:05] (03PS1) 10Muehlenhoff: Depool puppetserver2002 for rack maintenance [dns] - 10https://gerrit.wikimedia.org/r/1300766 (https://phabricator.wikimedia.org/T428020) [12:09:10] (03CR) 10Muehlenhoff: [C:03+2] sre.puppet.disable-merges: New cookbook to disable Puppet merges temporarily [cookbooks] - 10https://gerrit.wikimedia.org/r/1295425 (https://phabricator.wikimedia.org/T248872) (owner: 10Muehlenhoff) [12:09:30] (03CR) 10Cathal Mooney: [C:03+2] Revert "Nokia SR-Linux: get specific component status with gnmic" [puppet] - 10https://gerrit.wikimedia.org/r/1300762 (owner: 10Cathal Mooney) [12:10:29] !log installing openjdk-21 security updates on Bookworm [12:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:01] RECOVERY - Host db1262 #page is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [12:12:44] * Raine is confused [12:13:31] Raine: it crashed tonight, was acked a bit ago and now is back alive [12:13:42] see the related phab task [12:13:52] T428832 [12:13:53] T428832: db1262 crashed - https://phabricator.wikimedia.org/T428832 [12:14:06] ah, OK, thank you volans <3 [12:16:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:17:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [12:19:36] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300770 (owner: 10L10n-bot) [12:21:28] !log remove ganeti5006 from eqsin cluster for reimage T428229 [12:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:33] T428229: eqsin: re-image rack 604 servers on new vlan - https://phabricator.wikimedia.org/T428229 [12:24:05] PROBLEM - ganeti-confd running on ganeti5006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:24:13] (03PS1) 10Daniel Kinzler: rest-gateway: put request ID into rate limit respose [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300775 [12:24:50] FIRING: ProbeDown: Service ganeti5006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:54] (03PS1) 10Giuseppe Lavagetto: UX improvements [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1300776 [12:26:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5006.eqsin.wmnet with OS bookworm [12:26:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12009664 (10Jclark-ctr) Server is back up right now. I've updated the firmware and opened a Dell support ticket. Please leave this ticket open for one week while I work with Dell to trouble... 
[12:26:43] !log jmm@cumin2002 START - Cookbook sre.hosts.move-vlan for host ganeti5006 [12:27:10] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] UX improvements [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1300776 (owner: 10Giuseppe Lavagetto) [12:27:57] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:28:11] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "UX improvements - oblivian@cumin1003" [12:28:13] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: UX improvements - oblivian@cumin1003 [12:28:33] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1232: Migration of db1232.eqiad.wmnet completed [12:28:34] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:29:04] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: UX improvements - oblivian@cumin1003 [12:29:05] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "UX improvements - oblivian@cumin1003" [12:31:47] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2188: Migration of db2188.codfw.wmnet completed [12:31:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:32:05] (03PS1) 10Slyngshede: IDM: Upgrade IDM to Bitu v0.1.17 [dns] - 10https://gerrit.wikimedia.org/r/1300783 [12:33:52] jmm@cumin2002 reimage (PID 2033402) is awaiting input [12:36:08] (03CR) 10Ladsgroup: [C:03+1] thumbor: make log format raw in haproxy, remove bad headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300755 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [12:41:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: 
db1262 crashed - https://phabricator.wikimedia.org/T428832#12009718 (10Marostegui) Thank you John. I won't pool the host back in production until you've done all the things you need with Dell. Thank you! [12:42:59] jouncebot: nowandnext [12:42:59] For the next 0 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1200) [12:42:59] In 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1300) [12:43:35] (03PS1) 10Marostegui: db1262: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1300784 (https://phabricator.wikimedia.org/T428832) [12:43:52] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1300784 (https://phabricator.wikimedia.org/T428832) (owner: 10Marostegui) [12:44:07] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:44:28] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1234: Upgrading db1234.eqiad.wmnet [12:44:33] (03CR) 10Marostegui: [C:03+2] db1262: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1300784 (https://phabricator.wikimedia.org/T428832) (owner: 10Marostegui) [12:44:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1234: Upgrading db1234.eqiad.wmnet [12:46:02] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ganeti5006 - jmm@cumin2002" [12:46:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ganeti5006 - jmm@cumin2002" [12:46:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:46:08] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ganeti5006.eqsin.wmnet 9.0.132.10.in-addr.arpa 
9.0.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [12:46:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ganeti5006.eqsin.wmnet 9.0.132.10.in-addr.arpa 9.0.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [12:46:13] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5006 [12:47:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti5006 [12:47:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ganeti5006 [12:47:48] cwilliams@cumin1003 major-upgrade (PID 3321466) is awaiting input [12:53:09] (03CR) 10Muehlenhoff: [C:03+2] cumin2003: Add host Hiera config [puppet] - 10https://gerrit.wikimedia.org/r/1300141 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [12:54:27] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for MobileFrontend on all group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300787 (https://phabricator.wikimedia.org/T425940) [12:54:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300787 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [12:55:50] !log installing Exim security updates on Bullseye [12:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:26] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1234.eqiad.wmnet with OS trixie [12:58:15] For today's backport window, I have a config change but I won't be available until ~30min past the hour. I can self deploy at that point though, or whenever everyone else is done. 
[13:00:04] Lucas_WMDE, urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1300). [13:00:05] alexsanford, MichaelG_WMF, georgekyz, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] \o [13:00:18] o/ [13:00:32] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [13:00:42] hey :) [13:01:17] \o [13:02:00] (I'm around but it looks like others are) [13:02:06] MichaelG_WMF: you need a deployer, right? [13:02:08] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2202.codfw.wmnet with OS trixie [13:02:17] @Lucas_WMDE yes, please! [13:02:22] okay, I can deploy :) [13:02:25] but also I just saw it’s a backport [13:02:27] georgekyz: I'm around for when your change is up [13:02:31] so let’s do a config change first [13:02:43] (there is nothing to test for my change as wmf.6 will only reach enwiki later today, I think) [13:02:49] ok [13:02:50] I have a config change as well, I am available right now, if there are not other deployments at the time I can fire it up [13:02:57] georgekyz: yup, go ahead [13:03:01] thnx a lot! 
[13:03:05] and then I’ll run gate-and-submit for the backport in the meantime [13:03:07] I am shooting it [13:03:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300731 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [13:04:23] (03Merged) 10jenkins-bot: wgRestSandboxSpecs: Add Lift Wing API to documentation wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300731 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [13:04:48] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1300731|wgRestSandboxSpecs: Add Lift Wing API to documentation wikis (T427902)]] [13:04:53] T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files. - https://phabricator.wikimedia.org/T427902 [13:06:04] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300736 (https://phabricator.wikimedia.org/T422295) (owner: 10Michael Große) [13:06:54] !log echo 'https://api.wikimedia.org/service/lw/specs/openapi.yaml' | mwscript-k8s --attach -- purgeList.php [13:06:55] !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1300731|wgRestSandboxSpecs: Add Lift Wing API to documentation wikis (T427902)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:33] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:07:40] Proceed to the deployment ? [13:08:06] georgekyz: did you test the change? 
^^ [13:08:15] georgekyz: yeah please [13:08:21] Lucas_WMDE: change is broken but it's not the change [13:08:30] it's the backend messing CORS up [13:08:39] I'm deploying the rest-gateway fix [13:08:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Degraded RAID on an-worker1201 - https://phabricator.wikimedia.org/T428571#12009879 (10BTullis) Hi @jclark-ctr please go ahead and deal whenever is convenient. Thanks. [13:08:49] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Restore no-cache for lw-openapi-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300754 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [13:09:07] alright hitting continue then ? [13:09:07] so please go forth georgekyz [13:09:12] !log gkyziridis@deploy1003 gkyziridis: Continuing with deployment [13:10:04] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [13:10:19] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [13:10:24] (03Merged) 10jenkins-bot: fix: correct intake-url and payload type for NCS experiment events [extensions/WikimediaEvents] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300736 (https://phabricator.wikimedia.org/T422295) (owner: 10Michael Große) [13:11:06] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [13:11:21] (03Merged) 10jenkins-bot: rest-gateway: Restore no-cache for lw-openapi-server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300754 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [13:11:27] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [13:11:31] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:11:43] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:11:48] !log 
cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:12:18] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:12:22] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [13:12:43] !log urbanecm@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki '--reason=per [[:phab:T428900]]' Wikimedia_Apps/iOS_FAQ 'Wikimedia Apps/FAQ/iOS' 'Martin Urbanec (WMF)' # T428900 [13:12:44] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [13:12:47] T428900: Request to move translatable pages: Wikimedia Apps/iOS FAQ and Wikimedia Apps/Android FAQ - https://phabricator.wikimedia.org/T428900 [13:12:52] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1234.eqiad.wmnet with reason: host reimage [13:13:08] !log sudo -i reprepro --noskipold --component thirdparty/openstack-trixie-flamingo-backports update trixie-wikimedia [13:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:35] !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300731|wgRestSandboxSpecs: Add Lift Wing API to documentation wikis (T427902)]] (duration: 08m 47s) [13:13:39] T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files. - https://phabricator.wikimedia.org/T427902 [13:13:55] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [13:14:17] Finished! Thank you all! [13:14:26] !log urbanecm@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki '--reason=per [[:phab:T428900]]' Wikimedia_Apps/Android_FAQ 'Wikimedia Apps/FAQ/Android' 'Martin Urbanec (WMF)' # T428900 [13:14:30] thanks! 
taking over with MichaelG_WMF’s backport [13:15:06] hmm, I’m not getting any responses from the spiderpig API all of a sudden… [13:15:28] ok, it’s back [13:15:41] (03PS1) 10Atsuko: toolhub: switch staging to test opensearch cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300795 (https://phabricator.wikimedia.org/T426073) [13:15:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5006.eqsin.wmnet with reason: host reimage [13:16:03] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1300736|fix: correct intake-url and payload type for NCS experiment events (T422295)]] [13:16:08] T422295: [V2 experiment release] Mobile web account creation form improvements + username TL;DR - https://phabricator.wikimedia.org/T422295 [13:17:45] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [13:18:05] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [13:18:09] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, migr: Backport for [[gerrit:1300736|fix: correct intake-url and payload type for NCS experiment events (T422295)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:18:17] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2202.codfw.wmnet with reason: host reimage [13:18:18] MichaelG_WMF: nothing to test, you said? 
[13:18:26] Lucas_WMDE: correct [13:18:37] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, migr: Continuing with deployment [13:18:38] alright [13:18:50] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1234.eqiad.wmnet with reason: host reimage [13:19:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1300783 (owner: 10Slyngshede) [13:20:57] FIRING: ProbeDown: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:21:41] (03PS1) 10Clément Goubert: rest-gateway: Vary lw-openapi-server cache on Origin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300797 (https://phabricator.wikimedia.org/T427902) [13:21:50] (03CR) 10CI reject: [V:04-1] rest-gateway: Vary lw-openapi-server cache on Origin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300797 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [13:22:03] (03PS2) 10Clément Goubert: rest-gateway: Vary lw-openapi-server cache on Origin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300797 (https://phabricator.wikimedia.org/T427902) [13:22:32] (03PS2) 10JavierMonton: stream: webrequest.page_view_stats.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) [13:22:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5006.eqsin.wmnet with reason: host reimage [13:22:55] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300736|fix: correct intake-url and payload type for NCS experiment events (T422295)]] (duration: 06m 51s) [13:22:59] T422295: [V2 experiment release] Mobile web account creation form improvements + username 
TL;DR - https://phabricator.wikimedia.org/T422295 [13:23:10] Dreamy_Jazz: over to you! [13:23:19] Thanks [13:23:47] Lucas_WMDE: Thank you for the backport ❤️ [13:24:15] !log urbanecm@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki '--reason=per [[:phab:T428900]]' Wikimedia_Apps/Android_FAQ 'Wikimedia Apps/FAQ/Android' 'Martin Urbanec (WMF)' # T428900 [13:24:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300787 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [13:24:20] T428900: Request to move translatable pages: Wikimedia Apps/iOS FAQ and Wikimedia Apps/Android FAQ - https://phabricator.wikimedia.org/T428900 [13:25:16] (03Merged) 10jenkins-bot: hCaptcha: Enable for MobileFrontend on all group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300787 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [13:25:27] !log urbanecm@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki '--reason=per [[:phab:T428900]]' Wikimedia_Apps/Android_FAQ 'Wikimedia Apps/FAQ/Android' 'Martin Urbanec (WMF)' # T428900 [13:25:28] (03CR) 10Gkyziridis: [C:03+1] rest-gateway: Vary lw-openapi-server cache on Origin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300797 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [13:25:40] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1300787|hCaptcha: Enable for MobileFrontend on all group1 wikis (T425940)]] [13:25:45] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [13:25:57] RESOLVED: ProbeDown: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:30] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Vary lw-openapi-server cache on Origin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300797 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [13:26:52] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2202.codfw.wmnet with reason: host reimage [13:27:47] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1300787|hCaptcha: Enable for MobileFrontend on all group1 wikis (T425940)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:28:19] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [13:28:46] (03Merged) 10jenkins-bot: rest-gateway: Vary lw-openapi-server cache on Origin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300797 (https://phabricator.wikimedia.org/T427902) (owner: 10Clément Goubert) [13:28:52] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:28:56] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:29:11] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:29:22] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:29:27] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:29:47] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:32:39] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300787|hCaptcha: Enable for MobileFrontend on all group1 wikis (T425940)]] (duration: 06m 59s) [13:32:44] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - 
https://phabricator.wikimedia.org/T425940 [13:32:55] jouncebot: nowandnext [13:32:55] For the next 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1300) [13:32:55] In 0 hour(s) and 57 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1430) [13:33:44] Dreamy_Jazz: alexsanford’s patch is still left, if they’re ready now [13:33:54] (03CR) 10Slyngshede: [C:03+2] IDM: Upgrade IDM to Bitu v0.1.17 [dns] - 10https://gerrit.wikimedia.org/r/1300783 (owner: 10Slyngshede) [13:34:00] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [13:34:00] Yeah was looking at that :D [13:34:03] I am! I'll do that now [13:34:10] okay! [13:34:11] (03CR) 10Ottomata: [C:03+1] stream: webrequest.page_view_stats.dev0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) (owner: 10JavierMonton) [13:34:18] !log slyngshede@dns1004 START - running authdns-update [13:34:36] you’ll notice we timed it perfectly so the other three deploys would take exactly 30 minutes [13:34:46] Amazing :D [13:34:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by alexsanford@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298890 (https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [13:34:53] :D [13:34:55] !log installing dovecot security updates [13:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:55] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1234.eqiad.wmnet with OS trixie [13:36:06] !log slyngshede@dns1004 END - running authdns-update [13:36:13] (03Merged) 10jenkins-bot: Add 2FA enforcement demotion config for phase 3 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298890
(https://phabricator.wikimedia.org/T423120) (owner: 10Alex.sanford) [13:36:36] !log alexsanford@deploy1003 Started scap sync-world: Backport for [[gerrit:1298890|Add 2FA enforcement demotion config for phase 3 groups (T423120)]] [13:36:41] T423120: FY25-26 Q4: Phase 3 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423120 [13:37:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) (owner: 10JavierMonton) [13:37:59] (03PS1) 10Volans: wmcs cinder backups: set temporary retention [puppet] - 10https://gerrit.wikimedia.org/r/1300802 (https://phabricator.wikimedia.org/T428867) [13:38:43] !log alexsanford@deploy1003 alexsanford: Backport for [[gerrit:1298890|Add 2FA enforcement demotion config for phase 3 groups (T423120)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:39:40] !log alexsanford@deploy1003 alexsanford: Continuing with deployment [13:42:23] (03CR) 10Majavah: [C:03+1] wmcs cinder backups: set temporary retention [puppet] - 10https://gerrit.wikimedia.org/r/1300802 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [13:42:32] FIRING: [18x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [13:42:33] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:42:44] (03PS1) 10Muehlenhoff: testreduce: Enable profile::auto_restarts::service for Dovecot [puppet] - 10https://gerrit.wikimedia.org/r/1300804 (https://phabricator.wikimedia.org/T135991) [13:43:10] (03CR) 10Andrew Bogott: [C:03+1] wmcs cinder backups: set temporary retention [puppet] - 10https://gerrit.wikimedia.org/r/1300802 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [13:43:26] (03CR) 10Volans: [C:03+2] wmcs cinder backups: set temporary retention [puppet] - 10https://gerrit.wikimedia.org/r/1300802 (https://phabricator.wikimedia.org/T428867) (owner: 10Volans) [13:43:56] !log alexsanford@deploy1003 Finished scap sync-world: Backport for [[gerrit:1298890|Add 2FA enforcement demotion config for phase 3 groups (T423120)]] (duration: 07m 19s) [13:44:00] T423120: FY25-26 Q4: Phase 3 of 2FA enforcement in Wikimedia production - https://phabricator.wikimedia.org/T423120 [13:44:19] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2202.codfw.wmnet with OS trixie [13:44:24] Done! 
[13:45:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300804 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:45:29] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [13:45:32] 06SRE, 10homer, 06Infrastructure-Foundations, 10netops: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12010069 (10ayounsi) Quick update after chatting about that with Cathal. For context the current implementation looks like:... [13:45:49] Dreamy_Jazz: did you still want to deploy something? [13:45:59] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [13:46:00] No, thanks [13:46:00] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:46:09] oh, but a new change by JavierMonton appeared [13:46:13] :D [13:46:21] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1234: Migration of db1234.eqiad.wmnet completed [13:46:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5006.eqsin.wmnet with OS bookworm [13:46:22] I was a bit late to schedule another config change for this backport window, but if it's "free" now, I can do it myself from spiderpig [13:46:28] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:46:34] yeah, go ahead I think [13:46:39] ok, thanks!
[13:46:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:47:31] (03CR) 10TrainBranchBot: "Approved by javiermonton@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) (owner: 10JavierMonton) [13:47:32] FIRING: [18x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [13:47:33] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:48:20] (03PS2) 10Muehlenhoff: mx-out: Enable profile::auto_restarts::service for Dovecot [puppet] - 10https://gerrit.wikimedia.org/r/1300804 (https://phabricator.wikimedia.org/T135991) [13:49:19] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [13:50:27] !log installing openssl security updates [13:50:28] !log reloading liberica config on lvs5004 [13:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:41] !log slyngshede@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs5004*} and A:liberica [13:50:43] (03Merged) 10jenkins-bot: stream: webrequest.page_view_stats.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300733 (https://phabricator.wikimedia.org/T428725) (owner: 10JavierMonton) [13:50:45] !log javiermonton@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1300733|stream: webrequest.page_view_stats.dev0 (T428725)]] [13:50:49] T428725: Relative Trending - Milestone 2 - Load baseline into Kafka - https://phabricator.wikimedia.org/T428725 [13:50:57] 06SRE, 10DNS, 06Traffic: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12010077 (10CDanis) >>! In T428573#12001514, @cmooney wrote: > It doesn't seem to have a service endpoint registered though, which I think is needed before CoreDNS will publish any records for it:... [13:51:02] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs5004*} and A:liberica [13:52:15] (03CR) 10Hnowlan: [C:03+2] thumbor: make log format raw in haproxy, remove bad headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300755 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [13:52:32] FIRING: [18x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [13:52:53] !log javiermonton@deploy1003 javiermonton: Backport for [[gerrit:1300733|stream: webrequest.page_view_stats.dev0 (T428725)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:54:29] (03Merged) 10jenkins-bot: thumbor: make log format raw in haproxy, remove bad headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300755 (https://phabricator.wikimedia.org/T368180) (owner: 10Hnowlan) [13:54:39] !log javiermonton@deploy1003 javiermonton: Continuing with deployment [13:55:38] (03PS1) 10Gkyziridis: ml-services: add liftwing-openapi-server latest version deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300806 (https://phabricator.wikimedia.org/T427902) [13:55:39] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp5020.* [13:55:55] !log slyngshede@cumin1003 conftool action : set/pooled=yes; selector: name=cp5024.* [13:57:13] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp5024.* [13:57:32] FIRING: [18x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [13:58:57] !log javiermonton@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300733|stream: webrequest.page_view_stats.dev0 (T428725)]] (duration: 08m 12s) [13:59:01] T428725: Relative Trending - Milestone 2 - Load baseline into Kafka - https://phabricator.wikimedia.org/T428725 [14:00:46] !log UTC afternoon backport+config window done [14:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:18] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux: check if we need to filter irb interfaces for DHCP relay / IPv6 RA - https://phabricator.wikimedia.org/T428908 (10cmooney) 03NEW p:05Triage→03Low [14:02:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1300804 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:02:32] FIRING: [17x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [14:03:47] (03CR) 10Muehlenhoff: [C:03+2] ganeti5006: set up custom bgp neighbors for 
private1-604-eqsin vlan [puppet] - 10https://gerrit.wikimedia.org/r/1300702 (https://phabricator.wikimedia.org/T428229) (owner: 10Muehlenhoff) [14:07:15] (03CR) 10Clément Goubert: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1300713 (https://phabricator.wikimedia.org/T428772) (owner: 10Blake) [14:07:32] RESOLVED: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [14:08:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet [14:14:27] (03CR) 10Bking: [C:03+1] toolhub: switch staging to test opensearch cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300795 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [14:18:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet [14:19:13] (03PS1) 10Cathal Mooney: gnmic: set prefix to 'openconfig' for system cpu metrics [puppet] - 10https://gerrit.wikimedia.org/r/1300812 [14:21:06] (03CR) 10Ayounsi: [C:03+1] gnmic: set prefix to 'openconfig' for system cpu metrics [puppet] - 10https://gerrit.wikimedia.org/r/1300812 (owner: 10Cathal Mooney) [14:21:42] (03CR) 10Cathal Mooney: [C:03+2] gnmic: set prefix to 'openconfig' for system cpu metrics [puppet] - 10https://gerrit.wikimedia.org/r/1300812 (owner: 10Cathal Mooney) [14:23:46] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [14:23:50] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [14:24:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5006.eqsin.wmnet to cluster eqsin02 and group 01 [14:26:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti5006.eqsin.wmnet to cluster eqsin02 and group 01 [14:30:04] Deploy window Test Kitchen Experiment Deployment Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1430) [14:31:52] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1234: Migration of db1234.eqiad.wmnet completed [14:31:53] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [14:32:57] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [14:33:04] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:33:23] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [14:33:26] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [14:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:35:29] (03PS1) 10Kamila Součková: shellbox-score: increase latency histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300817 (https://phabricator.wikimedia.org/T428904) [14:38:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12010370 (10ayounsi) Minor, but it might also be a good opportunity to inspect the air filters: https://www.juniper.net/do... 
[14:40:09] (03CR) 10CDanis: "😭" [puppet] - 10https://gerrit.wikimedia.org/r/1300721 (owner: 10Clément Goubert) [14:42:33] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:43:24] !log installing Poppler security updates [14:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [14:45:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:46:13] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#12010395 (10MarioProtIV) >>! In T428063#12007957, @Colinstu wrote: > Will anything need to be done manually on existing pages experiencing this issue? Or once the source code issue is... 
[14:48:44] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [14:51:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [14:51:18] (03PS1) 10Pppery: Export source strings (Part 1) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300820 (https://phabricator.wikimedia.org/T412650) [14:53:03] !log installing Bind security updates (just client-side tools/libraries) [14:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:50] (03CR) 10Alex Paskulin: [C:03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300806 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [14:58:41] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827#12010474 (10SLong-WMF) Hello! This is approved. Thank you. [15:00:05] dduvall and jnuche: Time to do the Train log triage deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1500). [15:00:32] (03PS1) 10C. Scott Ananian: T428849: temporarily disable noisy warnings in HandleParsoidSectionLinks [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300822 (https://phabricator.wikimedia.org/T428849) [15:00:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12010478 (10cmooney) >>! In T426343#12010370, @ayounsi wrote: > Minor, but it might also be a good opportunity to inspect... 
[15:01:24] (03PS1) 10Pppery: Update source strings (Part 2) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300823 (https://phabricator.wikimedia.org/T410849) [15:01:50] hey ops (dduvall jnuche ) we've got a fix for the logspam train blocker T428849 which i'd like to backport before the group2 train rolls [15:01:50] T428849: group1 to 1.47.0-wmf.6 T423915: MediaWiki\OutputTransform\Stages\HandleParsoidSectionLinks::transformDOM: Heading missing for anchor - https://phabricator.wikimedia.org/T428849 [15:02:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300822 (https://phabricator.wikimedia.org/T428849) (owner: 10C. Scott Ananian) [15:04:49] cscott: hi, there's nothing happening at the moment, I think you can safely backport [15:05:05] thanks! [15:06:28] (03PS1) 10Hnowlan: thumbor: correct haproxy log syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300824 [15:07:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300822 (https://phabricator.wikimedia.org/T428849) (owner: 10C. Scott Ananian) [15:09:08] (03CR) 10AOkoth: "Thank you. I can ping you after merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/1300156 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [15:09:57] (03CR) 10Hnowlan: [C:03+2] thumbor: correct haproxy log syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300824 (owner: 10Hnowlan) [15:12:18] (03Merged) 10jenkins-bot: thumbor: correct haproxy log syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300824 (owner: 10Hnowlan) [15:13:14] !log installing libdbi-perl security updates [15:13:15] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [15:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:17] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:13:58] (03PS2) 10Kamila Součková: shellbox-score: increase latency histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300817 (https://phabricator.wikimedia.org/T428904) [15:15:52] (03CR) 10BCornwall: [C:03+1] "Weird indentation at the bottom but I recognize that the whole file is not in a good state with that." [puppet] - 10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) (owner: 10Slyngshede) [15:17:53] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [15:18:01] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:18:48] (03Merged) 10jenkins-bot: T428849: temporarily disable noisy warnings in HandleParsoidSectionLinks [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300822 (https://phabricator.wikimedia.org/T428849) (owner: 10C. 
Scott Ananian) [15:19:14] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1300822|T428849: temporarily disable noisy warnings in HandleParsoidSectionLinks (T428849 T417530)]] [15:19:20] T428849: group1 to 1.47.0-wmf.6 T423915: MediaWiki\OutputTransform\Stages\HandleParsoidSectionLinks::transformDOM: Heading missing for anchor - https://phabricator.wikimedia.org/T428849 [15:19:21] T417530: Parsoid shouldn't wrap wikitext html-ish `` tags in
wrappers - https://phabricator.wikimedia.org/T417530 [15:21:18] !log cscott@deploy1003 cscott: Backport for [[gerrit:1300822|T428849: temporarily disable noisy warnings in HandleParsoidSectionLinks (T428849 T417530)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:25:37] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [15:25:56] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1235: Upgrading db1235.eqiad.wmnet [15:26:17] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1235: Upgrading db1235.eqiad.wmnet [15:26:25] !log cscott@deploy1003 cscott: Continuing with deployment [15:26:36] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [15:26:46] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db2212: Upgrading db2212.codfw.wmnet [15:27:18] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2212: Upgrading db2212.codfw.wmnet [15:29:03] (03CR) 10Ssingh: C:dumps::web::xmldumps block generic user-agents (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297102 (https://phabricator.wikimedia.org/T427836) (owner: 10Slyngshede) [15:29:48] (03CR) 10Scott French: [C:03+1] shellbox-score: increase latency histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300817 (https://phabricator.wikimedia.org/T428904) (owner: 10Kamila Součková) [15:30:22] 06SRE, 06Traffic, 13Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12010744 (10BCornwall) a:05BCornwall→03MShilova_WMF Hi, @MShilova_WMF! @SLyngshede-WMF and I have a patch ready for deployment - this deployment/enforcement will patiently wait until your signal. 
[15:30:43] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300822|T428849: temporarily disable noisy warnings in HandleParsoidSectionLinks (T428849 T417530)]] (duration: 11m 29s) [15:30:49] T428849: MediaWiki\OutputTransform\Stages\HandleParsoidSectionLinks::transformDOM: Heading missing for anchor - https://phabricator.wikimedia.org/T428849 [15:30:49] T417530: Parsoid shouldn't wrap wikitext html-ish `` tags in
wrappers - https://phabricator.wikimedia.org/T417530 [15:31:55] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [15:32:11] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [15:32:13] !log cwilliams@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [15:32:21] cwilliams@cumin1003 major-upgrade (PID 3344940) is awaiting input [15:35:17] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [15:35:27] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1235: Upgrading db1235.eqiad.wmnet [15:35:30] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:35:37] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1235: Upgrading db1235.eqiad.wmnet [15:36:47] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:38:37] cwilliams@cumin1003 major-upgrade (PID 3345617) is awaiting input [15:39:15] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1235.eqiad.wmnet with OS trixie [15:39:26] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:40:25] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db2212.codfw.wmnet with OS trixie [15:40:38] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:41:46] (03CR) 10Bartosz Dziewoński: [C:03+1] trafficserver: Add Special:OAuth/approve to multi-DC exemptions [puppet] - 10https://gerrit.wikimedia.org/r/1298383 (https://phabricator.wikimedia.org/T208443) (owner: 10Gergő Tisza) [15:44:24] (03PS1) 10Ozge: ml-services: updgrades editing suggestions versions. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300829 [15:46:19] (03PS2) 10Ozge: ml-services: upgrades editing suggestions versions. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1300829 (https://phabricator.wikimedia.org/T428740) [15:47:33] (03CR) 10Hashar: Change update to exactly match the given image name (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1166856 (owner: 10Hashar) [15:48:24] jnuche: i'm done, thanks! [15:49:08] 06SRE, 06Traffic, 13Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12010852 (10xcollazo) CC @BTullis, for visibility. [15:51:02] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300832 [15:51:26] (03CR) 10Kamila Součková: [C:03+2] shellbox-score: increase latency histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300817 (https://phabricator.wikimedia.org/T428904) (owner: 10Kamila Součková) [15:53:36] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300832 (owner: 10PipelineBot) [15:53:38] !log kamila@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [15:53:41] (03Merged) 10jenkins-bot: shellbox-score: increase latency histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300817 (https://phabricator.wikimedia.org/T428904) (owner: 10Kamila Součková) [15:53:46] !log kamila@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [15:54:12] !log kamila@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [15:54:26] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1235.eqiad.wmnet with reason: host reimage [15:54:46] !log kamila@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [15:55:00] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts rdb2007.codfw.wmnet [15:55:10] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[15:55:20] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts rdb1011.eqiad.wmnet [15:55:44] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:55:52] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300832 (owner: 10PipelineBot) [15:56:03] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts rdb2009.codfw.wmnet [15:57:37] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [15:57:50] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [15:57:51] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [15:57:57] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:58:32] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [15:58:44] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:59:07] (03CR) 10ToluAyo: [C:03+1] Gender namespaces on Serbo-Croatian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285467 (https://phabricator.wikimedia.org/T425402) (owner: 10Acamicamacaraca) [15:59:10] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1235.eqiad.wmnet with reason: host reimage [15:59:48] (03PS14) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [16:00:01] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2212.codfw.wmnet with reason: host reimage [16:00:04] jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:21] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [16:00:27] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [16:00:55] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [16:01:10] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [16:01:18] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [16:01:31] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [16:01:49] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [16:04:08] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2212.codfw.wmnet with reason: host reimage [16:04:35] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:05:17] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:05:59] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [16:06:03] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [16:07:08] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [16:07:29] jiji@cumin1003 decommission (PID 3348711) is awaiting input [16:09:04] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#12010988 (10hnowlan) 05Open→03Resolved This process was built but we have also migrated away from Pyrra. 
[16:10:14] 10SRE-SLO, 06SRE Observability (FY2025/2026-Q1): Add links in the Pyrra rolling dashboards to point to their calendar ones in Grafana - https://phabricator.wikimedia.org/T398311#12011005 (10hnowlan) 05Open→03Declined Pyrra is no longer in use [16:13:01] jiji@cumin1003 decommission (PID 3348790) is awaiting input [16:13:10] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [16:15:48] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1235.eqiad.wmnet with OS trixie [16:16:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:11] jiji@cumin1003 decommission (PID 3348886) is awaiting input [16:19:52] (03PS17) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [16:21:22] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2212.codfw.wmnet with OS trixie [16:23:21] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#12011063 (10Ejegg) We definitely want to do this as soon as it's convenient for the core team. It'll help cut down on tr... 
[16:25:24] (03PS1) 10Sergio Gimeno: Remove no longer used product_metrics.homepage_module_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300835 (https://phabricator.wikimedia.org/T365889) [16:25:39] (03PS18) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [16:25:51] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#12011077 (10AKanji-WMF) Let's aim for Q1 [16:25:52] PROBLEM - Host msw1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:26:21] ^me [16:27:30] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1235: Migration of db1235.eqiad.wmnet completed [16:27:51] (03PS19) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [16:27:53] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rdb2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [16:28:56] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#12011102 (10ssingh) This will require support from Traffic in some capacity, so please let us know and we can prioritize... 
[16:30:54] RECOVERY - Host msw1-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms [16:30:58] jiji@cumin1003 decommission (PID 3348886) is awaiting input [16:31:49] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1299590/8709/" [puppet] - 10https://gerrit.wikimedia.org/r/1299590 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [16:33:09] (03PS20) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [16:33:53] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db2212: Migration of db2212.codfw.wmnet completed [16:34:35] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox accounting report error - - https://phabricator.wikimedia.org/T428936 (10RobH) 03NEW p:05Triage→03Medium [16:34:53] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rdb2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [16:34:53] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:34:54] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rdb2009.codfw.wmnet [16:35:00] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox accounting report error - - https://phabricator.wikimedia.org/T428936#12011156 (10RobH) It isn't clear to me if this is the same error outlined on T260325. 
[16:35:04] !log jiji@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:35:05] !log jiji@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts rdb1011.eqiad.wmnet [16:35:07] !log jiji@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:35:08] !log jiji@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts rdb2007.codfw.wmnet [16:37:28] (03PS15) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [16:39:03] (03CR) 10Dzahn: [V:03+1 C:03+2] "this flips the rsync source and destination for releases uploads" [puppet] - 10https://gerrit.wikimedia.org/r/1299590 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [16:39:44] (03CR) 10Dzahn: [C:03+2] swich releases.discovery.wmnet from codfw to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1299591 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [16:41:39] !log releases.wikimedia.org - switching backend from codfw to eqiad - releases1003 is now the source of rsync for uploaded releases files (use releases.discovery.wmnet to not have to think about it) - T418299 [16:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:44] T418299: upgrade releases hosts to trixie - https://phabricator.wikimedia.org/T418299 [16:41:54] !log dzahn@dns1005 START - running authdns-update [16:42:33] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [16:43:50] !log dzahn@dns1005 END - running authdns-update [16:44:06] (03PS1) 10Pppery: Export source strings (Part 1) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300820 (https://phabricator.wikimedia.org/T412650) [16:44:42] (03PS1) 10Pppery: Update source strings (Part 2) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300823 (https://phabricator.wikimedia.org/T410849) [16:44:42] (03CR) 10Pppery: 
"Okay, I think this is ready for review now. To translatewiki: see https://translatewiki.net/wiki/User:Pppery/Renames" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300823 (https://phabricator.wikimedia.org/T410849) (owner: 10Pppery) [16:45:28] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:53] (03PS1) 10Anne Tomasevich: Donor Delight Badge: Unify on "Remove badge" language across treatments [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300842 (https://phabricator.wikimedia.org/T427313) [16:46:51] (03PS1) 10Anne Tomasevich: [A11y] Donor Badge: Remove Badge button disappears too quickly [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300843 (https://phabricator.wikimedia.org/T428646) [16:47:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300842 (https://phabricator.wikimedia.org/T427313) (owner: 10Anne Tomasevich) [16:48:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300843 (https://phabricator.wikimedia.org/T428646) (owner: 10Anne Tomasevich) [16:49:21] (03PS1) 10Pppery: Drop fund, phortune, support [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300846 (https://phabricator.wikimedia.org/T418655) [16:51:18] (03PS1) 10Dduvall: zuul: Remove `tls-server-name` from kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1300849 (https://phabricator.wikimedia.org/T424061) [16:51:23] (03PS21) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 
10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [16:51:34] FIRING: [2x] ProbeDown: Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:8080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:54:39] (03PS22) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [16:55:27] re: jinxer alert. I just switched the backend of that service. But I can't confirm there is a problem with it. works for me. [16:59:28] (03PS2) 10Pppery: Update source strings (Part 2) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300823 (https://phabricator.wikimedia.org/T410849) [17:00:00] (03PS23) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [17:00:05] bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1700) [17:02:41] (03PS2) 10Pppery: Drop fund, phortune, support [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300846 (https://phabricator.wikimedia.org/T418655) [17:02:58] (03CR) 10Ozge: [C:03+2] ml-services: upgrades editing suggestions versions. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1300829 (https://phabricator.wikimedia.org/T428740) (owner: 10Ozge) [17:03:23] (03PS2) 10Pppery: Update source strings (Part 2) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300823 (https://phabricator.wikimedia.org/T410849) [17:03:35] (03CR) 10Ozge: [V:03+2 C:03+2] ml-services: upgrades editing suggestions versions. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300829 (https://phabricator.wikimedia.org/T428740) (owner: 10Ozge) [17:05:11] (03PS24) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [17:05:14] PROBLEM - jenkins_service_running on releases2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [17:06:11] (03Merged) 10jenkins-bot: ml-services: upgrades editing suggestions versions. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300829 (https://phabricator.wikimedia.org/T428740) (owner: 10Ozge) [17:06:23] I have a developer portal build to push out in today's window. I'll get started on that shortly. [17:06:34] FIRING: [3x] ProbeDown: Service releases1003:8080 has failed probes (http_releases_jenkins_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:07:22] (03CR) 10Blake: [C:03+2] mediawiki::web::vhost: Use utf-8 for text/plain and text/html. 
[puppet] - 10https://gerrit.wikimedia.org/r/1300713 (https://phabricator.wikimedia.org/T428772) (owner: 10Blake) [17:08:07] 10ops-eqiad, 06DC-Ops: eqiad cable with unterminated end - https://phabricator.wikimedia.org/T428941 (10RobH) 03NEW p:05Triage→03Low [17:08:41] (03PS1) 10Jforrester: [abstractwiki] Set wgForceUIMsgAsContentMsg for sidebar messages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300858 (https://phabricator.wikimedia.org/T427730) [17:08:52] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [17:09:39] mutante: were you testing releases or releases-jenkins? because I do see a 503 at https://releases-jenkins.wikimedia.org/ [17:09:52] (looking at http_releases_jenkins_wikimedia_org_ip4 in the alert) [17:10:07] rzl: ACK, that's correct. I am uploading the fix for that. [17:10:16] there was another Hiera key to change [17:10:28] ah cool, thanks [17:10:34] thanks as well [17:10:47] 10ops-eqiad, 06DC-Ops: eqiad netbox script errors for 2025-06-11 - https://phabricator.wikimedia.org/T428941#12011351 (10RobH) [17:12:07] (03PS1) 10Dzahn: releases: flip where jenkins service is running and where it's masked [puppet] - 10https://gerrit.wikimedia.org/r/1300859 (https://phabricator.wikimedia.org/T418299) [17:12:40] (03CR) 10Dzahn: [C:03+2] releases: flip where jenkins service is running and where it's masked [puppet] - 10https://gerrit.wikimedia.org/r/1300859 (https://phabricator.wikimedia.org/T418299) (owner: 10Dzahn) [17:13:00] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1235: Migration of db1235.eqiad.wmnet completed [17:13:01] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [17:14:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) (owner: 10Bartosz Dziewoński) [17:15:08] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-06-09-215338 to 2026-06-11-171152 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300860 (https://phabricator.wikimedia.org/T282922) [17:15:23] (03PS1) 10BryanDavis: developer-portal: Bump container to 2026-06-11-122338-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300861 [17:17:03] (03PS25) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [17:17:58] hey folks, about to start a k8s only deploy for an apache config change [17:18:11] bjensen: you've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1300713 and have run puppet-agent on the deploy host, correct? [17:18:18] swfrench-wmf: correct [17:18:29] awesome, and your httpbb test now fails (as expected) [17:18:30] 10ops-eqiad, 06DC-Ops: eqiad netbox script errors for 2025-06-11 - https://phabricator.wikimedia.org/T428941#12011399 (10Jclark-ctr) Removed https://netbox.wikimedia.org/dcim/cables/9692/ [17:18:37] yes indeed :) [17:18:43] splendid [17:19:14] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2026-06-11-122338-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300861 (owner: 10BryanDavis) [17:19:22] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2212: Migration of db2212.codfw.wmnet completed [17:19:23] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [17:20:02] bjensen: when you run scap, consider adding a reason string to the end that mentions T428772 [17:20:02] !log blake@deploy1003 Started scap sync-world: apache config update (T428772) [17:20:02] T428772: Serve mediawiki keys.txt with UTF-8 charset - https://phabricator.wikimedia.org/T428772 [17:20:11] ah, and you did 
:) [17:20:21] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-06-09-215338 to 2026-06-11-171152 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300860 (https://phabricator.wikimedia.org/T282922) (owner: 10Jforrester) [17:20:53] !log blake@deploy1003 blake: apache config update (T428772) synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:21:10] (03PS1) 10Dzahn: Revert "releases: flip where jenkins service is running and where it's masked" [puppet] - 10https://gerrit.wikimedia.org/r/1300862 [17:21:34] FIRING: [4x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:21:35] rzl: it wasn't the fix. something about scap. reverting everything instead. [17:21:36] (03PS16) 10BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [17:21:37] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2026-06-11-122338-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300861 (owner: 10BryanDavis) [17:22:09] (03CR) 10CI reject: [V:04-1] beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) (owner: 10BryanDavis) [17:22:33] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-06-09-215338 to 2026-06-11-171152 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300860 (https://phabricator.wikimedia.org/T282922) (owner: 10Jforrester) [17:23:00] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:23:01] (03CR) 10Dzahn: [C:03+2] Revert "releases: flip where jenkins service is running and where it's 
masked" [puppet] - 10https://gerrit.wikimedia.org/r/1300862 (owner: 10Dzahn) [17:23:13] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:23:16] swfrench-wmf: it looks like the test is still failing, shall i roll back? [17:23:31] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:23:44] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:23:48] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:23:57] bjensen: oh, that's curious - does it indicate what assertion failed? [17:23:59] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [17:24:06] swfrench-wmf: yes, it's the assertion i've added [17:24:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:24:16] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:24:17] it looks like the config change may not have been functional [17:24:41] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [17:24:45] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:24:46] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [17:25:08] bjensen: got it. scap is offering you the option to exit, correct? (along with rollback, continue, etc.) [17:25:20] swfrench-wmf: yes, that's right [17:25:25] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [17:25:36] bjensen: cool, select "exit" (rollback won't do anything) [17:25:45] !log blake@deploy1003 Scap cancelled without rolling back. 
[17:26:06] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:26:21] swfrench-wmf: alright, shall i revert my patch? [17:26:23] bjensen: now, you'll want to revert your puppet patch and go through that same exercise up to the point you stopped. [17:26:34] swfrench-wmf: ack, thanks [17:26:34] FIRING: [4x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:51] (03PS1) 10Blake: Revert "mediawiki::web::vhost: Use utf-8 for text/plain and text/html." [puppet] - 10https://gerrit.wikimedia.org/r/1300863 [17:26:55] (03CR) 10VolkerE: [C:03+1] Donor Delight Badge: Unify on "Remove badge" language across treatments [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300842 (https://phabricator.wikimedia.org/T427313) (owner: 10Anne Tomasevich) [17:26:59] (03PS1) 10Dzahn: Revert "releases: switch active backend from codfw to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1300864 [17:27:15] (03CR) 10Dzahn: [C:03+2] Revert "releases: switch active backend from codfw to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1300864 (owner: 10Dzahn) [17:27:29] (03CR) 10Scott French: [C:03+1] Revert "mediawiki::web::vhost: Use utf-8 for text/plain and text/html." [puppet] - 10https://gerrit.wikimedia.org/r/1300863 (owner: 10Blake) [17:27:53] (03PS26) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [17:29:09] (03CR) 10Blake: [C:03+2] Revert "mediawiki::web::vhost: Use utf-8 for text/plain and text/html." 
[puppet] - 10https://gerrit.wikimedia.org/r/1300863 (owner: 10Blake) [17:29:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:31:09] (03PS27) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [17:31:34] RESOLVED: [4x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:31:48] 10ops-eqiad, 06DC-Ops: eqiad netbox script errors for 2025-06-11 - https://phabricator.wikimedia.org/T428941#12011436 (10RobH) [17:31:57] 10ops-eqiad, 06DC-Ops: eqiad netbox script errors for 2025-06-11 - https://phabricator.wikimedia.org/T428941#12011450 (10RobH) 05Open→03Resolved a:03RobH [17:32:59] jouncebot: nowandnext [17:32:59] For the next 0 hour(s) and 27 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1700) [17:32:59] For the next 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1700) [17:32:59] In 0 hour(s) and 27 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1800) [17:33:37] (03PS28) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [17:33:59] Reedy: we're about to roll back something via scap [17:33:59] (03PS1) 10Reedy: UploadWizard.config.php: Fix cc-by-4.0-heirs msg issue [extensions/UploadWizard] 
(wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300865 (https://phabricator.wikimedia.org/T428935) [17:34:09] heh [17:34:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:34:39] (03PS1) 10Kamila Součková: shellbox-score: increase CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300866 (https://phabricator.wikimedia.org/T428904) [17:34:44] Reedy: or, more specifically, when bjensen finishes their puppet agent run, you should be good to go [17:35:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 17.05% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:36:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:37:12] (03PS17) 10BryanDavis: beta: Add a wmf-beta-update-all timer and script [puppet] - 10https://gerrit.wikimedia.org/r/1276813 (https://phabricator.wikimedia.org/T256168) [17:38:04] FIRING: [4x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:38:21] (03PS1) 10Dzahn: Revert^2 "releases: flip where jenkins service is running and where it's masked" [puppet] - 10https://gerrit.wikimedia.org/r/1300868 [17:38:37] (03CR) 10Dzahn: [C:03+2] Revert^2 "releases: flip where jenkins 
service is running and where it's masked" [puppet] - 10https://gerrit.wikimedia.org/r/1300868 (owner: 10Dzahn) [17:39:15] Reedy: all yours, if you're ready to deploy? [17:39:51] CI isn't unhappy about the patch, so happy to try ;) [17:39:53] (03CR) 10Reedy: [C:03+2] UploadWizard.config.php: Fix cc-by-4.0-heirs msg issue [extensions/UploadWizard] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300865 (https://phabricator.wikimedia.org/T428935) (owner: 10Reedy) [17:40:03] sounds good, thanks! [17:40:05] (03PS29) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [17:42:06] (03Merged) 10jenkins-bot: UploadWizard.config.php: Fix cc-by-4.0-heirs msg issue [extensions/UploadWizard] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300865 (https://phabricator.wikimedia.org/T428935) (owner: 10Reedy) [17:42:19] (03PS30) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [17:44:29] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1300865|UploadWizard.config.php: Fix cc-by-4.0-heirs msg issue (T428935 T405146)]] [17:44:36] T428935: wrong translation key of used in 'Heirs' path: mwe-upwiz-license-cc-by-sa-4.0-text, provides misleading licensing info - https://phabricator.wikimedia.org/T428935 [17:44:36] T405146: UploadWizard should handle the case of inheriting copyright - https://phabricator.wikimedia.org/T405146 [17:46:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:46:39] bjensen: swfrench-wmf ^ should I be concerned? 
[17:46:40] !log reedy@deploy1003 reedy: Backport for [[gerrit:1300865|UploadWizard.config.php: Fix cc-by-4.0-heirs msg issue (T428935 T405146)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:46:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:47:22] hm, the same alert was firing prior, I'm not certain [17:48:04] RESOLVED: ProbeDown: Service releases2003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:48:21] !log reedy@deploy1003 reedy: Continuing with deployment [17:48:32] (03PS1) 10Jforrester: abstractwiki: Temporary config for the automatic Abstract Article generation script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300872 [17:49:52] Reedy: the MediaWikiHighErrorRate alert looks like logspam? 
(i.e., entirely unrelated to bjensen's change) [17:51:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:52:43] there's a massive amount of TypeError being thrown by Wikibase\Client\Usage\UsageDeduplicator, and it looks like it has been for hours [17:52:45] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300865|UploadWizard.config.php: Fix cc-by-4.0-heirs msg issue (T428935 T405146)]] (duration: 08m 15s) [17:52:51] T428935: wrong translation key of used in 'Heirs' path: mwe-upwiz-license-cc-by-sa-4.0-text, provides misleading licensing info - https://phabricator.wikimedia.org/T428935 [17:52:51] T405146: UploadWizard should handle the case of inheriting copyright - https://phabricator.wikimedia.org/T405146 [17:54:38] (03PS1) 10Dragoniez: jawiki: remove four rights from the eliminator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300873 (https://phabricator.wikimedia.org/T428942) [17:55:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300873 (https://phabricator.wikimedia.org/T428942) (owner: 10Dragoniez) [17:56:45] (03PS2) 10Dragoniez: jawiki: remove four rights from the eliminator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300873 (https://phabricator.wikimedia.org/T428942) [18:00:05] dduvall and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T1800). 
[18:02:24] (03PS31) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [18:07:16] (03PS32) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [18:09:43] (03PS33) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [18:10:35] (03CR) 10Scott French: [C:03+1] "I'm not sure off hand whether 64 CPU (+ sidecars) fits in the default quota. You may need to adjust that and / or revise this down." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300866 (https://phabricator.wikimedia.org/T428904) (owner: 10Kamila Součková) [18:12:18] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [18:12:28] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [18:18:43] (03PS2) 10Kamila Součková: shellbox-score: increase CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300866 (https://phabricator.wikimedia.org/T428904) [18:26:26] (03CR) 10CI reject: [V:04-1] shellbox-score: increase CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300866 (https://phabricator.wikimedia.org/T428904) (owner: 10Kamila Součková) [18:27:24] cscott: is https://phabricator.wikimedia.org/T428849 still blocking? [18:29:08] i see there was a backport but it appears to be related to the logspam part of it [18:32:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: db1262 crashed - https://phabricator.wikimedia.org/T428832#12011703 (10Jclark-ctr) I had already updated the same firmwares and uploaded new Tsr report for dell. Prior to them responding From Dell ` >I found the CPU errors you mentioned... 
[18:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:38:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:38:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:43:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:43:43] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [18:46:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:53:46] effie: do you know whether https://phabricator.wikimedia.org/T428849 is still a blocker? [18:54:18] ah, sorry it's late for you. i'll ask in slack [18:55:06] dduvall: last log at 18:09. Hopefully fixed? [18:55:32] i wasn't sure if the root cause was fixed or just the logspam [18:57:28] The logspam was the reason for the train blocking, I think. [18:57:30] E.h [18:59:11] cscott: Do you know if it counts as no longer UBN? [19:00:55] dduvall: I was just the messenger in this one, as I spotted it while looking at something else :) [19:01:15] :) ack! [19:02:19] James_F: ok, i will remove it from blockers [19:02:25] ty! 
[19:04:48] ergo, here comes the choo choo [19:05:06] (03PS1) 10TrainBranchBot: group2 to 1.47.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300891 (https://phabricator.wikimedia.org/T423915) [19:05:09] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300891 (https://phabricator.wikimedia.org/T423915) (owner: 10TrainBranchBot) [19:06:19] (03Merged) 10jenkins-bot: group2 to 1.47.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300891 (https://phabricator.wikimedia.org/T423915) (owner: 10TrainBranchBot) [19:09:42] (03PS1) 10TChin: [PageViewInfo] Add new config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300892 (https://phabricator.wikimedia.org/T411771) [19:10:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:10:53] (03PS3) 10Kamila Součková: shellbox-score: increase CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300866 (https://phabricator.wikimedia.org/T428904) [19:12:44] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.47.0-wmf.6 refs T423915 [19:12:49] T423915: 1.47.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T423915 [19:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy 
[19:21:28] 06SRE, 06Traffic, 13Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12011841 (10MShilova_WMF) @BCornwall , sounds good. Thank you! I'll update the ticket once we are ready to proceed with the deployment. [19:24:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:24:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:30:00] (03CR) 10Kamila Součková: "Good point, thank you! Fixed." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1300866 (https://phabricator.wikimedia.org/T428904) (owner: 10Kamila Součková) [19:32:53] (03PS1) 10Anne Tomasevich: Donor Delight Badge, styles: Amending to final design review feedback [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300896 (https://phabricator.wikimedia.org/T427313) [19:33:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300896 (https://phabricator.wikimedia.org/T427313) (owner: 10Anne Tomasevich) [19:45:22] (03PS1) 10Cathal Mooney: Interface ACL attachment - base on description not static yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1300900 (https://phabricator.wikimedia.org/T428886) [19:46:47] (03CR) 10CI reject: [V:04-1] Interface ACL attachment - base on description not static yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1300900 (https://phabricator.wikimedia.org/T428886) (owner: 10Cathal Mooney) [19:48:19] (03PS2) 10Cathal Mooney: Interface ACL attachment - base on description not static yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1300900 (https://phabricator.wikimedia.org/T428886) [19:54:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:59:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.17% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T2000). [20:00:05] cscott and annet: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] o/ [20:00:37] (03PS34) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [20:01:08] (03CR) 10Dzahn: [C:03+2] zuul: Remove `tls-server-name` from kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/1300849 (https://phabricator.wikimedia.org/T424061) (owner: 10Dduvall) [20:03:41] (03CR) 10Aklapper: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1298763 (https://phabricator.wikimedia.org/T405596) (owner: 10Aklapper) [20:07:40] cscott: around for backport? 
[20:08:34] Gonna start the backport window with annet's patches [20:09:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:12:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300842 (https://phabricator.wikimedia.org/T427313) (owner: 10Anne Tomasevich) [20:12:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300843 (https://phabricator.wikimedia.org/T428646) (owner: 10Anne Tomasevich) [20:12:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300896 (https://phabricator.wikimedia.org/T427313) (owner: 10Anne Tomasevich) [20:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:17:08] (03Merged) 10jenkins-bot: Donor Delight Badge: Unify on "Remove badge" language across treatments [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300842 (https://phabricator.wikimedia.org/T427313) (owner: 10Anne Tomasevich) [20:17:10] (03Merged) 10jenkins-bot: [A11y] Donor Badge: Remove Badge button disappears too quickly [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300843 
(https://phabricator.wikimedia.org/T428646) (owner: 10Anne Tomasevich) [20:17:12] (03Merged) 10jenkins-bot: Donor Delight Badge, styles: Amending to final design review feedback [extensions/WikimediaCustomizations] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300896 (https://phabricator.wikimedia.org/T427313) (owner: 10Anne Tomasevich) [20:17:31] !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1300842|Donor Delight Badge: Unify on "Remove badge" language across treatments (T427313)]], [[gerrit:1300843|[A11y] Donor Badge: Remove Badge button disappears too quickly (T428646)]], [[gerrit:1300896|Donor Delight Badge, styles: Amending to final design review feedback (T427313)]] [20:17:39] T427313: Donor badge experiment: Final design review and adjustments for donor badge - https://phabricator.wikimedia.org/T427313 [20:17:39] T428646: [A11y] Donor Badge: Remove Badge button disappears too quickly - https://phabricator.wikimedia.org/T428646 [20:19:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:19:40] (03Abandoned) 10Dzahn: Revert "releases: switch active backend from codfw to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1300864 (owner: 10Dzahn) [20:19:53] jouncebot: nowandnext [20:19:53] For the next 0 hour(s) and 40 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T2000) [20:19:54] In 0 hour(s) and 40 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T2100) [20:20:09] (03PS1) 10Dreamy Jazz: RadioRangeBallot: Fix strict mode issue [extensions/SecurePoll] 
(wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300905 (https://phabricator.wikimedia.org/T428947) [20:20:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300905 (https://phabricator.wikimedia.org/T428947) (owner: 10Dreamy Jazz) [20:20:35] \o [20:20:54] I have a backport to fix voting on votewiki for the UCoC related active poll [20:21:00] I can self-deploy [20:21:43] Dreamy_Jazz: sounds good, I'll let you know when my patches are done (might take a bit, l10n changes in progress). [20:21:50] Thanks [20:25:20] (03PS1) 10Eric Gardner: Restore MediaViewer toggle in Special:Preferences [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300906 (https://phabricator.wikimedia.org/T428742) [20:34:51] !log jdrewniak@deploy1003 annet, jdrewniak: Backport for [[gerrit:1300842|Donor Delight Badge: Unify on "Remove badge" language across treatments (T427313)]], [[gerrit:1300843|[A11y] Donor Badge: Remove Badge button disappears too quickly (T428646)]], [[gerrit:1300896|Donor Delight Badge, styles: Amending to final design review feedback (T427313)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug [20:34:51] ). Changes can now be verified there. 
[20:34:57] T427313: Donor badge experiment: Final design review and adjustments for donor badge - https://phabricator.wikimedia.org/T427313 [20:34:57] T428646: [A11y] Donor Badge: Remove Badge button disappears too quickly - https://phabricator.wikimedia.org/T428646 [20:35:59] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host releases2003.codfw.wmnet with OS trixie [20:39:16] !log jdrewniak@deploy1003 annet, jdrewniak: Continuing with deployment [20:43:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:46:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.58% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:48:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:50:11] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for badlogin for all small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300911 (https://phabricator.wikimedia.org/T426875) [20:50:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300911 (https://phabricator.wikimedia.org/T426875) (owner: 10Dreamy Jazz) [20:51:15] 
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.71% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:51:42] !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300842|Donor Delight Badge: Unify on "Remove badge" language across treatments (T427313)]], [[gerrit:1300843|[A11y] Donor Badge: Remove Badge button disappears too quickly (T428646)]], [[gerrit:1300896|Donor Delight Badge, styles: Amending to final design review feedback (T427313)]] (duration: 34m 10s) [20:51:48] T427313: Donor badge experiment: Final design review and adjustments for donor badge - https://phabricator.wikimedia.org/T427313 [20:51:48] T428646: [A11y] Donor Badge: Remove Badge button disappears too quickly - https://phabricator.wikimedia.org/T428646 [20:52:35] Dreamy_Jazz: that took a bit, but all done now [20:52:43] Thanks [20:53:04] cscott: You here? [20:53:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:53:16] If not, I'll start on mine [20:53:29] I think he backported that earlier in the day [20:53:34] Yeah looks like it [20:53:38] You can go for it [20:53:49] I have a patch to do after yours if Readers don't mind [20:53:59] (03CR) 10Scott French: [C:03+1] "Ah, turns out the default is 90. Sorry about that." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1300866 (https://phabricator.wikimedia.org/T428904) (owner: 10Kamila Součková) [20:54:03] Sure, yeah [20:54:07] I'll ping you [20:54:15] Thanks [20:54:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300911 (https://phabricator.wikimedia.org/T426875) (owner: 10Dreamy Jazz) [20:54:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300905 (https://phabricator.wikimedia.org/T428947) (owner: 10Dreamy Jazz) [20:55:10] Yeah I backported it before group2 rolled. [20:55:43] (03Merged) 10jenkins-bot: hCaptcha: Enable for badlogin for all small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300911 (https://phabricator.wikimedia.org/T426875) (owner: 10Dreamy Jazz) [20:56:22] (03Merged) 10jenkins-bot: RadioRangeBallot: Fix strict mode issue [extensions/SecurePoll] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300905 (https://phabricator.wikimedia.org/T428947) (owner: 10Dreamy Jazz) [20:56:39] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1300911|hCaptcha: Enable for badlogin for all small wikis (T426875)]], [[gerrit:1300905|RadioRangeBallot: Fix strict mode issue (T428947)]] [20:56:46] T426875: hCaptcha: Support usage in "always challenge" SiteKey for badlogin - https://phabricator.wikimedia.org/T426875 [20:56:46] T428947: TypeError: strcmp(): Argument #2 ($string2) must be of type string, int given - https://phabricator.wikimedia.org/T428947 [20:57:10] (03CR) 10Dzahn: [C:04-1] "the http_port also needs to be changed. in old config we use 8080 because that's where jenkins listens. 
but then it would be http, not htt" [puppet] - 10https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [20:57:34] Hi all – I'm planning to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MultimediaViewer/+/1300890 during the upcoming readers backport window. Should hopefully be quick [20:57:55] EricGardner: New i18n means it won't be. [20:58:08] EricGardner: Expect ~40 minutes, sadly. [20:58:45] Bummer. This is technically restoring some i18n that was erroneously deleted but I imagine it will not make a difference here [20:58:51] Yeah, no, sadly not. [20:59:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:59:17] (03PS1) 10Arlolra: Avoid the escaping from nowiki processing [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300913 (https://phabricator.wikimedia.org/T398967) [20:59:18] Brooke was working on fixing this issue; it might get resolved in the next few months if her continuing work lands. 
[20:59:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300913 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [20:59:35] (03PS1) 10Ahmon Dancy: profile::mariadb::beta: Initialize system schema on fresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) [21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260611T2100) [21:00:25] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1300911|hCaptcha: Enable for badlogin for all small wikis (T426875)]], [[gerrit:1300905|RadioRangeBallot: Fix strict mode issue (T428947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:00:30] (03PS1) 10BPirkle: REST: set new RestModuleOverrides variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300245 (https://phabricator.wikimedia.org/T422756) [21:00:31] (03CR) 10Subramanya Sastry: [C:03+1] Avoid the escaping from nowiki processing [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300913 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [21:00:35] Unless anyone else is waiting, I will proceed with my backport once DreamyJazz's is done [21:00:44] EricGardner: any chance I can squeeze in my patch before [21:01:02] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [21:01:07] arlolra: sure, assuming it's not another 40min i18n patch? 
[21:01:32] It is your time though, so not a big deal since you'll be here a while as it is [21:01:36] No, no i18n [21:01:39] yeah go for it [21:01:40] (03PS3) 10Dzahn: contint: switch apache proxying to jenkins to use https [puppet] - 10https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) [21:01:43] 06SRE, 07Wikimedia-production-error: Progetto:Patrolling page on itwiki is a HTTP 503 error: "Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T426841#12012266 (10BCornwall) p:05Triage→03Medium [21:01:54] Thanks [21:02:35] (03PS2) 10Ahmon Dancy: profile::mariadb::beta: Initialize system schema on fresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) [21:04:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:04:18] 06SRE, 06Data-Persistence: Update roll-restart-reboot-brokers.py to display broker id and FQDN of the broker - https://phabricator.wikimedia.org/T425747#12012271 (10BCornwall) [21:06:41] !log bblack@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text and not P{cp7008*} and A:cp - Upgrade wmfuniq to 0.3.0 () [21:07:22] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300911|hCaptcha: Enable for badlogin for all small wikis (T426875)]], [[gerrit:1300905|RadioRangeBallot: Fix strict mode issue (T428947)]] (duration: 10m 43s) [21:07:29] T426875: hCaptcha: Support usage in "always challenge" SiteKey for badlogin - https://phabricator.wikimedia.org/T426875 [21:07:29] T428947: TypeError: strcmp(): Argument #2 ($string2) must be of type string, int given - https://phabricator.wikimedia.org/T428947 [21:07:33] arlolra: Your turn [21:07:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
arlolra@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300913 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [21:10:16] (03CR) 10Ahmon Dancy: "Does this seem ok?" [puppet] - 10https://gerrit.wikimedia.org/r/1300914 (https://phabricator.wikimedia.org/T428930) (owner: 10Ahmon Dancy) [21:14:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:14:45] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:16:16] (03CR) 10BryanDavis: [C:03+1] toolhub: switch staging to test opensearch cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300795 (https://phabricator.wikimedia.org/T426073) (owner: 10Atsuko) [21:17:52] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Bliviero - https://phabricator.wikimedia.org/T428815#12012309 (10BCornwall) [21:19:30] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:19:45] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:20:19] (03Merged) 10jenkins-bot: Avoid the escaping from nowiki processing [core] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300913 (https://phabricator.wikimedia.org/T398967) (owner: 10Arlolra) [21:20:37] !log arlolra@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1300913|Avoid the escaping from nowiki processing (T398967)]] [21:20:42] T398967: Parsoid doesn't process Template:Markup correctly on cbk_zamwiki - https://phabricator.wikimedia.org/T398967 [21:21:12] (03PS1) 10BCornwall: admin: Add bliviero to analytics-private-datausers [puppet] - 10https://gerrit.wikimedia.org/r/1300915 (https://phabricator.wikimedia.org/T428815) [21:22:20] (03CR) 10RLazarus: [C:03+1] admin: Add bliviero to analytics-private-datausers [puppet] - 10https://gerrit.wikimedia.org/r/1300915 (https://phabricator.wikimedia.org/T428815) (owner: 10BCornwall) [21:22:23] !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1300913|Avoid the escaping from nowiki processing (T398967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:22:24] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Bliviero - https://phabricator.wikimedia.org/T428815#12012317 (10BCornwall) [21:22:43] (03CR) 10BCornwall: [C:03+2] admin: Add bliviero to analytics-private-datausers [puppet] - 10https://gerrit.wikimedia.org/r/1300915 (https://phabricator.wikimedia.org/T428815) (owner: 10BCornwall) [21:22:52] (03PS1) 10Dzahn: contint: add second proxy for jenkins on an external host [puppet] - 10https://gerrit.wikimedia.org/r/1300916 (https://phabricator.wikimedia.org/T418521) [21:24:25] (03CR) 10Dzahn: "I don't like about this change that I can't merge it before we schedule another maintenance window and try again." 
[puppet] - 10https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [21:24:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:24:55] (03CR) 10CI reject: [V:04-1] contint: add second proxy for jenkins on an external host [puppet] - 10https://gerrit.wikimedia.org/r/1300916 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [21:24:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Bliviero - https://phabricator.wikimedia.org/T428815#12012333 (10BCornwall) 05Open→03Resolved a:03BCornwall Hi, @BLiviero-WMF! The access has been granted and should be in ef... [21:25:31] !log arlolra@deploy1003 arlolra: Continuing with deployment [21:25:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:28:47] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on releases2003.codfw.wmnet with reason: host reimage [21:29:46] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300913|Avoid the escaping from nowiki processing (T398967)]] (duration: 09m 09s) [21:29:51] EricGardner: thanks for your patience [21:29:51] T398967: Parsoid doesn't process Template:Markup correctly on cbk_zamwiki - https://phabricator.wikimedia.org/T398967 [21:30:15] arlolra: no prob! 
[21:30:30] (03PS2) 10Dzahn: contint: add second proxy for jenkins on an external host [puppet] - 10https://gerrit.wikimedia.org/r/1300916 (https://phabricator.wikimedia.org/T418521) [21:30:33] I'm going to proceed with my patch now [21:31:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by egardner@deploy1003 using scap backport" [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300906 (https://phabricator.wikimedia.org/T428742) (owner: 10Eric Gardner) [21:32:34] (03CR) 10CI reject: [V:04-1] contint: add second proxy for jenkins on an external host [puppet] - 10https://gerrit.wikimedia.org/r/1300916 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [21:32:56] (03PS3) 10Dzahn: contint: add second proxy for jenkins on an external host [puppet] - 10https://gerrit.wikimedia.org/r/1300916 (https://phabricator.wikimedia.org/T418521) [21:33:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:34:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases2003.codfw.wmnet with reason: host reimage [21:34:12] (03Merged) 10jenkins-bot: Restore MediaViewer toggle in Special:Preferences [extensions/MultimediaViewer] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1300906 (https://phabricator.wikimedia.org/T428742) (owner: 10Eric Gardner) [21:34:27] !log egardner@deploy1003 Started scap sync-world: Backport for [[gerrit:1300906|Restore MediaViewer toggle in Special:Preferences (T428742)]] [21:34:32] T428742: MediaViewer preference disappeared from Special:Preferences - https://phabricator.wikimedia.org/T428742 [21:34:58] (03CR) 10CI reject: [V:04-1] contint: add second proxy for jenkins on an external host [puppet] - 10https://gerrit.wikimedia.org/r/1300916 (https://phabricator.wikimedia.org/T418521) (owner: 
10Dzahn) [21:36:37] (03PS3) 10Cathal Mooney: Interface ACL attachment - base on description not static yaml [homer/public] - 10https://gerrit.wikimedia.org/r/1300900 (https://phabricator.wikimedia.org/T428886) [21:37:00] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827#12012369 (10BCornwall) [21:38:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:39:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827#12012376 (10BCornwall) a:03EChukwukere-WMF [21:41:19] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for caro - https://phabricator.wikimedia.org/T426995#12012380 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03medelius [21:41:58] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827#12012383 (10BCornwall) 05Open→03In progress p:05Triage→03Medium [21:43:09] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-privatedata-users" for Mahmoud Abdelsattar (WMDE) - https://phabricator.wikimedia.org/T428416#12012390 (10BCornwall) p:05Triage→03Medium [21:43:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:44:38] 06SRE, 06Traffic, 05Cloud-Services-Origin-User, 07Cloud-Services-Worktype-Unplanned: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca - https://phabricator.wikimedia.org/T354295#12012402 (10BCornwall) [21:46:49] FIRING: HelmReleaseBadStatus: Helm 
release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:48:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:49:36] (03CR) 10Jasmine: [C:03+2] kafka-main: clean up host level overrides for kafka-main jdk 21 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1300287 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [21:49:50] (03CR) 10Jasmine: [C:03+2] kafka-main: clean up host level overrides for kafka-main jdk 21 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1300288 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [21:51:13] (03PS1) 10Dduvall: zuul: Update certificate_authority_data for new cluster [puppet] - 10https://gerrit.wikimedia.org/r/1300922 (https://phabricator.wikimedia.org/T424061) [21:51:40] !log egardner@deploy1003 egardner: Backport for [[gerrit:1300906|Restore MediaViewer toggle in Special:Preferences (T428742)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:51:45] T428742: MediaViewer preference disappeared from Special:Preferences - https://phabricator.wikimedia.org/T428742 [21:52:25] !log egardner@deploy1003 egardner: Continuing with deployment [21:56:54] (03PS1) 10Zabe: BETA: Stop writing to the old file db schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300925 (https://phabricator.wikimedia.org/T428970) [21:58:57] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host releases2003.codfw.wmnet with OS trixie [22:05:18] !log egardner@deploy1003 Finished scap sync-world: Backport for [[gerrit:1300906|Restore MediaViewer toggle in Special:Preferences (T428742)]] (duration: 30m 51s) [22:05:23] T428742: MediaViewer preference disappeared from Special:Preferences - https://phabricator.wikimedia.org/T428742 [22:05:29] Ok, that's a wrap [22:06:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:11:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:13:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:13:57] !log atsuko@deploy1003 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [22:14:02] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [22:18:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:23:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:23:50] (03CR) 10Aklapper: [C:03+2] Add locales for all remaining languages [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221224 (https://phabricator.wikimedia.org/T412651) (owner: 10Pppery) [22:24:02] (03CR) 10Aklapper: [V:03+2 C:03+2] "LGTM, thanks" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1221224 (https://phabricator.wikimedia.org/T412651) (owner: 10Pppery) [22:24:30] (03PS1) 10Bking: WIP: cirrussearch: Flesh out deployment-prep plan [puppet] - 10https://gerrit.wikimedia.org/r/1300927 (https://phabricator.wikimedia.org/T425585) [22:25:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:26:02] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [22:27:05] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[22:30:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:34:03] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks, applies cleanly locally" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300820 (https://phabricator.wikimedia.org/T412650) (owner: 10Pppery) [22:34:07] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks, applies cleanly locally" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300823 (https://phabricator.wikimedia.org/T410849) (owner: 10Pppery) [22:34:47] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:36:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:38:20] (03CR) 10Zabe: [C:03+2] BETA: Stop writing to the old file db schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300925 (https://phabricator.wikimedia.org/T428970) (owner: 10Zabe) [22:39:18] (03Merged) 10jenkins-bot: BETA: Stop writing to the old file db schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1300925 (https://phabricator.wikimedia.org/T428970) (owner: 10Zabe) [22:41:15] RESOLVED: [2x] MediaWikiHighErrorRate: 
Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:45:15] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [22:45:40] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1201 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [22:46:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:47:55] 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Follow up on multiple RAID / drive issues - https://phabricator.wikimedia.org/T426610#12012602 (10Jclark-ctr) [22:48:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Degraded RAID on an-worker1201 - https://phabricator.wikimedia.org/T428571#12012603 (10Jclark-ctr) 05Open→03Resolved @BTullis Drives have been Swapped. 
Added this to T426610 to be finished by data-platform [22:49:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:49:32] 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Follow up on multiple RAID / drive issues - https://phabricator.wikimedia.org/T426610#12012610 (10Jclark-ctr) Updated added T428571 an-worker1201 [22:54:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:58:35] 10ops-eqiad, 06SRE, 06DC-Ops: document Old line cards in eqiad Storage. 
and removal of MPC-3D-16XGE-SFPP line cards from CR1 and CR2 - https://phabricator.wikimedia.org/T428161#12012644 (10Jclark-ctr) a:05Jclark-ctr→03None [23:15:55] (03PS3) 10Pppery: Update source strings (Part 2) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300823 (https://phabricator.wikimedia.org/T410849) [23:16:29] (03CR) 10Aklapper: [V:03+2] Update source strings (Part 2) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300823 (https://phabricator.wikimedia.org/T410849) (owner: 10Pppery) [23:18:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:23:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:30:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:30:31] (03PS1) 10RLazarus: cli: argparse fix for Python 3.14 compatibility [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1300941 [23:34:33] (03PS3) 10Pppery: Drop fund, phortune, support [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300846 (https://phabricator.wikimedia.org/T418655) 
[23:35:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:35:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:37:54] (03PS4) 10Pppery: Drop fund, phortune, support [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300846 (https://phabricator.wikimedia.org/T418655) [23:40:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:44:57] (03PS1) 10Jasmine: Add new control plane wikikube-ctrl1005 to etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1300942 (https://phabricator.wikimedia.org/T418920) [23:47:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from gerrit.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:48:39] !incidents [23:48:40] 8071 (ACKED) [6x] ATSBackendErrorsHigh cache_text sre (gerrit.discovery.wmnet) [23:48:40] 8068 (RESOLVED) Host db1262 (paged) [23:49:13] !ack 8071 [23:49:13] 8071 (ACKED) [6x] ATSBackendErrorsHigh cache_text sre (gerrit.discovery.wmnet) [23:52:51] FIRING: [7x] ATSBackendErrorsHigh: 
ATS: elevated 5xx errors from gerrit.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [23:53:07] !incidents [23:53:08] 8071 (ACKED) [6x] ATSBackendErrorsHigh cache_text sre (gerrit.discovery.wmnet) [23:53:08] 8068 (RESOLVED) Host db1262 (paged) [23:55:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate