[00:02:01] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297268|Redirect unknown wikinews languages to portal (T427126)]] (duration: 07m 02s)
[00:02:05] T427126: Cleanup wikinews portal/incubator handling - https://phabricator.wikimedia.org/T427126
[00:16:31] (CR) Jasmine: [C:+2] kafka-main1007: apply host-level override in advance of trixie upgrade [0] [puppet] - https://gerrit.wikimedia.org/r/1285475 (https://phabricator.wikimedia.org/T419216) (owner: Jasmine)
[00:17:05] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main1007.eqiad.wmnet with OS trixie
[00:24:40] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:24:44] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:24:46] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:25:40] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:25:44] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:25:46] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:29:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[00:33:27] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1007.eqiad.wmnet with reason: host reimage
[00:34:27] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:23] SRE-swift-storage, EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#11987314 (TheEssay26) So is it resolved? I've seen that it's been fixed.
[00:40:02] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1007.eqiad.wmnet with reason: host reimage
[00:40:57] SRE-swift-storage, EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#11987315 (MarioProtIV) >>! In T428063#11987314, @TheEssay26 wrote: > So is it resolved? I've seen that it's been fixed. It’s not, on newer timelines edited since the bug started wi...
[00:49:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[00:56:39] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1007.eqiad.wmnet with OS trixie
[01:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1297267 (owner: TrainBranchBot)
[01:09:52] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1297792
[01:09:52] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1297792 (owner: TrainBranchBot)
[01:15:47] (CR) Jasmine: [C:+2] kafka-main1010: apply host-level override in advance of trixie upgrade [0] [puppet] - https://gerrit.wikimedia.org/r/1285478 (https://phabricator.wikimedia.org/T419216) (owner: Jasmine)
[01:16:21] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS trixie
[01:22:08] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1297792 (owner: TrainBranchBot)
[01:28:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[01:32:52] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1010.eqiad.wmnet with reason: host reimage
[01:39:10] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1010.eqiad.wmnet with reason: host reimage
[01:48:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[01:55:39] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1010.eqiad.wmnet with OS trixie
[02:08:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:57] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:29] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 119128624 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:42:29] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3620192 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:11:23] SRE-swift-storage, EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#11987367 (Bawolff) >>! In T428063#11986831, @Pppery wrote: > Hmm ... > > The timeline extension tries to store an additional `.map` file for timelines with wikilinks. That must be...
[04:18:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:28:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:54:55] (PS1) Kevin Bazira: ml-services: remove gpt-oss-safeguard-20b isvc from experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1297796 (https://phabricator.wikimedia.org/T427497)
[05:02:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:20:52] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[05:21:12] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1054: Upgrading es1054.eqiad.wmnet
[05:21:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1054: Upgrading es1054.eqiad.wmnet
[05:22:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1054.eqiad.wmnet with OS trixie
[05:22:20] (PS1) Marostegui: pc1021: Remove note [puppet] - https://gerrit.wikimedia.org/r/1297885
[05:24:42] (CR) Marostegui: [C:+2] pc1021: Remove note [puppet] - https://gerrit.wikimedia.org/r/1297885 (owner: Marostegui)
[05:29:30] ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11987474 (Marostegui) I've started mariadb and replication and at the same time I am going to leave a CPU stress test for the whole weekend to see what the host does.
[05:37:12] (CR) Ayounsi: [C:+1] eqsin: remove OSPF on ae0 direct link between CRs [homer/public] - https://gerrit.wikimedia.org/r/1297763 (https://phabricator.wikimedia.org/T424611) (owner: Cathal Mooney)
[05:37:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1054.eqiad.wmnet with reason: host reimage
[05:45:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1054.eqiad.wmnet with reason: host reimage
[05:47:04] (CR) Ayounsi: [C:+1] "nice!" [cookbooks] - https://gerrit.wikimedia.org/r/1297232 (https://phabricator.wikimedia.org/T427393) (owner: Cathal Mooney)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260605T0600)
[06:01:20] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1054.eqiad.wmnet with OS trixie
[06:02:30] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:04:38] marostegui@cumin1003 major-upgrade (PID 1236596) is awaiting input
[06:30:34] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11987569 (ayounsi)
[06:31:12] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11987573 (ayounsi) @BCornwall good idea! I opened {T428229}
[06:39:11] (PS1) Daniel Kinzler: rest-gateway: emit 401 if rate limit is 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1298031 (https://phabricator.wikimedia.org/T428184)
[06:39:59] (CR) Bartosz Wójtowicz: [C:+1] ml-services: remove gpt-oss-safeguard-20b isvc from experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1297796 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[06:50:51] (PS4) Jcrespo: backup: Add job ids for read-only backups [puppet] - https://gerrit.wikimedia.org/r/1297081 (https://phabricator.wikimedia.org/T424661)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260605T0700)
[07:01:07] (CR) Kevin Bazira: [C:+2] ml-services: remove gpt-oss-safeguard-20b isvc from experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1297796 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[07:03:17] (Merged) jenkins-bot: ml-services: remove gpt-oss-safeguard-20b isvc from experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1297796 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[07:07:56] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[07:11:00] (PS1) Elukey: setup.py: install setuptools for Python > 3.11 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1298036 (https://phabricator.wikimedia.org/T428024)
[07:14:33] (CR) Gehel: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1297775 (https://phabricator.wikimedia.org/T428053) (owner: Brouberol)
[07:14:46] (CR) Brouberol: [C:+2] kafka-ui: connect to all kafka clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1297775 (https://phabricator.wikimedia.org/T428053) (owner: Brouberol)
[07:16:50] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/kafka-ui: apply
[07:17:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/kafka-ui: apply
[07:17:16] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/kafka-ui: apply
[07:17:34] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/kafka-ui: apply
[07:25:24] (CR) Elukey: [C:+2] setup.py: install setuptools for Python > 3.11 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1298036 (https://phabricator.wikimedia.org/T428024) (owner: Elukey)
[07:27:46] SRE, Infrastructure-Foundations, Patch-For-Review: Build spicerack for Trixie - https://phabricator.wikimedia.org/T428024#11987628 (elukey) Open→Resolved a:elukey The new spicerack trixie deb has been deployed on cumin2003 (Trixie), since unit tests are passing I am inclined to close...
[07:31:25] (CR) Elukey: Provide downtime duration information in sre.mysql cookbooks (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: CWilliams)
[07:34:19] (PS3) Thiemo Kreuz (WMDE): Global rollout - Sub-ref deployments to Group 0, Group 1 and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: Svantje Lilienthal)
[07:35:10] (CR) CI reject: [V:-1] Global rollout - Sub-ref deployments to Group 0, Group 1 and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: Svantje Lilienthal)
[07:37:06] (PS1) Jgiannelos: Deploy PRV on 24 wikisource wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1298040
[07:38:51] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[07:39:06] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1054: repool after upgrade
[07:41:12] ops-eqiad, SRE, DBA, DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11987651 (Marostegui) Host crashed after a few minutes stressing its CPU: ` ------------------------------------------------------------------------------- Record: 101 Date/Time: 06/05/2026 05:30...
[07:43:13] (CR) Thiemo Kreuz (WMDE): Create dblists for wikis where CheckUser and AbuseFilter are disabled (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: Dreamy Jazz)
[07:43:15] (CR) CWilliams: Provide downtime duration information in sre.mysql cookbooks (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: CWilliams)
[07:44:19] (CR) Thiemo Kreuz (WMDE): "* It might be possible fix the problem with group2. I tried in patchset 3, but more files need to be re-generated to make the CI jobs happ" [mediawiki-config] - https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: Svantje Lilienthal)
[07:48:33] (CR) Dreamy Jazz: Create dblists for wikis where CheckUser and AbuseFilter are disabled (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: Dreamy Jazz)
[07:52:24] (CR) Dreamy Jazz: Create dblists for wikis where CheckUser and AbuseFilter are disabled (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1254217 (https://phabricator.wikimedia.org/T420063) (owner: Dreamy Jazz)
[07:55:49] (CR) Elukey: "I left some suggestions to make it a little more resilient but the rest looks good! Very nice new cookbook :)" [cookbooks] - https://gerrit.wikimedia.org/r/1239896 (https://phabricator.wikimedia.org/T327300) (owner: Ayounsi)
[07:59:17] (CR) Cathal Mooney: [C:+2] sre.hosts: Add eqsin old names to LEGACY_VLANS to support move-vlan [cookbooks] - https://gerrit.wikimedia.org/r/1297232 (https://phabricator.wikimedia.org/T427393) (owner: Cathal Mooney)
[07:59:24] (PS1) Brouberol: kafka-ui: disable latest-available version check [deployment-charts] - https://gerrit.wikimedia.org/r/1298093 (https://phabricator.wikimedia.org/T428053)
[08:00:18] (CR) Cathal Mooney: [C:+2] eqsin: remove OSPF on ae0 direct link between CRs [homer/public] - https://gerrit.wikimedia.org/r/1297763 (https://phabricator.wikimedia.org/T424611) (owner: Cathal Mooney)
[08:02:28] (Merged) jenkins-bot: sre.hosts: Add eqsin old names to LEGACY_VLANS to support move-vlan [cookbooks] - https://gerrit.wikimedia.org/r/1297232 (https://phabricator.wikimedia.org/T427393) (owner: Cathal Mooney)
[08:02:30] (Merged) jenkins-bot: eqsin: remove OSPF on ae0 direct link between CRs [homer/public] - https://gerrit.wikimedia.org/r/1297763 (https://phabricator.wikimedia.org/T424611) (owner: Cathal Mooney)
[08:03:23] SRE: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11987680 (elukey) >>! In T425528#11987201, @colewhite wrote: >>>! In T425528#11981310, @elukey wrote: >> @colewhite @tappof @andrea.denisse Hi! I have to add some ACLs to both Kafka logging clusters, I am going to add some rat...
[08:04:02] (CR) Elukey: [C:+1] kafka-ui: disable latest-available version check [deployment-charts] - https://gerrit.wikimedia.org/r/1298093 (https://phabricator.wikimedia.org/T428053) (owner: Brouberol)
[08:04:19] (CR) Brouberol: [C:+2] kafka-ui: disable latest-available version check [deployment-charts] - https://gerrit.wikimedia.org/r/1298093 (https://phabricator.wikimedia.org/T428053) (owner: Brouberol)
[08:07:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/kafka-ui: apply
[08:07:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/kafka-ui: apply
[08:07:46] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/kafka-ui: apply
[08:08:01] (PS3) CWilliams: Provide downtime duration information in sre.mysql cookbooks [software/spicerack] - https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780)
[08:08:04] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/kafka-ui: apply
[08:17:05] (PS4) CWilliams: Provide downtime duration information in sre.mysql cookbooks [software/spicerack] - https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780)
[08:24:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1054: repool after upgrade
[08:25:57] (CR) Elukey: [C:+1] Provide downtime duration information in sre.mysql cookbooks [software/spicerack] - https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: CWilliams)
[08:26:16] (PS2) Ayounsi: network data.yaml: add new per-rack vlan ranges for eqiad ab refresh [puppet] - https://gerrit.wikimedia.org/r/1297685 (https://phabricator.wikimedia.org/T418012) (owner: Cathal Mooney)
[08:26:20] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1297685 (https://phabricator.wikimedia.org/T418012) (owner: Cathal Mooney)
[08:30:38] (CR) Ayounsi: [C:+1] network data.yaml: add new per-rack vlan ranges for eqiad ab refresh [puppet] - https://gerrit.wikimedia.org/r/1297685 (https://phabricator.wikimedia.org/T418012) (owner: Cathal Mooney)
[08:32:17] ops-codfw, SRE, DC-Ops: Move test host in codfw rack B3 or D3 - https://phabricator.wikimedia.org/T428041#11987794 (ayounsi) a:Jhancock.wm Thanks, I'm all done, you can decom the server anytime now.
[08:38:28] (PS4) Thiemo Kreuz (WMDE): Global rollout - Sub-ref deployments to Group 0, Group 1 and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: Svantje Lilienthal)
[08:40:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: Svantje Lilienthal)
[08:47:34] Does anyone mind if I do a services deploy?
[08:50:08] Mvolz: o/ ideally we don't do deployments on Friday unless really necessary, but if it is urgent you can proceed
[08:51:44] SRE, Infrastructure-Foundations, netops, ServiceOps new: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11987826 (cmooney) To confirm the bug is fixed in relese 26.3.2: ` DHCP Release:26.3.2 Section:Resolved issues Functional area:System When using DHCP relay, a DHC...
[09:02:54] SRE-swift-storage, Ceph, Infrastructure-Foundations, Machine-Learning-Team (Q4 FY2025-26): Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11987833 (achou)
[09:05:16] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#11987845 (cmooney)
[09:07:06] (PS1) Brouberol: kafka: add the DescribeConfigs cluster ACL for anonymous users [puppet] - https://gerrit.wikimedia.org/r/1298097 (https://phabricator.wikimedia.org/T428234)
[09:09:46] (PS2) Brouberol: kafka: add the DescribeConfigs cluster ACL for anonymous users [puppet] - https://gerrit.wikimedia.org/r/1298097 (https://phabricator.wikimedia.org/T428234)
[09:11:26] (CR) Svantje Lilienthal: [C:+1] Global rollout - Sub-ref deployments to Group 0, Group 1 and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: Svantje Lilienthal)
[09:12:26] SRE, Infrastructure-Foundations, netops: Cookbook sre.network.configure-switch-interfaces failing on upgraded Juniper switch - https://phabricator.wikimedia.org/T428071#11987870 (ayounsi) As far as I understand the cookbook does `show configuration interfaces xe-0/0/41 | display json ` and not `show...
[09:12:34] (CR) WMDE-Fisch: [C:+1] "Since this is just temporary I guess it's fine to list all wikis instead of creating a dblist." [mediawiki-config] - https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: Svantje Lilienthal)
[09:12:36] (PS1) Ayounsi: configure_switch_interfaces: handle error case [cookbooks] - https://gerrit.wikimedia.org/r/1298100 (https://phabricator.wikimedia.org/T428071)
[09:15:00] (PS1) Ozge: ml-services: makes editing-suggestions available in both eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1298101 (https://phabricator.wikimedia.org/T427794)
[09:21:08] (CR) Ilias Sarantopoulos: [C:+1] "lgtm! just one comment: memory limits and requests seem extremely high for a service that just uses a csv unless the csv is large." [deployment-charts] - https://gerrit.wikimedia.org/r/1298101 (https://phabricator.wikimedia.org/T427794) (owner: Ozge)
[09:22:14] SRE, SRE-Access-Requests: SSH key replacement for tchanders - https://phabricator.wikimedia.org/T417056#11987896 (Tchanders) Resolved→Open Hi, I need to update my key again (hard disk failed, lost the old keys). Can we re-use this task?
[09:23:40] (CR) Ozge: [C:+2] ml-services: makes editing-suggestions available in both eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1298101 (https://phabricator.wikimedia.org/T427794) (owner: Ozge)
[09:25:48] (Merged) jenkins-bot: ml-services: makes editing-suggestions available in both eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1298101 (https://phabricator.wikimedia.org/T427794) (owner: Ozge)
[09:26:15] (CR) Ozge: [C:+2] "sorry! I have just seen this message." [deployment-charts] - https://gerrit.wikimedia.org/r/1298101 (https://phabricator.wikimedia.org/T427794) (owner: Ozge)
[09:27:20] SRE, Data-Engineering, Observability-Logging, Wikimedia-Logstash, and 3 others: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645#11987903 (BTullis) Open→Resolved I think that we...
[09:28:08] (CR) Cathal Mooney: [C:+1] configure_switch_interfaces: handle error case [cookbooks] - https://gerrit.wikimedia.org/r/1298100 (https://phabricator.wikimedia.org/T428071) (owner: Ayounsi)
[09:29:55] (PS3) Ozge: ml-services: makes editing-suggestions publicly available [deployment-charts] - https://gerrit.wikimedia.org/r/1297748 (https://phabricator.wikimedia.org/T427794)
[09:31:13] (CR) Thiemo Kreuz (WMDE): [C:+1] "I think we found a solution that works without a custom dblist. While we can't subtract groups from groups we can subtract individual wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: Svantje Lilienthal)
[09:31:34] !log ozge@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:32:00] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Cookbook sre.network.configure-switch-interfaces failing on upgraded Juniper switch - https://phabricator.wikimedia.org/T428071#11987916 (cmooney) Ok thanks! My bad on the command getting run. Let's see how we get on with the patch <3
[09:32:37] (CR) Trueg: wdqs-backend: Deployment chart for the WDQS triple-store (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: Trueg)
[09:37:29] (PS4) Ozge: ml-services: makes editing-suggestions publicly available [deployment-charts] - https://gerrit.wikimedia.org/r/1297748 (https://phabricator.wikimedia.org/T427794)
[09:37:42] (PS5) Ozge: ml-services: makes editing-suggestions publicly available [deployment-charts] - https://gerrit.wikimedia.org/r/1297748 (https://phabricator.wikimedia.org/T427794)
[09:40:44] (PS1) Tiziano Fogli: slothslos/report2drive: move secrets under profile [labs/private] - https://gerrit.wikimedia.org/r/1298117 (https://phabricator.wikimedia.org/T425795)
[09:40:59] (CR) Tiziano Fogli: [C:+2] slothslos/report2drive: move secrets under profile [labs/private] - https://gerrit.wikimedia.org/r/1298117 (https://phabricator.wikimedia.org/T425795) (owner: Tiziano Fogli)
[09:41:10] (CR) Tiziano Fogli: [V:+2 C:+2] slothslos/report2drive: move secrets under profile [labs/private] - https://gerrit.wikimedia.org/r/1298117 (https://phabricator.wikimedia.org/T425795) (owner: Tiziano Fogli)
[09:44:22] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:44:28] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:44:34] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:44:38] RECOVERY - Confd vcl based reload on cp6010 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:45:38] RECOVERY - Confd vcl based reload on cp6011 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[09:46:19] (CR) Elukey: [C:+1] "Let's also add the rule in the missing clusters!" [puppet] - https://gerrit.wikimedia.org/r/1298097 (https://phabricator.wikimedia.org/T428234) (owner: Brouberol)
[09:46:28] PROBLEM - Confd vcl based reload on cp6013 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:48:25] (CR) Brouberol: "Sorry, what missing clusters?" [puppet] - https://gerrit.wikimedia.org/r/1298097 (https://phabricator.wikimedia.org/T428234) (owner: Brouberol)
[09:50:04] (PS1) Elukey: redfish: improve add_account with AccountTypes [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1298125 (https://phabricator.wikimedia.org/T426180)
[09:50:30] (CR) Brouberol: "To clarify, the `--operation DescribeConfigs` flag was already present in _some_ clusters, but missing in the ones that I added it in." [puppet] - https://gerrit.wikimedia.org/r/1298097 (https://phabricator.wikimedia.org/T428234) (owner: Brouberol)
[09:50:36] (CR) Elukey: redfish: improve add_account with AccountTypes (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[10:01:00] (CR) Elukey: redfish: improve add_account with AccountTypes (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[10:07:22] (PS1) Tiziano Fogli: slothslos/report2drive: replace array with hash [labs/private] - https://gerrit.wikimedia.org/r/1298136 (https://phabricator.wikimedia.org/T425795)
[10:07:36] (CR) Tiziano Fogli: [C:+2] slothslos/report2drive: replace array with hash [labs/private] - https://gerrit.wikimedia.org/r/1298136 (https://phabricator.wikimedia.org/T425795) (owner: Tiziano Fogli)
[10:07:38] (CR) Tiziano Fogli: [V:+2 C:+2] slothslos/report2drive: replace array with hash [labs/private] - https://gerrit.wikimedia.org/r/1298136 (https://phabricator.wikimedia.org/T425795) (owner: Tiziano Fogli)
[10:13:05] ops-eqiad, DBA, DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240 (FCeratto-WMF) NEW
[10:13:15] ops-eqiad, DBA, DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240#11988010 (FCeratto-WMF)
[10:13:16] (CR) Clément Goubert: [C:+1] liftwing-openapi-server: Add new admin_ng service for serving OpenAPI specs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1297168 (https://phabricator.wikimedia.org/T427902) (owner: Gkyziridis)
[10:15:02] (PS6) Ozge: ml-services: makes editing-suggestions publicly available [deployment-charts] - https://gerrit.wikimedia.org/r/1297748 (https://phabricator.wikimedia.org/T427794)
[10:15:17] ops-eqiad, DBA, DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240#11988012 (FCeratto-WMF)
[10:16:28] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11988013 (cmooney) >>! In T427393#11987566, @ayounsi wrote: > @BCornwall good idea! I opened {T428229} Nice one. I think we can pro...
[10:23:33] (CR) Clément Goubert: [C:+1] ml-services: add liftwing-openapi-server deployment (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1297167 (https://phabricator.wikimedia.org/T427902) (owner: Gkyziridis)
[10:27:25] (CR) JavierMonton: [C:+1] kafka: add the DescribeConfigs cluster ACL for anonymous users [puppet] - https://gerrit.wikimedia.org/r/1298097 (https://phabricator.wikimedia.org/T428234) (owner: Brouberol)
[10:41:24] ops-eqiad, SRE, DBA, DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240#11988108 (Jclark-ctr) a:Jclark-ctr
[10:46:15] (CR) Clément Goubert: [C:+1] dns: Add liftwing-openapi-server CNAME records (3 comments) [dns] - https://gerrit.wikimedia.org/r/1297710 (https://phabricator.wikimedia.org/T427902) (owner: Gkyziridis)
[10:54:50] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:55:44] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260605T0700)
[11:00:05] jelto, arnoldokoth, mutante, and arnaudb: May I have your attention please! GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260605T1100)
[11:04:02] (CR) Klausman: [C:+1] liftwing-openapi-server: Add new admin_ng service for serving OpenAPI specs [puppet] - https://gerrit.wikimedia.org/r/1297168 (https://phabricator.wikimedia.org/T427902) (owner: Gkyziridis)
[11:07:30] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:17:30] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:24:35] (CR) Elukey: redfish: improve add_account with AccountTypes (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[11:28:26] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:29:11] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:54:51] ops-eqiad, SRE, DBA, DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240#11988275 (Jclark-ctr) Performed a flea power drain, and the server came back up. I am currently updating the BIOS, then I will pull a TSR report and open a Dell support ticket for documentation and t...
[12:03:42] (03CR) 10Ayounsi: [C:03+2] configure_switch_interfaces: handle error case [cookbooks] - 10https://gerrit.wikimedia.org/r/1298100 (https://phabricator.wikimedia.org/T428071) (owner: 10Ayounsi) [12:04:06] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Cookbook sre.network.configure-switch-interfaces failing on upgraded Juniper switch - https://phabricator.wikimedia.org/T428071#11988320 (10ayounsi) a:03ayounsi [12:06:56] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:07:13] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:07:17] (03Merged) 10jenkins-bot: configure_switch_interfaces: handle error case [cookbooks] - 10https://gerrit.wikimedia.org/r/1298100 (https://phabricator.wikimedia.org/T428071) (owner: 10Ayounsi) [12:07:49] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:08:01] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:08:33] (03PS5) 10Jcrespo: backup: Add job ids for read-only backups [puppet] - 10https://gerrit.wikimedia.org/r/1297081 (https://phabricator.wikimedia.org/T424661) [12:11:04] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297081 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [12:12:31] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. 
Could not store output files" - https://phabricator.wikimedia.org/T428063#11988425 (10Aklapper) [12:16:17] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240#11988452 (10Jclark-ctr) Dell SR 227400671 [12:19:50] (03CR) 10Jcrespo: [C:03+2] backup: Add job ids for read-only backups [puppet] - 10https://gerrit.wikimedia.org/r/1297081 (https://phabricator.wikimedia.org/T424661) (owner: 10Jcrespo) [12:23:32] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240#11988500 (10Marostegui) p:05Triage→03Medium [12:24:38] 06SRE, 10SRE-Access-Requests: SSH key replacement for tchanders - https://phabricator.wikimedia.org/T417056#11988504 (10Tchanders) >>! In T417056#11987896, @Tchanders wrote: > Hi, I need to update my key again (hard disk failed, lost the old keys). Can we re-use this task? Assuming we can us this task, here i... [12:28:26] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:28:38] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:29:35] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2202.codfw.wmnet with reason: Reboot [12:30:13] (03CR) 10Brouberol: "Oooh you mean add it by hand? 
Yes of course" [puppet] - 10https://gerrit.wikimedia.org/r/1298097 (https://phabricator.wikimedia.org/T428234) (owner: 10Brouberol) [12:30:15] (03CR) 10Brouberol: [C:03+2] kafka: add the DescribeConfigs cluster ACL for anonymous users [puppet] - 10https://gerrit.wikimedia.org/r/1298097 (https://phabricator.wikimedia.org/T428234) (owner: 10Brouberol) [12:30:29] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:30:42] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:35:37] (03CR) 10Daniel Kinzler: [C:03+2] Rakefile: Run chart specific tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [12:35:57] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:39:31] (03CR) 10Ayounsi: "I think overall I'd prefer a single dashboard for everything interfaces related (bandwidth, drop, description, etc) a bit like in LibreNMS" [alerts] - 10https://gerrit.wikimedia.org/r/1297747 (owner: 10Cathal Mooney) [12:43:57] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:51:08] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:51:53] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:02:22] (03Merged) 10jenkins-bot: Rakefile: Run chart specific tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282965 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [13:05:02] 10ops-eqiad, 06SRE, 06DC-Ops: C/D refresh Nokia switches Exhaust direction is reversed - https://phabricator.wikimedia.org/T428260 (10Jclark-ctr) 03NEW [13:06:34] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240#11988684 (10Jclark-ctr) I ran the CPU stress test for approximately 30 minutes and did not encounter any issues. I think the server is good to be repooled. Please leave the ticket open so I can provid... [13:07:37] 10ops-eqiad, 06SRE, 06DC-Ops: C/D refresh Nokia switches Exhaust direction is reversed - https://phabricator.wikimedia.org/T428260#11988686 (10Jclark-ctr) [13:08:19] 10ops-eqiad, 06SRE, 06DC-Ops: C/D refresh Nokia switches Exhaust direction is reversed - https://phabricator.wikimedia.org/T428260#11988691 (10Jclark-ctr) a:03Jclark-ctr [13:13:02] 10ops-eqiad, 06SRE, 06DC-Ops: C/D refresh Nokia switches Exhaust direction is reversed - https://phabricator.wikimedia.org/T428260#11988742 (10Jclark-ctr) {F86754991} Current fans installed are F2B. (Front to Back). They should be B2F ( Back to Front) [13:13:32] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240#11988744 (10Marostegui) @FCeratto-WMF it is probably better to start mariadb and replication and leave it replicating throughout the weekend and then repool on Monday if all works fine. [13:19:33] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240#11988758 (10FCeratto-WMF) @Marostegui db1274 is not ready for replication (there's no /srv data, no MariaDB installed and it's not in zarcillo yet) as it's part of the new batch. 
[13:20:45] 10ops-eqiad, 06SRE, 06DC-Ops: C/D refresh Nokia switches Exhaust direction is reversed - https://phabricator.wikimedia.org/T428260#11988760 (10ayounsi) From the quote in {T368959} we ordered FtB switches The row A/B are BtF {T412711}. Can you follow up with Nokia/Myriad to know if we ca "just" replace th... [13:21:04] (03CR) 10Gergő Tisza: [C:03+1] Configure wgOAuthAutoApprove['protocols'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1293173 (https://phabricator.wikimedia.org/T412542) (owner: 10Bartosz Dziewoński) [13:23:39] 07sre-alert-triage, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Alert in need of triage: PuppetFailure (instance an-test-client1002:9100) - https://phabricator.wikimedia.org/T427399#11988777 (10Gehel) [13:23:43] 07sre-alert-triage, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Alert in need of triage: ResourceQuotaMemoryLimitsWarning - https://phabricator.wikimedia.org/T426589#11988780 (10Gehel) [13:23:48] 07sre-alert-triage, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414413#11988784 (10Gehel) [13:24:44] 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Follow up on multiple RAID / drive issues - https://phabricator.wikimedia.org/T426610#11988806 (10Gehel) [13:24:48] 10ops-eqiad, 06SRE, 06DC-Ops: C/D refresh Nokia switches Exhaust direction is reversed - https://phabricator.wikimedia.org/T428260#11988807 (10ayounsi) Looks like it's a known issue since at least that comment https://phabricator.wikimedia.org/T412711#11582154 [13:25:27] 06SRE, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): archiva1002 has stale jobs in /var/cache/archiva that uses all the disk space - https://phabricator.wikimedia.org/T425083#11988821 (10Gehel) [13:28:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Wikidata Platform Team, 06Data-Platform-SRE (2026-06-05 - 2026-06-26): Q4:rack/setup/install dse-k8s-wdqs200[1-4] (formerly wdqs20[28-31]) - 
https://phabricator.wikimedia.org/T423312#11988899 (10Gehel) [13:31:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Wikidata Platform Team, and 2 others: Q4:rack/setup/install dse-k8s-wdqs100[1-3] (formerly wdqs103[6-8]) - https://phabricator.wikimedia.org/T423314#11988942 (10Gehel) [13:33:51] 10SRE-SLO, 10observability, 10Wikidata, 06Wikidata Platform Team, and 3 others: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11988994 (10Gehel) [13:35:15] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, and 3 others: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645#11989026 (10Gehel) [13:35:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1274 is not booting up - https://phabricator.wikimedia.org/T428240#11989032 (10Marostegui) Ah sure! Thanks Then let's keep the ticket opened as John mentioned whilst we wait for Dell to come back. [13:35:44] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for OSleger_WMF - https://phabricator.wikimedia.org/T428262 (10OSleger-WMF) 03NEW [13:36:43] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:36:44] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:37:17] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for OSleger_WMF - https://phabricator.wikimedia.org/T428262#11989058 (10SLopes-WMF) As Otto's manager, I approve this request. 
[13:37:49] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:38:26] (03PS1) 10Brouberol: dse-k8s-aux: define internal kafka-ui disc and svc records [dns] - 10https://gerrit.wikimedia.org/r/1298262 (https://phabricator.wikimedia.org/T428053) [13:38:28] (03PS1) 10Brouberol: Cleanup kafka-ui records pointing to the dse-k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/1298263 (https://phabricator.wikimedia.org/T428053) [13:38:30] (03PS1) 10Brouberol: aux-k8s: define the kafka-ui namespace in both clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298266 (https://phabricator.wikimedia.org/T428053) [13:38:30] (03PS1) 10Brouberol: aux-k8s: define the kafka-ui kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1298264 (https://phabricator.wikimedia.org/T428234) [13:38:32] (03PS1) 10Brouberol: aux-k8s: define the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298267 (https://phabricator.wikimedia.org/T428053) [13:38:32] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:38:34] (03PS1) 10Brouberol: dse-k8s: remove the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298268 (https://phabricator.wikimedia.org/T428053) [13:38:38] (03PS1) 10Brouberol: dse-k8s: remove kafka-ui kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1298265 (https://phabricator.wikimedia.org/T428053) [13:38:42] (03PS1) 10Brouberol: dse-k8s: remove the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298269 (https://phabricator.wikimedia.org/T428053) [13:39:10] (03CR) 10CI reject: [V:04-1] aux-k8s: define the kafka-ui kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1298264 (https://phabricator.wikimedia.org/T428234) (owner: 10Brouberol) [13:39:17] (03CR) 10CI reject: [V:04-1] dse-k8s-aux: define internal kafka-ui disc and 
svc records [dns] - 10https://gerrit.wikimedia.org/r/1298262 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [13:40:29] (03CR) 10CI reject: [V:04-1] aux-k8s: define the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298267 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [13:40:41] (03CR) 10CI reject: [V:04-1] dse-k8s: remove the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298268 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [13:40:49] (03PS2) 10Brouberol: aux-k8s: define the kafka-ui kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1298264 (https://phabricator.wikimedia.org/T428053) [13:40:49] (03PS2) 10Brouberol: dse-k8s: remove kafka-ui kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1298265 (https://phabricator.wikimedia.org/T428053) [13:40:59] (03CR) 10CI reject: [V:04-1] dse-k8s: remove the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298269 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [13:41:17] (03CR) 10Reedy: [C:03+1] Periodic jobs: add demote_ineligible_users (and _central_ counterpart) [puppet] - 10https://gerrit.wikimedia.org/r/1285315 (https://phabricator.wikimedia.org/T425396) (owner: 10Mszwarc) [13:41:25] (03CR) 10CI reject: [V:04-1] aux-k8s: define the kafka-ui kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1298264 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [13:43:20] (03PS3) 10Brouberol: aux-k8s: define the kafka-ui kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1298264 (https://phabricator.wikimedia.org/T428053) [13:43:20] (03PS3) 10Brouberol: dse-k8s: remove kafka-ui kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1298265 (https://phabricator.wikimedia.org/T428053) [13:46:20] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11989080 (10ssingh) >>! 
In T414411#11986915, @BCornwall wrote: > We discussed this and the general consensus seemed to be to just decomm the server and wait for the refresh which is happening shortly... [13:57:16] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:57:27] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:08:14] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:08:20] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:16:22] (03PS2) 10Elukey: redfish: improve add_account with AccountTypes [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1298125 (https://phabricator.wikimedia.org/T426180) [14:17:54] (03PS1) 10Ssingh: admin: update SSH key for tchanders [puppet] - 10https://gerrit.wikimedia.org/r/1298282 [14:18:37] (03PS2) 10Brouberol: aux-k8s: define the kafka-ui namespace in both clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298266 (https://phabricator.wikimedia.org/T428053) [14:18:37] (03PS2) 10Brouberol: aux-k8s: define the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298267 (https://phabricator.wikimedia.org/T428053) [14:18:37] (03PS2) 10Brouberol: dse-k8s: remove the kafka-ui namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298268 (https://phabricator.wikimedia.org/T428053) [14:18:38] (03PS2) 10Brouberol: dse-k8s: remove the kafka-ui helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298269 (https://phabricator.wikimedia.org/T428053) [14:18:39] (03PS1) 10Brouberol: CI: add aux-k8s-codfw to the list of environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298283 (https://phabricator.wikimedia.org/T428053) [14:18:48] !log javiermonton@deploy1003 
helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:18:53] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:20:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1013.eqiad.wmnet, wdqs1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:21:28] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:21:52] (03CR) 10CI reject: [V:04-1] redfish: improve add_account with AccountTypes [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1298125 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [14:22:42] (03Abandoned) 10Elukey: redfish: improve add_account with AccountTypes [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1298125 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [14:28:30] 06SRE, 06Traffic, 06Traffic-Icebox, 07Community-Wishlist, and 2 others: Support Encrypted Client Hello (ECH) on Wikimedia servers - https://phabricator.wikimedia.org/T205378#11989226 (10mikez-WMF) Hi, Just for visibility if anyone is interested: - This relevant [[ https://people.wikimedia.org/~sukhe/ec... 
[14:32:39] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:32:45] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [14:34:34] (03PS1) 10Lucas Werkmeister (WMDE): Add Wikidata configuration for WikiProject links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298293 (https://phabricator.wikimedia.org/T422935) [14:34:44] (03PS8) 10Elukey: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [14:37:14] (03CR) 10Lucas Werkmeister (WMDE): "To check that I got the long property lists for the first three WikiProjects right, I used the following snippet (in the `shell` maintenan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298293 (https://phabricator.wikimedia.org/T422935) (owner: 10Lucas Werkmeister (WMDE)) [14:37:28] 06SRE, 10SRE-Access-Requests: Requesting access to Cassandra staging for akhatun - https://phabricator.wikimedia.org/T427701#11989254 (10Raine) [14:37:29] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for Audrey Penven - https://phabricator.wikimedia.org/T427531#11989255 (10Dzahn) I verified the SSH key out of band, via email. 
[14:37:41] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] "TODO to self: would be nice to test this in mw-experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298293 (https://phabricator.wikimedia.org/T422935) (owner: 10Lucas Werkmeister (WMDE)) [14:37:42] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for Audrey Penven - https://phabricator.wikimedia.org/T427531#11989256 (10Dzahn) [14:40:26] (03CR) 10JHathaway: redfish: improve add_account with AccountTypes (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [14:40:28] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:40:28] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:41:02] (03CR) 10Tiziano Fogli: [C:03+2] puppetmaster: remove obsolete alerts [alerts] - 10https://gerrit.wikimedia.org/r/1297117 (https://phabricator.wikimedia.org/T426809) (owner: 10Tiziano Fogli) [14:42:28] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:42:28] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:43:13] (03PS1) 10Dzahn: admin: upgrade Audrey Penven from ldap_only to restricted [puppet] - 10https://gerrit.wikimedia.org/r/1298299 (https://phabricator.wikimedia.org/T427531) [14:44:13] (03CR) 10CI reject: [V:04-1] admin: upgrade Audrey Penven from ldap_only to restricted [puppet] - 10https://gerrit.wikimedia.org/r/1298299 
(https://phabricator.wikimedia.org/T427531) (owner: 10Dzahn) [14:45:06] (03PS2) 10Dzahn: admin: upgrade Audrey Penven from ldap_only to restricted [puppet] - 10https://gerrit.wikimedia.org/r/1298299 (https://phabricator.wikimedia.org/T427531) [14:45:11] (03CR) 10Ssingh: varnish: Add CSP report-only header value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [14:46:47] (03PS7) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [14:47:56] (03PS1) 10Dzahn: add new language Magahi (mag) [dns] - 10https://gerrit.wikimedia.org/r/1298301 (https://phabricator.wikimedia.org/T428266) [14:49:15] (03CR) 10CI reject: [V:04-1] dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [14:51:29] (03PS8) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [14:52:02] (03Abandoned) 10Jforrester: wmnet: Add new CNAMEs for Wikifunctions replacement evaluators [dns] - 10https://gerrit.wikimedia.org/r/1289393 (https://phabricator.wikimedia.org/T417870) (owner: 10Jforrester) [14:52:08] (03Abandoned) 10Jforrester: services: Add Wikifunctions's Rust-based evaluator ingress endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1289395 (https://phabricator.wikimedia.org/T417870) (owner: 10Jforrester) [14:52:15] (03Abandoned) 10Jforrester: services: Turn Wikifunctions's Rust-based evaluator endpoints to prod state [puppet] - 10https://gerrit.wikimedia.org/r/1289396 (https://phabricator.wikimedia.org/T417870) (owner: 10Jforrester) [14:52:20] (03Abandoned) 10Jforrester: profile::services_proxy::envoy: Add Wikifunctions's Rust-based eval endpoints [puppet] - 
10https://gerrit.wikimedia.org/r/1289397 (https://phabricator.wikimedia.org/T417870) (owner: 10Jforrester) [14:52:27] (03Abandoned) 10Jforrester: wikifunctions: Add extraFQDNs for the Rust-based evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1289399 (https://phabricator.wikimedia.org/T417870) (owner: 10Jforrester) [14:52:31] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] "> TODO to self: would be nice to test this in mw-experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298293 (https://phabricator.wikimedia.org/T422935) (owner: 10Lucas Werkmeister (WMDE)) [14:52:35] (03PS9) 10Elukey: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [15:00:32] (03PS2) 10Brouberol: dse-k8s-aux: migrate internal kafka-ui disc and svc records to k8s-aux [dns] - 10https://gerrit.wikimedia.org/r/1298262 (https://phabricator.wikimedia.org/T428053) [15:01:37] (03Abandoned) 10Brouberol: Cleanup kafka-ui records pointing to the dse-k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/1298263 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [15:08:58] (03PS1) 10Btullis: Switch from 4 wdqs namespaces to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298307 (https://phabricator.wikimedia.org/T422522) [15:11:54] (03CR) 10Elukey: redfish: improve add_account with AccountTypes (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [15:13:29] (03PS1) 10Btullis: Update the k8s deployment tokens for wdqs namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1298308 (https://phabricator.wikimedia.org/T422522) [15:13:39] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1298308 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [15:16:40] (03CR) 10Trueg: [C:03+1] "LGTM" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1298307 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [15:17:00] (03CR) 10Trueg: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1298308 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [15:29:39] (03Abandoned) 10Dzahn: Revert "ci: switch jenkins proxy target to new discovery name" [puppet] - 10https://gerrit.wikimedia.org/r/1297190 (owner: 10Dzahn) [15:35:50] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:00] 10ops-codfw, 06SRE, 06DC-Ops: codfw: move public baremetal servers to per rack vlan - https://phabricator.wikimedia.org/T428060#11989610 (10ssingh) > dns[2004-2006].wikimedia.org - Need special care to not cause traffic imbalance @ssingh Do all three need to happen at the same time? Because that's a proble... [15:49:54] (03CR) 10Atsuko: [C:03+1] Switch from 4 wdqs namespaces to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298307 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [15:50:43] (03CR) 10Atsuko: [C:03+1] Update the k8s deployment tokens for wdqs namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1298308 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [15:51:16] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 288.18 ms [15:55:51] (03PS1) 10Elukey: role::cache::{text,upload}: enable webrequest tagging globally [puppet] - 10https://gerrit.wikimedia.org/r/1298318 (https://phabricator.wikimedia.org/T402512) [16:08:33] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11989682 (10Ladsgroup) I thought I gave an update here. The bot is now running and compressing tiffs: https://commons.wikimedia.org/w/index.php?ti... 
[16:08:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:15:47] (03CR) 10Dzahn: [C:03+2] add new language Magahi (mag) [dns] - 10https://gerrit.wikimedia.org/r/1298301 (https://phabricator.wikimedia.org/T428266) (owner: 10Dzahn) [16:16:13] !log dzahn@dns1005 START - running authdns-update [16:16:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:11] !log DNS - adding new project language "mag" - Magahi - a language spoken in India and Nepal by about 12 million native speakers (T428266) [16:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:15] T428266: Create Wikipedia Magahi - https://phabricator.wikimedia.org/T428266 [16:17:43] !log dzahn@dns1005 END - running authdns-update [16:26:33] (03PS10) 10Elukey: redfish: improve add_account with AccountTypes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) [16:28:37] (03CR) 10Elukey: redfish: improve add_account with AccountTypes (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [16:33:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:57] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq 
in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:42:37] (03PS1) 10Jgreen: Switch fundraising default bastion back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1298323 [16:44:02] (03CR) 10Jgreen: [C:03+2] Switch fundraising default bastion back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1298323 (owner: 10Jgreen) [16:44:27] !log jgreen@dns1004 START - running authdns-update [16:45:59] !log jgreen@dns1004 END - running authdns-update [16:48:19] 06SRE, 10SRE-Access-Requests: SSH key replacement for tchanders - https://phabricator.wikimedia.org/T417056#11989884 (10Dzahn) a:05tappof→03None [16:58:28] (03PS1) 10Andrew Bogott: add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) [16:59:01] (03CR) 10CI reject: [V:04-1] add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [17:05:08] (03PS2) 10Andrew Bogott: add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) [17:05:40] (03PS3) 10Andrew Bogott: add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) [17:07:45] 06SRE, 10SRE-Access-Requests: SSH key replacement for tchanders - https://phabricator.wikimedia.org/T417056#11989970 (10Dzahn) Yes, in this case we can just reuse the ticket. Just adjusted assignee because we have rotating clinic duty to handle these. 
[17:07:58] (03CR) 10CI reject: [V:04-1] add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [17:08:31] (03PS4) 10Andrew Bogott: add-security-group-to-project.py [puppet] - 10https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) [17:09:39] 10ops-eqiad, 06SRE, 06DC-Ops: C/D refresh Nokia switches Exhaust direction is reversed - https://phabricator.wikimedia.org/T428260#11989975 (10RobH) I've dropped an email to our vendors to figure out if this can be something we swap: > Can previously ordered 7220 IXR- D2L AC FtB be swapped to back to fron... [17:14:36] (03PS1) 10Atsuko: admin_ng/dse-k8s: create opensearch ClusterIssuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298327 (https://phabricator.wikimedia.org/T427517) [17:15:57] (03PS1) 10VadymTS1: English Wikiversity: Add new user group "autopatrolled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298328 (https://phabricator.wikimedia.org/T428269) [17:17:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298328 (https://phabricator.wikimedia.org/T428269) (owner: 10VadymTS1) [17:20:11] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for OSleger_WMF - https://phabricator.wikimedia.org/T428262#11989991 (10Dzahn) @thcipriani Here is a request for "deployment" group. 
[17:20:13] (PS1) SD0001: tables-catalog: set betafeatures_user_counts to public visibility [puppet] - https://gerrit.wikimedia.org/r/1298329 (https://phabricator.wikimedia.org/T402145)
[17:25:01] (CR) CI reject: [V:-1] admin_ng/dse-k8s: create opensearch ClusterIssuer [deployment-charts] - https://gerrit.wikimedia.org/r/1298327 (https://phabricator.wikimedia.org/T427517) (owner: Atsuko)
[17:36:11] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) (owner: Andrew Bogott)
[17:42:19] (CR) Brouberol: "The CI is failing with `authSecret needs a key` because the same way you probably added the authKey to the private puppet repo, you also n" [deployment-charts] - https://gerrit.wikimedia.org/r/1298327 (https://phabricator.wikimedia.org/T427517) (owner: Atsuko)
[17:43:36] (PS5) Andrew Bogott: add-security-group-to-project.py [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801)
[17:43:52] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) (owner: Andrew Bogott)
[17:59:33] (CR) CDobbins: varnish: Add CSP report-only header value (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: CDobbins)
[18:00:45] (CR) Ssingh: "Can you add the PCC link for this as well? Trying to see where it is failing."
[puppet] - https://gerrit.wikimedia.org/r/1297769 (owner: CDobbins)
[18:01:59] (CR) CDobbins: "https://puppet-compiler.wmflabs.org/output/1297769/8657/cp2044.codfw.wmnet/change.cp2044.codfw.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/1297769 (owner: CDobbins)
[18:02:34] (PS6) Andrew Bogott: add-security-group-to-project.py [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801)
[18:02:41] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) (owner: Andrew Bogott)
[18:24:38] (PS7) Andrew Bogott: add-security-group-to-project.py [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801)
[18:26:01] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) (owner: Andrew Bogott)
[18:29:00] (PS8) Andrew Bogott: add-security-group-to-project.py [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801)
[18:29:05] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) (owner: Andrew Bogott)
[18:35:01] (PS9) Andrew Bogott: add-security-group-to-project.py [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801)
[18:35:12] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1298325 (https://phabricator.wikimedia.org/T422801) (owner: Andrew Bogott)
[18:42:16] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[18:47:18] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 220.92 ms
[18:53:57] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:08:46] SRE, serviceops-deprecated, Thumbor: Thumbor-k8s performance improvements - https://phabricator.wikimedia.org/T333445#11990173 (Izno)
[19:08:48] SRE, serviceops-deprecated, Thumbor, Patch-For-Review: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196#11990174 (Izno)
[19:22:54] (CR) Ladsgroup: [C:-1] "I'd also would need a sign off from the privacy team" [puppet] - https://gerrit.wikimedia.org/r/1298329 (https://phabricator.wikimedia.org/T402145) (owner: SD0001)
[19:33:38] (CR) Dzahn: "So.. yea.. we have been discussing this one before during the last migration I think. The Hiera key value here is passed through to the cl" [puppet] - https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[19:34:02] SRE, SRE-Access-Requests: Requesting access to deployment for OSleger_WMF - https://phabricator.wikimedia.org/T428262#11990183 (thcipriani) Approved. @Osleger-WMF for backports, you'll also want these bits: - Our web deploy tool [[https://wikitech.wikimedia.org/wiki/Scap/SpiderPig|SpiderPig]] also requi...
[19:36:32] (CR) Dzahn: "class profile::ci (" [puppet] - https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[19:37:44] (CR) Dzahn: "The question is: this all looks to me like it is supposed to STOP AND MASK jenkins UNLESS it is on the $manager_host." [puppet] - https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[19:54:51] SRE, SRE-Access-Requests: Requesting access to deployment for OSleger_WMF - https://phabricator.wikimedia.org/T428262#11990196 (Dzahn)
[19:58:47] (CR) Cathal Mooney: "heh ok.
well look the dashboard is not new it’s been there for a year or something." [alerts] - https://gerrit.wikimedia.org/r/1297747 (owner: Cathal Mooney)
[20:02:14] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1290093 (https://phabricator.wikimedia.org/T107188) (owner: Krinkle)
[20:03:09] (Merged) jenkins-bot: Enable wmgUseUrlShortenerLegacy on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1290093 (https://phabricator.wikimedia.org/T107188) (owner: Krinkle)
[20:10:35] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1290093|Enable wmgUseUrlShortenerLegacy on test2wiki (T107188)]]
[20:10:39] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188
[20:12:41] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1290093|Enable wmgUseUrlShortenerLegacy on test2wiki (T107188)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:16:23] !log krinkle@deploy1003 krinkle: Continuing with deployment
[20:16:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:20:38] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1290093|Enable wmgUseUrlShortenerLegacy on test2wiki (T107188)]] (duration: 10m 02s)
[20:20:42] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188
[20:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:56:11] !log Running `mwscript-k8s extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki="commonswiki" --use-jobqueue --poll-sleep=30 --verbose` (after stopping the other commons scan)
[20:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:01:36] !log Running `mwscript-k8s extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki="commonswiki" --use-jobqueue --poll-sleep=10 --verbose` (after stopping the other commons
scan)
[21:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:08:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:35:02] (CR) Cathal Mooney: [C:+1] Decom cookbook: user Homer (except for VCs) and clean BGP [cookbooks] - https://gerrit.wikimedia.org/r/1275821 (owner: Ayounsi)
[22:06:57] (PS1) DDesouza: miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1298358 (https://phabricator.wikimedia.org/T344471)
[22:07:07] (CR) CI reject: [V:-1] miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1298358 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[22:08:56] (PS2) DDesouza: miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1298358 (https://phabricator.wikimedia.org/T344471)
[22:11:43] (CR) DDesouza: [C:+2] miscweb(design-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1298358 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[22:14:14] (Merged) jenkins-bot: miscweb(design-landing-page): bump version [deployment-charts] -
https://gerrit.wikimedia.org/r/1298358 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[22:15:29] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[22:15:41] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[22:15:43] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[22:15:55] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[22:15:56] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[22:16:11] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[22:21:52] ops-eqsin, SRE, DC-Ops, Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11990446 (BCornwall)
[22:23:14] ops-eqsin, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11990448 (BCornwall) Open→Resolved
[22:55:57] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:02:57] SRE, SRE-Access-Requests: Requesting access to deployment for OSleger_WMF - https://phabricator.wikimedia.org/T428262#11990509 (Dzahn)
[23:04:25] SRE, SRE-Access-Requests: Requesting access to deployment for OSleger_WMF - https://phabricator.wikimedia.org/T428262#11990510 (Dzahn) Thanks all. Almost all boxes are already checked now. Just need to verify the SSH key outside of the ticket. @Aklapper Want to verify if Phab user is linked properly?
[23:05:27] SRE, SRE-Access-Requests: Requesting access to deployment for OSleger_WMF - https://phabricator.wikimedia.org/T428262#11990513 (Dzahn) @OSleger-WMF Could you maybe send a direct email between our official wikimedia inboxes to confirm this is you and the correct key? I am dzahn@ . Cheers
[23:40:02] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1298367
[23:40:02] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1298367 (owner: TrainBranchBot)
[23:43:14] SRE-swift-storage, MediaWiki-Uploading: "Could not read file" error during upload - https://phabricator.wikimedia.org/T428315#11990562 (Pppery)
[23:51:07] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1298367 (owner: TrainBranchBot)