[00:16:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:57] SRE-Access-Requests, fundraising-tech-ops: Fundraising access request for Daniel Miranda - https://phabricator.wikimedia.org/T429276 (dmiranda) NEW
[00:31:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:54:48] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[00:54:53] (CR) Scott French: "Great, that's quite helpful to know as well!" [deployment-charts] - https://gerrit.wikimedia.org/r/1301313 (https://phabricator.wikimedia.org/T427668) (owner: Blake)
[00:55:16] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[01:10:59] (PS1) TrainBranchBot: Branch commit for wmf/1.47.0-wmf.7 [core] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302298 (https://phabricator.wikimedia.org/T423916)
[01:11:02] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.47.0-wmf.7 [core] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302298 (https://phabricator.wikimedia.org/T423916) (owner: TrainBranchBot)
[01:12:02] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1302299
[01:12:02] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1302299 (owner: TrainBranchBot)
[01:20:02] (Merged) jenkins-bot: Branch commit for wmf/1.47.0-wmf.7 [core] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302298 (https://phabricator.wikimedia.org/T423916) (owner: TrainBranchBot)
[01:20:08] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1302299 (owner: TrainBranchBot)
[01:33:03] (PS2) RLazarus: cli: argparse fix for Python 3.14 compatibility [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1300941
[01:48:02] (CR) RLazarus: "Before (Python 3.13):" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1300941 (owner: RLazarus)
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T0200)
[02:07:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:35] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:03] (CR) Scott French: [C:+1] "This seems better to me - i.e., I'm not sure I can think of a case where the nightly group was strictly necessary (the only essential thin" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1300941 (owner: RLazarus)
[02:37:05] (PS3) Krinkle: Disable ShortUrl on hiwiki, hiwikiversity, maiwiki, knwiki, knwikisource, tcywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1302274 (https://phabricator.wikimedia.org/T107188)
[02:52:34] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T0300)
[03:01:40] (PS1) TrainBranchBot: testwikis to 1.47.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1302319 (https://phabricator.wikimedia.org/T423916)
[03:01:43] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1302319 (https://phabricator.wikimedia.org/T423916) (owner: TrainBranchBot)
[03:07:29] (Merged) jenkins-bot: testwikis to 1.47.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1302319 (https://phabricator.wikimedia.org/T423916) (owner: TrainBranchBot)
[03:07:37] (PS1) DLynch: EditChecks: Namespace tracking object for seen/shown/used checks [extensions/VisualEditor] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302320
[03:12:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:27:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T0400)
[04:05:30] !log mwpresync@deploy1003 Pruned MediaWiki: 1.47.0-wmf.4 (duration: 05m 29s)
[04:11:02] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12021945 (Papaul) @ayounsi @cmooney i was looking at moving the mgmt interface to irb.900 and I noticed on all the mr's there is the default DHCP...
[04:12:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:37:06] (PS1) Papaul: Add interface irb.900 to security zone mgmt [homer/public] - https://gerrit.wikimedia.org/r/1302337 (https://phabricator.wikimedia.org/T421674)
[04:40:00] (CR) Papaul: "Please just review I will merge once i am done with the netbox and private repo changes. thanks" [homer/public] - https://gerrit.wikimedia.org/r/1302337 (https://phabricator.wikimedia.org/T421674) (owner: Papaul)
[04:54:25] sre-alert-triage, Data-Platform-SRE (2026-06-05 - 2026-06-26), Patch-For-Review: Alert in need of triage: ResourceQuotaMemoryLimitsWarning - https://phabricator.wikimedia.org/T426589#12021988 (RKemper) Keep forgetting to deploy this deployment-charts patch; will deploy tomorrow
[04:54:48] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:55:01] !log T427951 Deleted 4 leftover mirrored dev/test topics from kafka-test: `eqiad.mediawiki.{page_html_content_change.dev{1,4},page_edit_type_simple.dev0}`, `eqiad.mw_page_edit_type_enrich.error`
[04:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:05] T427951: Delete some unused development topics on Kafka Jumbo - https://phabricator.wikimedia.org/T427951
[04:55:16] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[05:00:26] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12021991 (Papaul)
[05:37:35] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#12022005 (ayounsi) Indeed, we can remove that DHCP pool.
[05:41:42] (CR) Ayounsi: [C:+1] Add interface irb.900 to security zone mgmt [homer/public] - https://gerrit.wikimedia.org/r/1302337 (https://phabricator.wikimedia.org/T421674) (owner: Papaul)
[05:41:56] SRE, observability, SRE Observability: Alerts showing "AlertLintProblem" - MySQLReplicaNotUsingGTID - https://phabricator.wikimedia.org/T427469#12022010 (Marostegui) Thanks for taking such a detailed look! >>! In T427469#12008708, @tappof wrote: > > The really interesting question at this point is...
[05:42:37] (PS2) WMDE-Fisch: Update VE core submodule to master (3e79e9934) [extensions/VisualEditor] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1302170 (https://phabricator.wikimedia.org/T397319)
[05:42:53] ops-codfw, SRE, DBA, DC-Ops: es2045 down - https://phabricator.wikimedia.org/T429113#12022015 (Marostegui) Thanks @Jhancock.wm I will start the cloning and repooling. Thank you!
[05:44:27] (PS1) WMDE-Fisch: Improve click intent event logging and exposure tracking [extensions/Cite] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1302629 (https://phabricator.wikimedia.org/T426974)
[05:47:23] Damn. I needed to update a cherry pick to wmf/1.47.0-wmf.6 and now I accidentally pushed it to that branch instead of adding a new patch set. Now I wanted to deploy that in the next window but obviously can't schedule merged patches.... https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1302170
[05:47:54] Anyone any ideas how to fix that best new?
[05:48:02] *now
[05:52:32] (CR) WMDE-Fisch: "recheck" [extensions/VisualEditor] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1302170 (https://phabricator.wikimedia.org/T397319) (owner: WMDE-Fisch)
[05:58:15] ( I could "just" try to scap backport on deployment now or during the window )
[05:58:47] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[05:59:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1047: Upgrading es1047.eqiad.wmnet
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T0600)
[06:00:05] marostegui, Amir1, and federico3: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T0600). Please do the needful.
[06:01:11] SRE, homer, Infrastructure-Foundations, netops, Patch-For-Review: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12022047 (ayounsi) yeah me neither, it's a tradeoff (engineering time/impact) that I think is accepta...
[06:03:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Cite] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1302629 (https://phabricator.wikimedia.org/T426974) (owner: WMDE-Fisch)
[06:24:34] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) depool es1047: Upgrading es1047.eqiad.wmnet
[06:24:34] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[06:24:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[06:24:47] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99)
[06:25:35] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1047.eqiad.wmnet with OS trixie
[06:26:26] ops-codfw, DC-Ops: Unresponsive management for wikikube-ctrl2006.mgmt:22 - https://phabricator.wikimedia.org/T429283 (phaultfinder) NEW
[06:30:20] (PS1) Kevin Bazira: ml-services: deploy cope-b-a4b isvc that was migrated from HF transformers to vLLM 0.22.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1302641 (https://phabricator.wikimedia.org/T427497)
[06:34:25] SRE, Infrastructure-Foundations, Traffic: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12022111 (ayounsi) Thanks for the great writeup. For the advantages and drawbacks you listed, my preference would go to (3) liberica/pybal. To me the ru...
[06:40:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1047.eqiad.wmnet with reason: host reimage
[06:41:53] (PS3) Ryan Kemper: cirrussearch: Add minimal opensearch config for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/1302280 (https://phabricator.wikimedia.org/T425585) (owner: Bking)
[06:42:36] (CR) Ryan Kemper: [C:+1] "Looks good (fixed some inconsequential typos in PS3)" [puppet] - https://gerrit.wikimedia.org/r/1302280 (https://phabricator.wikimedia.org/T425585) (owner: Bking)
[06:43:28] SRE, Infrastructure-Foundations, Traffic: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12022117 (MoritzMuehlenhoff) Let's wait until Liberica is available and then go with 3.
[06:44:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1047.eqiad.wmnet with reason: host reimage
[06:47:38] (CR) Muehlenhoff: [C:+2] mx-out: Enable profile::auto_restarts::service for Dovecot [puppet] - https://gerrit.wikimedia.org/r/1300804 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[06:50:35] !log hashar@deploy1003 Started deploy [integration/docroot@2165507]: build: Updating js-yaml to 4.2.0
[06:50:51] !log hashar@deploy1003 Finished deploy [integration/docroot@2165507]: build: Updating js-yaml to 4.2.0 (duration: 00m 16s)
[06:51:59] SRE, observability, SRE Observability: Alerts showing "AlertLintProblem" - MySQLReplicaNotUsingGTID - https://phabricator.wikimedia.org/T427469#12022131 (tappof) Yeah, you're correct. While taking a look, I didn't notice the "no" here (I only focused on "off" and "disabled"... my bad): ` case "no",...
[06:52:34] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:53:03] (CR) Muehlenhoff: [C:+2] Blocklisting more unused packet mangling/network scheduler modules [puppet] - https://gerrit.wikimedia.org/r/1301368 (owner: Muehlenhoff)
[06:57:06] SRE, observability, SRE Observability: Alerts showing "AlertLintProblem" - MySQLReplicaNotUsingGTID - https://phabricator.wikimedia.org/T427469#12022152 (Marostegui) I will prepare a patch, thanks!
[07:00:04] Amir1, urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T0700).
[07:00:04] WMDE-Fisch: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:09] I'll self-serve \o
[07:03:27] WMDE-Fisch: I'll have a private code deployment to do. Could you please ping me, when you're done with your backports?
[07:03:48] Sure
[07:05:39] I currently have unexpected commits from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1302319
[07:07:12] jeena dduvall ? ^
[07:07:21] (CR) Slyngshede: [C:+1] admin: Add echukwukere to analytics-private-data [puppet] - https://gerrit.wikimedia.org/r/1302270 (https://phabricator.wikimedia.org/T428827) (owner: BCornwall)
[07:07:32] I see there's uncommitted code in PrivateSettings.php (which was deployed yesterday). Not sure if that's what scap complains about, but I can commit it
[07:08:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1047.eqiad.wmnet with OS trixie
[07:08:18] (CR) Muehlenhoff: "On a side note: I think on Trixie we're actually really close to just build Homer from Python packages as provided by Debian; junos-eznc i" [software/homer/deploy] - https://gerrit.wikimedia.org/r/1302127 (owner: Ayounsi)
[07:08:19] (CR) Slyngshede: [C:+1] admin: Add mfossati to ml-lab-users [puppet] - https://gerrit.wikimedia.org/r/1302272 (https://phabricator.wikimedia.org/T429148) (owner: BCornwall)
[07:09:57] I've committed these leftover changes in PrivateSettings.php
[07:11:52] WMDE-Fisch: I think the automated deploy to testwikis failed because the branch didn't merge yet so it's okay to deploy those commits from that change
[07:12:52] jeena: So I would deploy the with my change?
[07:13:04] yeah if you continue it should be fine
[07:13:10] Okay :-)
[07:14:02] !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1302170|Update VE core submodule to master (3e79e9934) (T397319 T428764)]]
[07:14:08] T397319: VisualDiff: Subref footnote markers show the wrong number - https://phabricator.wikimedia.org/T397319
[07:14:09] T428764: VE assigns the wrong reference name when re-using a reference in an article that already uses sub-references with VE auto ref names - https://phabricator.wikimedia.org/T428764
[07:19:24] SRE-swift-storage, Commons: Compressing TIFF files from the Library of Congress - https://phabricator.wikimedia.org/T429264#12022228 (Aklapper)
[07:24:09] (CR) Ilias Sarantopoulos: [C:+1] ml-services: deploy cope-b-a4b isvc that was migrated from HF transformers to vLLM 0.22.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1302641 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[07:24:55] (CR) Ilias Sarantopoulos: ml-services: deploy cope-b-a4b isvc that was migrated from HF transformers to vLLM 0.22.1 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1302641 (https://phabricator.wikimedia.org/T427497) (owner: Kevin Bazira)
[07:27:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:29:06] (CR) Brouberol: [C:+1] Deploy the new version of the ceph-csi plugin [deployment-charts] - https://gerrit.wikimedia.org/r/1302177 (https://phabricator.wikimedia.org/T428385) (owner: Btullis)
[07:30:40] SRE-swift-storage, Commons: Compressing TIFF files from the Library of Congress - https://phabricator.wikimedia.org/T429264#12022233 (MatthewVernon) Likely so, yes. The issue (as we'll have to get to with T427949) is to avoid ending up storing the uncompressed versions forever in swift (in deleted items)...
[07:33:44] !log wmde-fisch@deploy1003 wmde-fisch: Backport for [[gerrit:1302170|Update VE core submodule to master (3e79e9934) (T397319 T428764)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:33:50] T397319: VisualDiff: Subref footnote markers show the wrong number - https://phabricator.wikimedia.org/T397319
[07:33:51] T428764: VE assigns the wrong reference name when re-using a reference in an article that already uses sub-references with VE auto ref names - https://phabricator.wikimedia.org/T428764
[07:33:51] Testing
[07:36:08] !log wmde-fisch@deploy1003 wmde-fisch: Continuing with deployment
[07:44:07] Seems I can't do the 2nd backport in that window .... :-(
[07:45:26] In terms of time? AFAIK, there's 2 hours of unallocated time afterwards, so maybe nobody's going to complain
[07:45:38] Right :-)
[07:47:03] (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1302262 (https://phabricator.wikimedia.org/T429129) (owner: JHathaway)
[07:49:29] (PS1) Brouberol: airflow-fr-tech: configure s3_fr_tech connection [deployment-charts] - https://gerrit.wikimedia.org/r/1302668 (https://phabricator.wikimedia.org/T429048)
[07:50:16] !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302170|Update VE core submodule to master (3e79e9934) (T397319 T428764)]] (duration: 36m 13s)
[07:50:22] T397319: VisualDiff: Subref footnote markers show the wrong number - https://phabricator.wikimedia.org/T397319
[07:50:22] T428764: VE assigns the wrong reference name when re-using a reference in an article that already uses sub-references with VE auto ref names - https://phabricator.wikimedia.org/T428764
[07:50:59] (PS2) Brouberol: airflow-fr-tech: configure s3_fr_tech connection [deployment-charts] - https://gerrit.wikimedia.org/r/1302668 (https://phabricator.wikimedia.org/T429048)
[07:51:57] So I'll continue with another backport now
[07:52:01] (CR) TrainBranchBot: [C:+2] "Approved by wmde-fisch@deploy1003 using scap backport" [extensions/Cite] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1302629 (https://phabricator.wikimedia.org/T426974) (owner: WMDE-Fisch)
[07:53:16] (Merged) jenkins-bot: Improve click intent event logging and exposure tracking [extensions/Cite] (wmf/1.47.0-wmf.6) - https://gerrit.wikimedia.org/r/1302629 (https://phabricator.wikimedia.org/T426974) (owner: WMDE-Fisch)
[07:54:00] !log wmde-fisch@deploy1003 Started scap sync-world: Backport for [[gerrit:1302629|Improve click intent event logging and exposure tracking]]
[07:54:04] (CR) Slyngshede: [C:+2] P:idp allow services to require MFA [puppet] - https://gerrit.wikimedia.org/r/1302126 (owner: Slyngshede)
[07:58:01] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1047: repool after upgrade
[07:58:16] !log wmde-fisch@deploy1003 wmde-fisch: Backport for [[gerrit:1302629|Improve click intent event logging and exposure tracking]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:58:41] !log wmde-fisch@deploy1003 wmde-fisch: Continuing with deployment
[07:59:33] (PS2) Fabfur: cache::haproxy: remove x-provenance feature flag [puppet] - https://gerrit.wikimedia.org/r/1301429 (https://phabricator.wikimedia.org/T427068)
[07:59:52] (PS1) Marostegui: mysql-gtid.yaml: Add pint [alerts] - https://gerrit.wikimedia.org/r/1302724 (https://phabricator.wikimedia.org/T427469)
[08:00:34] (CR) Fabfur: "Thanks for the review, I added the check on deployment-prep, in case it fails I can add this variable as empty so it should be NOOP even t" [puppet] - https://gerrit.wikimedia.org/r/1301429 (https://phabricator.wikimedia.org/T427068) (owner: Fabfur)
[08:00:41] !log update bird on ganeti7001 to 2.18.2-1~wmf12u1
[08:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:56] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1301429 (https://phabricator.wikimedia.org/T427068) (owner: Fabfur)
[08:04:21] (CR) Muehlenhoff: [C:+2] Add cumin2003 as additional git peer for Homer [puppet] - https://gerrit.wikimedia.org/r/1301330 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[08:05:31] !log wmde-fisch@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302629|Improve click intent event logging and exposure tracking]] (duration: 11m 31s)
[08:08:08] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1302299 (owner: TrainBranchBot)
[08:08:31] WMDE-Fisch: Can I proceed with my deployment?
[08:08:45] Yes thanks!
[08:08:48] Msz2001:
[08:08:55] (CR) Fabfur: cache::haproxy: req.provenance to txn.provenance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1301431 (https://phabricator.wikimedia.org/T427068) (owner: Fabfur)
[08:09:01] Okay, I'll do that shortly
[08:09:47] (PS3) Urbanecm: [Growth] wikidatawiki: Enable Growth features [mediawiki-config] - https://gerrit.wikimedia.org/r/1297161 (https://phabricator.wikimedia.org/T418115)
[08:10:53] (CR) Urbanecm: [Growth] wikidatawiki: Enable Growth features [mediawiki-config] - https://gerrit.wikimedia.org/r/1297161 (https://phabricator.wikimedia.org/T418115) (owner: Urbanecm)
[08:13:14] (CR) Filippo Giunchedi: [C:+2] prometheus: remove 'key' label from neutron_netns metrics [puppet] - https://gerrit.wikimedia.org/r/1302164 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi)
[08:13:23] (PS2) Filippo Giunchedi: prometheus: remove 'key' label from neutron_netns metrics [puppet] - https://gerrit.wikimedia.org/r/1302164 (https://phabricator.wikimedia.org/T328502)
[08:17:00] (CR) Urbanecm: [C:-2] "should be only deployed with wmf.7 on all wikis (so when wmf.8 starts its journey)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1301349 (https://phabricator.wikimedia.org/T426742) (owner: Sergio Gimeno)
[08:17:28] !log atsuko@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist extensions/Translate/scripts/ttmserver-export.php --ttmserver codfw-k8s # T425377: populating translation memory (ttmserver-export.php) on codfw-k8s (dblist: https://phabricator.wikimedia.org/P94157)
[08:17:33] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377
[08:17:59] (CR) Filippo Giunchedi: [C:+2] prometheus: remove 'key' label from neutron_netns metrics [puppet] - https://gerrit.wikimedia.org/r/1302164 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi)
[08:19:50] !log mszwarc@deploy1003 Synchronized private/SuggestedInvestigationsSignals: Private code deployment for Suggested Investigations (duration: 06m 03s)
[08:20:29] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1302732
[08:20:29] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1302732 (owner: TrainBranchBot)
[08:21:00] Msz2001: Can you ping me when done?
[08:21:07] I'd like to use scap :D
[08:21:11] Sure, one moment
[08:21:11] jouncebot: nowandnext
[08:21:11] No deployments scheduled for the next 1 hour(s) and 38 minute(s)
[08:21:11] In 1 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1000)
[08:22:13] (CR) Filippo Giunchedi: team-wmcs: introduce per-namespace neutron conntrack alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1302151 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi)
[08:23:09] !log mszwarc@deploy1003 Synchronized private/PrivateSettings.php: Private code deployment for Suggested Investigations (duration: 02m 23s)
[08:23:43] Dreamy_Jazz: Okay, I'm done
[08:24:57] (CR) Filippo Giunchedi: openstack: deprecate ensure_running_kvm_instances check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1302114 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi)
[08:25:14] (PS1) Dreamy Jazz: hCaptcha: Enable for MobileFrontend in all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1302735 (https://phabricator.wikimedia.org/T425940)
[08:26:01] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1302735 (https://phabricator.wikimedia.org/T425940) (owner: Dreamy Jazz)
[08:26:38] (CR) Filippo Giunchedi: [C:+2] toolforge: remove toolschecker from legacy redirector [puppet] - https://gerrit.wikimedia.org/r/1302110 (https://phabricator.wikimedia.org/T313030) (owner: Filippo Giunchedi)
[08:26:56] (Merged) jenkins-bot: hCaptcha: Enable for MobileFrontend in all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1302735 (https://phabricator.wikimedia.org/T425940) (owner: Dreamy Jazz)
[08:27:21] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1302735|hCaptcha: Enable for MobileFrontend in all wikis (T425940)]]
[08:27:25] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940
[08:28:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1302732 (owner: TrainBranchBot)
[08:29:18] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1302735|hCaptcha: Enable for MobileFrontend in all wikis (T425940)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:29:57] (CR) Filippo Giunchedi: [C:+2] openstack: deprecate ensure_running_kvm_instances check [puppet] - https://gerrit.wikimedia.org/r/1302114 (https://phabricator.wikimedia.org/T328502) (owner: Filippo Giunchedi)
[08:32:54] !log installing nginx security updates
[08:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:14] Still testing mine
[08:34:59] (PS3) Arnaudb: gitlab: add a hiera key for broadcast_message banner [puppet] - https://gerrit.wikimedia.org/r/1302733 (https://phabricator.wikimedia.org/T425441)
[08:35:11] (PS3) Arnaudb: gitlab: announce the SSH hostname migration via banner [puppet] - https://gerrit.wikimedia.org/r/1302734 (https://phabricator.wikimedia.org/T425441)
[08:39:11] (PS1) Brouberol: data-platform: add alert on kafka-jumbo partition sizes [alerts] - https://gerrit.wikimedia.org/r/1302737 (https://phabricator.wikimedia.org/T429127)
[08:39:48] (PS1) Dreamy Jazz: Revert "hCaptcha: Enable for MobileFrontend in all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1302738
[08:40:42] So the editor didn't open when on the debug servers consistently
[08:40:48] (PS1) Abijeet Patro: ULS rewrite: Lock body scroll when open on mobile [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302739
[08:41:17] (CR) CI reject: [V:-1] data-platform: add alert on kafka-jumbo partition sizes [alerts] - https://gerrit.wikimedia.org/r/1302737 (https://phabricator.wikimedia.org/T429127) (owner: Brouberol)
[08:41:56] (CR) Joal: [C:+1] data-platform: add alert on kafka-jumbo partition sizes [alerts] - https://gerrit.wikimedia.org/r/1302737 (https://phabricator.wikimedia.org/T429127) (owner: Brouberol)
[08:42:26] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment
[08:42:40] I'm going to assume this was an issue with the debug servers only
[08:42:42] I have a revert ready
[08:43:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1047: repool after upgrade
[08:43:30] (PS2) Brouberol: data-platform: add alert on kafka-jumbo partition sizes [alerts] - https://gerrit.wikimedia.org/r/1302737 (https://phabricator.wikimedia.org/T429127)
[08:45:36] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade
[08:45:53] (PS2) Abijeet Patro: ULS rewrite: Lock body scroll when open on mobile [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302739
[08:45:57] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1036: Upgrading es1036.eqiad.wmnet
[08:46:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302739 (owner: Abijeet Patro)
[08:46:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302739 (owner: Abijeet Patro)
[08:46:44] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302735|hCaptcha: Enable for MobileFrontend in all wikis (T425940)]] (duration: 19m 23s)
[08:46:48] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940
[08:47:17] (CR) Ayounsi: [C:+1] Allow cumin2003 in IRC notifications [puppet] - https://gerrit.wikimedia.org/r/1302147 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[08:47:18] (PS1) Abijeet Patro: ULS rewrite: Fix settings dialog width and field sizing [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302743 (https://phabricator.wikimedia.org/T416512)
[08:47:37] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/UniversalLanguageSelector] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1302743 (https://phabricator.wikimedia.org/T416512) (owner: Abijeet Patro)
[08:47:47] (PS2) Effie Mouzeli: site.pp: remove retired redis hosts [puppet] - https://gerrit.wikimedia.org/r/1300761 (https://phabricator.wikimedia.org/T428858)
[08:47:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1036: Upgrading es1036.eqiad.wmnet
[08:48:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1036.eqiad.wmnet with OS trixie
[08:49:01] (PS3) Effie Mouzeli: site.pp: remove retired redis hosts [puppet] - https://gerrit.wikimedia.org/r/1300761 (https://phabricator.wikimedia.org/T428858)
[08:51:07] Dreamy_Jazz: are you done?
:) [08:51:26] Maybe, it seems some caches need clearing but they might be clearing themselves [08:51:47] (I was not seeing the expected code being served on non-debug servers until several hard refreshes later) [08:52:03] i see [08:52:04] i can wait :) [08:53:00] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: decommission rdb2007-rdb2010.codfw.wmnet - https://phabricator.wikimedia.org/T428561#12022501 (10jijiki) 05In progress→03Open a:05jijiki→03None [08:53:08] (03PS1) 10CWilliams: Cookbook sre.mysql.upgrade should not accept multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) [08:53:50] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: decommission rdb101[12].eqiad.wmnet - https://phabricator.wikimedia.org/T428858#12022513 (10jijiki) [08:54:19] urbanecm: You can go ahead, it seems stable [08:54:48] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:55:01] (03PS1) 10Marostegui: pc1022: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1302747 [08:55:16] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [08:56:17] (03CR) 10CI reject: [V:04-1] Cookbook sre.mysql.upgrade should not accept multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: 10CWilliams) [08:56:24] !log uploaded bird 2.18.2-1~wmf12u1 to bookworm-wikimedia T429285 [08:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1154:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1154 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:57:32] ty Dreamy_Jazz [08:57:41] (03CR) 10Urbanecm: [C:03+2] [Growth] wikidatawiki: Enable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297161 (https://phabricator.wikimedia.org/T418115) (owner: 10Urbanecm) [08:59:26] (03PS2) 10Kevin Bazira: ml-services: deploy cope-b-a4b isvc that was migrated from HF transformers to vLLM 0.22.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302641 (https://phabricator.wikimedia.org/T427497) [09:00:23] (03Merged) 10jenkins-bot: [Growth] wikidatawiki: Enable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297161 (https://phabricator.wikimedia.org/T418115) (owner: 10Urbanecm) [09:00:40] !log urbanecm@deploy1003 mwscript-k8s job started: foreachwikiindblist wikidata WikimediaMaintenance:createExtensionTables.php GrowthExperiments # T418115 [09:00:44] T418115: Configure newcomer dashboard for Wikidata (per community consensus) - https://phabricator.wikimedia.org/T418115 [09:01:02] !log uploaded bird 2.18.2-1~wmf13u1 to trixie-wikimedia T429285 [09:01:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:53] and...scap pulls in wmf.7. gonna be a long sync :-/ [09:02:44] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1297161|[Growth] wikidatawiki: Enable Growth features (T418115)]] [09:04:02] (03CR) 10Fabfur: "Double checked and confirmed that x-provenance is set to true (with all that entails) also on cache hosts of deployment-prep. In this case" [puppet] - 10https://gerrit.wikimedia.org/r/1301429 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [09:04:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1036.eqiad.wmnet with reason: host reimage [09:04:48] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1297161|[Growth] wikidatawiki: Enable Growth features (T418115)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:06:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:06:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:07:02] !log tappof@cumin1003 START - Cookbook sre.metamonitoring.downtime Downtime for 0:05:00 of prometheus/deadmanswitchnotified, prometheus/deadmanswitchonamdb, prometheus/extmon on 2 host(s) with reason: cookbook test [09:07:03] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [09:07:10] !log tappof@cumin1003 END (PASS) - Cookbook sre.metamonitoring.downtime (exit_code=0) Downtime for 0:05:00 of prometheus/deadmanswitchnotified, prometheus/deadmanswitchonamdb, prometheus/extmon on 2 host(s) with reason: cookbook test [09:07:58] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [09:08:43] Fatal exception of type "Wikimedia\Rdbms\DBQueryError", not what i wanted to see... [09:08:52] (03CR) 10Jelto: [C:03+1] "technically this looks good to me. 
For the rollout tomorrow I have two thoughts:" [puppet] - 10https://gerrit.wikimedia.org/r/1300763 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [09:09:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1036.eqiad.wmnet with reason: host reimage [09:09:11] ...did my createExtensionTables not go through? [09:10:34] of course, because virtual-growthexperiments is not yet defined [09:10:36] (03CR) 10Marostegui: [C:03+2] pc1022: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1302747 (owner: 10Marostegui) [09:10:41] (03CR) 10JMeybohm: [V:03+2 C:03+2] Add 1.29.4, drop 1.15.7 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/1300138 (https://phabricator.wikimedia.org/T427401) (owner: 10JMeybohm) [09:10:55] which also means i silently created those DBs on main [09:10:56] grr [09:11:36] (03PS3) 10Brouberol: data-platform: add alert on kafka-jumbo partition sizes [alerts] - 10https://gerrit.wikimedia.org/r/1302737 (https://phabricator.wikimedia.org/T429127) [09:11:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1154:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1154 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:12:32] (03PS1) 10Filippo Giunchedi: openstack: deprecate icinga check-flavor_aggregates [puppet] - 10https://gerrit.wikimedia.org/r/1302748 (https://phabricator.wikimedia.org/T328502) [09:12:34] (03PS1) 10Filippo Giunchedi: openstack: deprecate check-cinder-snapshot-leaks [puppet] - 10https://gerrit.wikimedia.org/r/1302749 (https://phabricator.wikimedia.org/T328502) [09:13:55] !log php multiversion/MWScript.php WikimediaMaintenance:createExtensionTables.php --wiki={testwikidatawiki,wikidatawiki} growthexperiments # T418115, within mw-debug [09:13:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:59] T418115: Configure newcomer dashboard for Wikidata (per community consensus) - https://phabricator.wikimedia.org/T418115 [09:14:53] ok, no DB error now [09:14:56] !log urbanecm@deploy1003 urbanecm: Continuing with deployment [09:18:22] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:19:13] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297161|[Growth] wikidatawiki: Enable Growth features (T418115)]] (duration: 16m 29s) [09:19:17] T418115: Configure newcomer dashboard for Wikidata (per community consensus) - https://phabricator.wikimedia.org/T418115 [09:19:56] (03CR) 10Marostegui: Cookbook sre.mysql.upgrade should not accept multiple hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: 10CWilliams) [09:20:10] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:21:50] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:23:36] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:24:11] (03PS1) 10Volans: local CI: force docker arch on macos [puppet] - 10https://gerrit.wikimedia.org/r/1302758 [09:24:14] (03CR) 10Btullis: Presto memory tuning, resource groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [09:24:44] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [09:25:37] (03CR) 10Btullis: Presto memory tuning, resource groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [09:25:38] (03PS2) 10Volans: local CI: force docker arch on macos [puppet] - 10https://gerrit.wikimedia.org/r/1302758 [09:26:24] !log trueg@deploy1003 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [09:26:26] (03CR) 10Gmodena: dse-k8s-services: WDQS deployment helmfile values (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [09:26:31] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1036.eqiad.wmnet with OS trixie [09:26:42] (03PS3) 10Volans: local CI: force docker arch on linux [puppet] - 10https://gerrit.wikimedia.org/r/1302758 [09:27:08] (03PS3) 10Arnaudb: ssh-client-config: use wmf-prod known_hosts for gitlab [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1302759 (https://phabricator.wikimedia.org/T425441) [09:28:08] (03CR) 10Arnaudb: "good catch! thanks for spotting it. I've updated wmf-laptop in 1302759 to use the proper known hosts file." [puppet] - 10https://gerrit.wikimedia.org/r/1300763 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [09:28:11] jouncebot: nowandnext [09:28:11] No deployments scheduled for the next 0 hour(s) and 31 minute(s) [09:28:11] In 0 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1000) [09:28:24] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1302760 (https://phabricator.wikimedia.org/T429303) [09:28:28] (03Abandoned) 10Dreamy Jazz: Revert "hCaptcha: Enable for MobileFrontend in all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302738 (owner: 10Dreamy Jazz) [09:29:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Primary switchover es6 T429303 [09:29:28] T429303: Switchover es6 master (es2035 -> es2037) - https://phabricator.wikimedia.org/T429303 [09:29:32] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [09:29:38] !log marostegui@cumin1003 
dbctl commit (dc=all): 'Set es2037 with weight 0 T429303', diff saved to https://phabricator.wikimedia.org/P94162 and previous config saved to /var/cache/conftool/dbconfig/20260616-092937-marostegui.json [09:30:03] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [09:30:04] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [09:30:24] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es2037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1302760 (https://phabricator.wikimedia.org/T429303) (owner: 10Gerrit maintenance bot) [09:30:24] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for UploadWizard on all wikis with it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302762 (https://phabricator.wikimedia.org/T426126) [09:30:34] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [09:30:48] (03PS2) 10Dreamy Jazz: hCaptcha: Enable for UploadWizard on all wikis with it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302762 (https://phabricator.wikimedia.org/T426126) [09:30:50] !log Starting es6 codfw failover from es2035 to es2037 - T429303 [09:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:21] !log imported istioctl 1.29.4-1 to bookworm-/trixie-wikimedia - T427401 [09:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:25] T427401: Update istio to 1.29 - https://phabricator.wikimedia.org/T427401 [09:31:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es2037 to es6 primary T429303', diff saved to https://phabricator.wikimedia.org/P94163 and previous config saved to /var/cache/conftool/dbconfig/20260616-093149-marostegui.json [09:32:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302762 (https://phabricator.wikimedia.org/T426126) (owner: 10Dreamy Jazz) 
[09:32:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es2035 T429303', diff saved to https://phabricator.wikimedia.org/P94164 and previous config saved to /var/cache/conftool/dbconfig/20260616-093247-marostegui.json [09:33:42] (03PS1) 10Filippo Giunchedi: openstack: deprecate icinga check-neutron-conntrack [puppet] - 10https://gerrit.wikimedia.org/r/1302764 (https://phabricator.wikimedia.org/T328502) [09:34:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:34:25] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2035: Upgrading es2035.codfw.wmnet [09:34:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2035: Upgrading es2035.codfw.wmnet [09:34:56] (03CR) 10Gmodena: Added DNS entries for the new WDQS 2 deployments in DSE K8s. (039 comments) [dns] - 10https://gerrit.wikimedia.org/r/1301301 (https://phabricator.wikimedia.org/T428925) (owner: 10Trueg) [09:35:08] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [09:35:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299626 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [09:35:31] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2035.codfw.wmnet with OS trixie [09:37:01] !log urbanecm@deploy1003 mwscript-k8s job started: extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=6hour --verbose --use-job-queue # T418115 [09:37:08] T418115: Configure newcomer dashboard for Wikidata (per community consensus) - https://phabricator.wikimedia.org/T418115 [09:37:12] (03Merged) 10jenkins-bot: hCaptcha: 
Enable for UploadWizard on all wikis with it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302762 (https://phabricator.wikimedia.org/T426126) (owner: 10Dreamy Jazz) [09:37:13] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1036: Migration of es1036.eqiad.wmnet completed [09:37:37] !log urbanecm@deploy1003 mwscript-k8s job started: extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --wiki=wikidatawiki --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=6hour --verbose --use-job-queue # T418115 [09:37:38] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1302762|hCaptcha: Enable for UploadWizard on all wikis with it (T426126)]] [09:37:44] T426126: Implement CAPTCHA support in the Upload Wizard - https://phabricator.wikimedia.org/T426126 [09:37:58] !log urbanecm@deploy1003 mwscript-k8s job started: extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --wiki=wikidatawiki --registeredWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=6hour --verbose --use-job-queue # T418115 [09:38:22] (03CR) 10Filippo Giunchedi: [C:03+2] team-wmcs: do not pint-warn on NeutronAgentAdminDown [alerts] - 10https://gerrit.wikimedia.org/r/1302150 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [09:39:37] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1302762|hCaptcha: Enable for UploadWizard on all wikis with it (T426126)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:42:59] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [09:43:24] !log Drop wrongly created tables on testwikidatawiki s3 master T429304 [09:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:29] T429304: Drop GrowthExperiments DB tables from *main cluster* - https://phabricator.wikimedia.org/T429304 [09:44:41] marostegui: thank you ❤️ for your quick assistance, as always [09:44:55] urbanecm: no worries! let me know if you locate the other tables, but they aren't on s8 master from what i can see [09:46:08] (03CR) 10JMeybohm: data-platform: add alert on kafka-jumbo partition sizes (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1302737 (https://phabricator.wikimedia.org/T429127) (owner: 10Brouberol) [09:47:17] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302762|hCaptcha: Enable for UploadWizard on all wikis with it (T426126)]] (duration: 09m 38s) [09:47:21] T426126: Implement CAPTCHA support in the Upload Wizard - https://phabricator.wikimedia.org/T426126 [09:47:39] FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:43] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:48:59] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [09:49:39] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [09:52:29] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2035.codfw.wmnet with reason: host reimage [09:59:39] !log marostegui@cumin1003 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 2:00:00 on es2035.codfw.wmnet with reason: host reimage [09:59:55] marostegui: i located them. somehow, i did something even worse... i created `wikidatawiki` on `s3` [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1000) [10:02:01] (03PS1) 10Atsuko: translate: remove CirrusSearch endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302197 (https://phabricator.wikimedia.org/T425377) [10:03:15] urbanecm: the database is in s3 with only those tables [10:03:25] urbanecm: you also created the db there? [10:03:46] urbanecm: https://phabricator.wikimedia.org/P94167 [10:03:50] (03CR) 10DCausse: [C:03+1] translate: remove CirrusSearch endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302197 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [10:04:24] marostegui: without intending to, but yes [10:04:37] urbanecm: should I drop it entirely? [10:04:41] yes please [10:04:47] the db included? [10:04:50] yes [10:05:03] (03CR) 10Clément Goubert: "Yup good catch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1302106 (https://phabricator.wikimedia.org/T418492) (owner: 10Clément Goubert) [10:05:08] ok, let me do safety checks urbanecm [10:06:13] (03PS2) 10Clément Goubert: redirects.dat: Funnel api.w.o to mw.o/wiki/Wikimedia_APIs [puppet] - 10https://gerrit.wikimedia.org/r/1302106 (https://phabricator.wikimedia.org/T418492) [10:06:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302197 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [10:07:15] (03CR) 10Kamila Součková: [C:03+2] shellbox-score: increase CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300866 (https://phabricator.wikimedia.org/T428904) (owner: 10Kamila Součková) [10:09:37] (03PS21) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [10:09:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:11:52] (03CR) 10Trueg: dse-k8s-services: WDQS deployment helmfile values (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [10:12:06] (03CR) 10CI reject: [V:04-1] dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) (owner: 10Trueg) [10:13:57] (03PS22) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [10:15:16] urbanecm: done [10:16:08] 
(03Merged) 10jenkins-bot: shellbox-score: increase CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300866 (https://phabricator.wikimedia.org/T428904) (owner: 10Kamila Součková) [10:16:35] thanks! ❤️ [10:17:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2035.codfw.wmnet with OS trixie [10:17:16] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [10:18:15] !log kamila@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [10:18:27] !log kamila@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [10:18:29] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [10:18:38] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [10:19:41] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [10:20:43] (03CR) 10Muehlenhoff: [C:03+2] Depool puppetserver2002 for rack maintenance [dns] - 10https://gerrit.wikimedia.org/r/1300766 (https://phabricator.wikimedia.org/T428020) (owner: 10Muehlenhoff) [10:20:43] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [10:20:55] !log jmm@dns1004 START - running authdns-update [10:21:08] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [10:21:09] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [10:21:36] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [10:22:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:22:37] !log jmm@dns1004 END - running authdns-update [10:22:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1036: Migration of es1036.eqiad.wmnet completed [10:22:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [10:24:06] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1154.eqiad.wmnet [10:24:07] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1154.eqiad.wmnet [10:24:11] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1155.eqiad.wmnet [10:24:11] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1155.eqiad.wmnet [10:24:18] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for 11 hosts [10:24:24] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 11 hosts [10:24:40] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for an-redacteddb1001.eqiad.wmnet [10:24:41] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-redacteddb1001.eqiad.wmnet [10:24:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2035: Migration of es2035.codfw.wmnet completed [10:24:47] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [10:25:32] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [10:29:58] (03PS1) 10Clément Goubert: tls_terminator: Convert size to kB for rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1302772 (https://phabricator.wikimedia.org/T414440) [10:33:29] (03PS1) 10Elukey: docker_registry: add comment and small tweaks to the nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1302774 (https://phabricator.wikimedia.org/T427175) [10:34:01] (03PS2) 10Elukey: docker_registry: add comment and small tweaks to the nginx config [puppet] - 
10https://gerrit.wikimedia.org/r/1302774 (https://phabricator.wikimedia.org/T427175) [10:35:41] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302774 (https://phabricator.wikimedia.org/T427175) (owner: 10Elukey) [10:37:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission rdb101[12].eqiad.wmnet - https://phabricator.wikimedia.org/T428858#12022935 (10Jclark-ctr) a:03Jclark-ctr [10:39:46] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:42:40] (03PS1) 10JavierMonton: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302777 (https://phabricator.wikimedia.org/T429127) [10:45:14] (03PS2) 10Trueg: Added DNS entries for the new WDQS 2 deployments in DSE K8s. [dns] - 10https://gerrit.wikimedia.org/r/1301301 (https://phabricator.wikimedia.org/T428925) [10:47:19] (03CR) 10Trueg: Added DNS entries for the new WDQS 2 deployments in DSE K8s. 
(038 comments) [dns] - 10https://gerrit.wikimedia.org/r/1301301 (https://phabricator.wikimedia.org/T428925) (owner: 10Trueg) [10:49:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1037 T429118', diff saved to https://phabricator.wikimedia.org/P94172 and previous config saved to /var/cache/conftool/dbconfig/20260616-104931-marostegui.json [10:49:36] T429118: Migrate es6 section to Debian Trixie - https://phabricator.wikimedia.org/T429118 [10:49:57] (03PS1) 10Elukey: preseed: fix kafka-logging* recipe [puppet] - 10https://gerrit.wikimedia.org/r/1302781 (https://phabricator.wikimedia.org/T418929) [10:52:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298293 (https://phabricator.wikimedia.org/T422935) (owner: 10Lucas Werkmeister (WMDE)) [10:53:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299943 (https://phabricator.wikimedia.org/T422936) (owner: 10Sadiya.mohammed13) [10:53:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1302781 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [10:55:13] (03CR) 10Clément Goubert: [C:03+1] docker_registry: add comment and small tweaks to the nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1302774 (https://phabricator.wikimedia.org/T427175) (owner: 10Elukey) [10:56:15] (03CR) 10Elukey: [C:03+2] preseed: fix kafka-logging* recipe [puppet] - 10https://gerrit.wikimedia.org/r/1302781 (https://phabricator.wikimedia.org/T418929) (owner: 10Elukey) [10:57:47] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1300761 (https://phabricator.wikimedia.org/T428858) (owner: 10Effie 
Mouzeli) [10:58:03] (03PS1) 10Trueg: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) [10:59:02] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1302759 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [10:59:59] (03CR) 10Arnaudb: [C:03+2] ssh-client-config: use wmf-prod known_hosts for gitlab [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1302759 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [11:03:09] (03PS3) 10AOkoth: mariadb: add grants for phab2003 [puppet] - 10https://gerrit.wikimedia.org/r/1300156 (https://phabricator.wikimedia.org/T423727) [11:06:32] !log installing Bird security updates on routed Ganeti nodes [11:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:44] (03CR) 10AOkoth: [C:03+2] mariadb: add grants for phab2003 [puppet] - 10https://gerrit.wikimedia.org/r/1300156 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [11:10:14] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2035: Migration of es2035.codfw.wmnet completed [11:10:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [11:12:19] (03CR) 10Arnaudb: [V:03+2 C:03+2] ssh-client-config: use wmf-prod known_hosts for gitlab [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1302759 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [11:13:27] (03CR) 10Jelto: [C:03+1] ssh-client-config: use wmf-prod known_hosts for gitlab [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1302759 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [11:21:10] (03CR) 10Muehlenhoff: [C:03+2] Allow cumin2003 in IRC notifications [puppet] - 10https://gerrit.wikimedia.org/r/1302147 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [11:22:45] FIRING: 
WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:25:08] (03PS2) 10Tiziano Fogli: nrpewrapper: Improve team/severity override handling [puppet] - 10https://gerrit.wikimedia.org/r/1302785 (https://phabricator.wikimedia.org/T395446) [11:26:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission rdb101[12].eqiad.wmnet - https://phabricator.wikimedia.org/T428858#12023169 (10Jclark-ctr) [11:26:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission rdb101[12].eqiad.wmnet - https://phabricator.wikimedia.org/T428858#12023170 (10Jclark-ctr) 05Open→03Resolved [11:27:24] (03CR) 10Muehlenhoff: [C:03+2] Apply urldownloader role to urldownloader2005 [puppet] - 10https://gerrit.wikimedia.org/r/1295455 (https://phabricator.wikimedia.org/T427282) (owner: 10Muehlenhoff) [11:27:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:33:14] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T429116#12023208 (10Jclark-ctr) Verified cables seated properly and checked serials yesterday Errors seemed to of cleared will monitor for a week before closing {F88875075} [11:34:53] (03PS1) 10Jgiannelos: Bump wikimedia/parsoid to 0.24.0-a10 [vendor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302792 (https://phabricator.wikimedia.org/T417530) [11:35:09] (03PS1) 10Jgiannelos: Bump wikimedia/parsoid to 0.24.0-a10 [core] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302793 (https://phabricator.wikimedia.org/T429187) [11:36:44] 
(03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [vendor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302792 (https://phabricator.wikimedia.org/T417530) (owner: 10Jgiannelos) [11:36:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302793 (https://phabricator.wikimedia.org/T429187) (owner: 10Jgiannelos) [11:40:22] (03PS1) 10Dreamy Jazz: Revert "hCaptcha: Enable for UploadWizard on all wikis with it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302794 [11:41:02] jouncebot: nowandnext [11:41:02] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [11:41:02] In 0 hour(s) and 18 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1200) [11:41:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302794 (owner: 10Dreamy Jazz) [11:41:59] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1078 [11:42:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1078 [11:42:35] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [11:42:39] (03Merged) 10jenkins-bot: Revert "hCaptcha: Enable for UploadWizard on all wikis with it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302794 (owner: 10Dreamy Jazz) [11:43:04] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1302794|Revert "hCaptcha: Enable for UploadWizard on all wikis with 
it"]] [11:43:06] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [11:45:00] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1302794|Revert "hCaptcha: Enable for UploadWizard on all wikis with it"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:46:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:46:15] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1078 [11:46:34] !log cmooney@cumin1003 START - Cookbook sre.mysql.depool depool db2153: codfw rack a5 depool for switch maintenance T428020 [11:46:39] T428020: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020 [11:46:54] !log cmooney@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2153: codfw rack a5 depool for switch maintenance T428020 [11:47:32] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [11:47:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:47:49] !log cmooney@cumin1003 START - Cookbook sre.mysql.depool depool db2154: codfw rack a5 depool for switch maintenance T428020 [11:48:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2154: codfw rack a5 depool for switch maintenance T428020 [11:48:23] !log cmooney@cumin1003 START - Cookbook sre.mysql.depool depool db2157: codfw rack a5 depool for switch maintenance T428020 [11:48:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1078 [11:48:44] !log cmooney@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2157: codfw rack a5 depool for switch maintenance T428020 [11:48:48] !log cmooney@cumin1003 
START - Cookbook sre.mysql.depool depool db2175: codfw rack a5 depool for switch maintenance T428020 [11:49:08] !log cmooney@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2175: codfw rack a5 depool for switch maintenance T428020 [11:49:11] !log cmooney@cumin1003 START - Cookbook sre.mysql.depool depool db2176: codfw rack a5 depool for switch maintenance T428020 [11:49:31] !log cmooney@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2176: codfw rack a5 depool for switch maintenance T428020 [11:49:43] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302798 [11:51:49] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302794|Revert "hCaptcha: Enable for UploadWizard on all wikis with it"]] (duration: 08m 45s) [11:55:05] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12023307 (10Jclark-ctr) Looks like this is failing with the provision script. @cmooney Said he can take a look later to resolve it ` cloudvirt1078 (WMF11006): unsupported case, found 2 v4 p... 
[11:55:35] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12023308 (10Jclark-ctr) [11:57:03] !log cmooney@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2001.codfw.wmnet [11:58:53] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12023311 (10Jclark-ctr) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1200) [12:01:02] (03Abandoned) 10Fabfur: cache::haproxy: req.provenance to txn.provenance [puppet] - 10https://gerrit.wikimedia.org/r/1301431 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [12:01:15] (03Abandoned) 10Fabfur: cache::haproxy: log txn.provenance variable for haproxykafka [puppet] - 10https://gerrit.wikimedia.org/r/1301432 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [12:01:31] (03CR) 10Fabfur: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1301429 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [12:01:45] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1006.eqiad.wmnet with OS trixie [12:01:57] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2012.codfw.wmnet [12:02:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2001.codfw.wmnet [12:02:32] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2012.codfw.wmnet [12:03:02] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2013.codfw.wmnet [12:03:39] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2013.codfw.wmnet [12:03:44] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host 
wikikube-worker2014.codfw.wmnet [12:04:17] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2014.codfw.wmnet [12:04:22] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2017.codfw.wmnet [12:04:56] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2017.codfw.wmnet [12:05:06] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2018.codfw.wmnet [12:05:28] (03CR) 10Elukey: [C:03+1] puppet-merge: disable colors if we don't have a tty [puppet] - 10https://gerrit.wikimedia.org/r/1302262 (https://phabricator.wikimedia.org/T429129) (owner: 10JHathaway) [12:05:35] (03PS11) 10Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) [12:05:36] (03PS2) 10Aleksandar Mastilovic: Presto memory tuning, resource groups [puppet] - 10https://gerrit.wikimedia.org/r/1298852 (https://phabricator.wikimedia.org/T424112) [12:05:40] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2018.codfw.wmnet [12:05:45] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2041.codfw.wmnet [12:06:18] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2041.codfw.wmnet [12:06:23] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2044.codfw.wmnet [12:06:56] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2044.codfw.wmnet [12:06:56] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lsw1-a5-codfw,lsw1-a5-codfw IPv6,lsw1-a5-codfw.mgmt,ssw1-a[1,8]-codfw.mgmt with reason: 
switch upgrrade [12:07:01] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2051.codfw.wmnet [12:07:45] (03PS1) 10Seanleong-wmde: Hotfix for T428620 [extensions/Wikibase] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302804 (https://phabricator.wikimedia.org/T428620) [12:08:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302804 (https://phabricator.wikimedia.org/T428620) (owner: 10Seanleong-wmde) [12:10:45] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 29 hosts with reason: lsw1-a5-codfw JunOS upgrade [12:12:04] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2051.codfw.wmnet [12:12:09] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2074.codfw.wmnet [12:12:45] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2074.codfw.wmnet [12:12:50] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2075.codfw.wmnet [12:13:24] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2075.codfw.wmnet [12:13:29] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2076.codfw.wmnet [12:14:02] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2076.codfw.wmnet [12:14:07] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2077.codfw.wmnet [12:14:44] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host 
wikikube-worker2077.codfw.wmnet [12:14:49] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2078.codfw.wmnet [12:15:22] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2078.codfw.wmnet [12:15:27] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2091.codfw.wmnet [12:16:00] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2091.codfw.wmnet [12:16:06] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2092.codfw.wmnet [12:16:39] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2092.codfw.wmnet [12:16:44] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2242.codfw.wmnet [12:17:19] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2242.codfw.wmnet [12:17:24] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2243.codfw.wmnet [12:17:30] (03CR) 10Aleksandar Mastilovic: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [12:17:53] (03PS1) 10Muehlenhoff: Apply urldownloader role to urldownloader1005/1006/2006 [puppet] - 10https://gerrit.wikimedia.org/r/1302805 (https://phabricator.wikimedia.org/T427282) [12:17:58] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2243.codfw.wmnet [12:18:04] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2254.codfw.wmnet [12:18:51] (03PS2) 10CWilliams: Cookbook sre.mysql.upgrade should not accept multiple hosts [cookbooks] - 
10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) [12:19:12] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2254.codfw.wmnet [12:19:17] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2255.codfw.wmnet [12:19:31] (03CR) 10CI reject: [V:04-1] Hotfix for T428620 [extensions/Wikibase] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302804 (https://phabricator.wikimedia.org/T428620) (owner: 10Seanleong-wmde) [12:19:51] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2255.codfw.wmnet [12:21:41] (03CR) 10CWilliams: Cookbook sre.mysql.upgrade should not accept multiple hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: 10CWilliams) [12:22:00] (03CR) 10Kevin Bazira: [C:03+2] ml-services: deploy cope-b-a4b isvc that was migrated from HF transformers to vLLM 0.22.1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302641 (https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [12:22:14] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1006.eqiad.wmnet with reason: host reimage [12:23:43] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1007.eqiad.wmnet with OS trixie [12:24:13] !log reboot lsw1-a5-codfw to complete JunOS upgrade T428020 [12:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:17] T428020: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020 [12:24:20] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1008.eqiad.wmnet with OS trixie [12:24:34] (03CR) 10Elukey: [C:03+2] docker_registry: add comment and small tweaks to the nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1302774 
(https://phabricator.wikimedia.org/T427175) (owner: 10Elukey) [12:24:45] (03Merged) 10jenkins-bot: ml-services: deploy cope-b-a4b isvc that was migrated from HF transformers to vLLM 0.22.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302641 (https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [12:26:18] (03CR) 10Jelto: [V:03+1] "I tested the binary in WMCS on Kubernetes 1.31 (k3s) on bookworm: installing, listing and deleting charts works fine." [debs/helm3] - 10https://gerrit.wikimedia.org/r/1300145 (https://phabricator.wikimedia.org/T427403) (owner: 10Jelto) [12:26:29] I didn't puppet-merge, almost got the time window wrong topranks (sorryy) [12:27:53] elukey: np! you remembered tthough :) [12:28:31] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:28:54] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1006.eqiad.wmnet with reason: host reimage [12:30:04] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12023473 (10jcrespo) [12:30:21] (03CR) 10Jelto: [C:03+1] "great thanks! 
I think we would also have to release a new version version `1.0.6` (and tell people to install it if they have problem with" [puppet] - 10https://gerrit.wikimedia.org/r/1300763 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [12:30:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-a8-codfw:et-0/0/4 (Core: lsw1-a5-codfw:et-0/0/54 {#230403800028}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-a8-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:31:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a5-codfw (10.192.252.7) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:33:08] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [12:35:19] (03CR) 10Arnaudb: "@mmuhlenhoff@wikimedia.org told me he was going to update the debian package this week, I'll wait until then to advertise the new ssh url" [puppet] - 10https://gerrit.wikimedia.org/r/1300763 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [12:35:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/4 (Core: lsw1-a5-codfw:et-0/0/55 {#230403800022}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:40:18] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1008.eqiad.wmnet with reason: host reimage [12:40:40] 06SRE, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops, 06tools-infrastructure-team: Upgrade cloudsw1-e4-eqiad - 
https://phabricator.wikimedia.org/T429013#12023550 (10fgiunchedi) Indeed the recent rack redundancy testing has shown we are resilient to the loss of one rack, for all hosts but cloudvirts... [12:41:13] 06SRE, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops, 06tools-infrastructure-team: Upgrade cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T429014#12023556 (10fgiunchedi) See my update at https://phabricator.wikimedia.org/T429013#12023550 since it applies equally here [12:42:31] (03PS1) 10Bartosz Wójtowicz: ml-services: Update outlink-topic-model image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302812 (https://phabricator.wikimedia.org/T428127) [12:43:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver2002.codfw.wmnet [12:44:40] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1008.eqiad.wmnet with reason: host reimage [12:45:25] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2012.codfw.wmnet [12:45:27] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2012.codfw.wmnet [12:45:33] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2013.codfw.wmnet [12:45:34] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2013.codfw.wmnet [12:45:40] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2014.codfw.wmnet [12:45:41] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2014.codfw.wmnet [12:45:47] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2017.codfw.wmnet [12:45:48] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2017.codfw.wmnet 
[12:45:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/4 (Core: lsw1-a5-codfw:et-0/0/55 {#230403800022}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [12:45:54] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2018.codfw.wmnet [12:45:55] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2018.codfw.wmnet [12:45:56] !log cmooney@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2001.codfw.wmnet [12:45:57] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:45:57] !log cmooney@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2001.codfw.wmnet [12:46:01] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2041.codfw.wmnet [12:46:02] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2041.codfw.wmnet [12:46:08] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2044.codfw.wmnet [12:46:09] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2044.codfw.wmnet [12:46:14] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2051.codfw.wmnet [12:46:16] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2051.codfw.wmnet [12:46:21] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2074.codfw.wmnet [12:46:23] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host 
wikikube-worker2074.codfw.wmnet [12:46:23] (03CR) 10Jelto: "That would be a good addition if we change the `broadcast_message` frequently. But I'm wondering if we really need this code? In the past " [puppet] - 10https://gerrit.wikimedia.org/r/1302733 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [12:46:28] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2075.codfw.wmnet [12:46:30] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2075.codfw.wmnet [12:46:35] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2076.codfw.wmnet [12:46:36] !log cmooney@cumin1003 START - Cookbook sre.hosts.remove-downtime for 29 hosts [12:46:37] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2076.codfw.wmnet [12:46:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a5-codfw (10.192.252.7) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:46:43] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2077.codfw.wmnet [12:46:45] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2077.codfw.wmnet [12:46:50] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2078.codfw.wmnet [12:46:52] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2078.codfw.wmnet [12:46:53] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 29 hosts [12:46:57] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2091.codfw.wmnet [12:46:59] !log cmooney@cumin2002 END (PASS) - 
Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2091.codfw.wmnet [12:47:04] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2092.codfw.wmnet [12:47:05] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2092.codfw.wmnet [12:47:06] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [12:47:11] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2242.codfw.wmnet [12:47:13] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2242.codfw.wmnet [12:47:18] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2243.codfw.wmnet [12:47:20] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2243.codfw.wmnet [12:47:26] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2254.codfw.wmnet [12:47:27] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2254.codfw.wmnet [12:47:33] !log cmooney@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2255.codfw.wmnet [12:47:34] !log cmooney@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2255.codfw.wmnet [12:48:39] (03PS1) 10Muehlenhoff: Revert "Depool puppetserver2002 for rack maintenance" [dns] - 10https://gerrit.wikimedia.org/r/1302815 [12:48:51] !log cmooney@cumin1003 START - Cookbook sre.mysql.pool pool db2153: codfw rack a5 depool for switch maintenance T428020 [12:48:55] T428020: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020 [12:49:00] (03CR) 10Andrew Bogott: [C:04-1] "To 
be fair, Tofu doesn't do that /yet/, but I don't want to stand in the way of less icinga" [puppet] - 10https://gerrit.wikimedia.org/r/1302748 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [12:49:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2002.codfw.wmnet [12:49:26] (03CR) 10FNegri: [C:03+1] "LGTM, thanks @cwilliams@wikimedia.org!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: 10CWilliams) [12:50:11] elukey@cumin1003 reimage (PID 189080) is awaiting input [12:50:53] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [12:50:53] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1006.eqiad.wmnet with OS trixie [12:51:20] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1007.eqiad.wmnet with OS trixie [12:51:22] (03CR) 10Andrew Bogott: [C:04-1] "created https://phabricator.wikimedia.org/T429336 about tofu checks" [puppet] - 10https://gerrit.wikimedia.org/r/1302748 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [12:51:59] !log cmooney@cumin2002 START - Cookbook sre.mysql.pool pool db2154: codfw rack a5 depool for switch maintenance T428020 [12:52:02] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cloudvirt1079.eqiad.wmnet with OS trixie [12:54:38] (03CR) 10Andrew Bogott: [C:03+1] openstack: deprecate ensure_running_kvm_instances check [puppet] - 10https://gerrit.wikimedia.org/r/1302114 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [12:54:48] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:54:50] (03CR) 10Andrew Bogott: [C:03+1] openstack: deprecate check-cinder-snapshot-leaks [puppet] - 10https://gerrit.wikimedia.org/r/1302749 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [12:55:16] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [12:57:05] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [12:57:16] (03CR) 10Astein: [C:03+1] "lgtm! i assume the secrets are being added elsewhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302668 (https://phabricator.wikimedia.org/T429048) (owner: 10Brouberol) [12:57:21] (03PS4) 10Jelto: profile::reboot::unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1300). [13:00:05] mfossati, Neriah, atsukoito, nemo-yiannis, and seanleong-wmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] hey [13:00:13] o/ [13:00:15] hii! 
[13:00:16] 👋 [13:00:21] cc dcausse [13:00:27] o/ [13:00:33] I can self-deploy [13:01:03] 06SRE, 10DNS, 07Kubernetes: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12023689 (10CDanis) > It might be possible to work around this by creating headless services for these jobs that access the databases -- definitely worth trying. @Clement_Goubert @JMeybohm Doe... [13:01:22] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cloudvirt1080.eqiad.wmnet with OS trixie [13:01:31] mfossati: hi! please go ahead [13:01:43] let's start! [13:01:53] I can deploy some patches if someone can't self-deploy [13:02:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298875 (https://phabricator.wikimedia.org/T423148) (owner: 10Kimberly Sarabia) [13:02:09] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:02:11] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:38] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-feature-counts-change-enrich (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302777 (https://phabricator.wikimedia.org/T429127) (owner: 10JavierMonton) [13:02:42] dcausse: I need a deployer :) [13:02:48] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1079.eqiad.wmnet with reason: host reimage [13:02:56] Neriah: sure :) [13:03:18] (03Merged) 10jenkins-bot: Remove custom streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1298875 (https://phabricator.wikimedia.org/T423148) (owner: 10Kimberly Sarabia) [13:03:42] !log mfossati@deploy1003 Started scap sync-world: Backport for [[gerrit:1298875|Remove custom streams (T423148)]] [13:03:43] Can somebody deploy mine too ? [13:03:47] T423148: Remove custom streams from EventStreamConfig - https://phabricator.wikimedia.org/T423148 [13:03:55] (i am not sure how to deploy vendor changes) [13:04:28] nemo-yiannis: looking [13:04:35] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [13:04:50] * TheresNoTime isn't available to deploy this window, think i borked my ssh config on this laptop, whoops.. 
[13:05:06] There is a core patch that bumps the version of parsoid and a vendor patch that it depends on [13:05:13] elukey@cumin1003 reimage (PID 192418) is awaiting input [13:05:24] (03CR) 10Muehlenhoff: [C:03+2] Revert "Depool puppetserver2002 for rack maintenance" [dns] - 10https://gerrit.wikimedia.org/r/1302815 (owner: 10Muehlenhoff) [13:05:29] !log jmm@dns1004 START - running authdns-update [13:05:39] !log mfossati@deploy1003 ksarabia, mfossati: Backport for [[gerrit:1298875|Remove custom streams (T423148)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:06:06] !log mfossati@deploy1003 ksarabia, mfossati: Continuing with deployment [13:07:13] !log jmm@dns1004 END - running authdns-update [13:07:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1302758 (owner: 10Volans) [13:07:41] nemo-yiannis: I'm not sure https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1302793 is strictly necessary but should not hurt to ship as well [13:08:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:08:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging1008.eqiad.wmnet with OS trixie [13:08:29] Yeah, both patches need to be deployed: https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/1302792 then https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1302793 [13:08:34] (03PS14) 10Andrew Bogott: cloud-vps vendordata: install cumin key at VM creation time [puppet] - 10https://gerrit.wikimedia.org/r/1302236 (https://phabricator.wikimedia.org/T422801) [13:08:54] dcausse: ^ [13:08:56] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1079.eqiad.wmnet with reason: host reimage [13:09:23] (03CR) 10Seanleong-wmde: "recheck"
[extensions/Wikibase] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302804 (https://phabricator.wikimedia.org/T428620) (owner: 10Seanleong-wmde) [13:09:43] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302777 (https://phabricator.wikimedia.org/T429127) (owner: 10JavierMonton) [13:10:10] nemo-yiannis: sure [13:10:25] dcausse: from my side, I don't think the patch needs further testing. [13:10:26] we deployed two similar changes last week to make sure the behavior is correct, and I've checked commons multiple times, everything looks fine. (Ping Lucas_WMDE if he's around, I did this together with him.) [13:11:00] Neriah: ok [13:11:02] (03CR) 10MVernon: [C:03+1] "Dunno if string-vs-int will cause us problems later, but this at least seems worth a shot!" [puppet] - 10https://gerrit.wikimedia.org/r/1302772 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [13:11:22] (there isn't really a good way to properly test this either) [13:11:28] ack [13:11:51] (03CR) 10Tiziano Fogli: [C:03+1] mysql-gtid.yaml: Add pint [alerts] - 10https://gerrit.wikimedia.org/r/1302724 (https://phabricator.wikimedia.org/T427469) (owner: 10Marostegui) [13:12:16] (03Merged) 10jenkins-bot: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302777 (https://phabricator.wikimedia.org/T429127) (owner: 10JavierMonton) [13:12:18] !log mfossati@deploy1003 Finished scap sync-world: Backport for [[gerrit:1298875|Remove custom streams (T423148)]] (duration: 08m 35s) [13:12:22] T423148: Remove custom streams from EventStreamConfig - https://phabricator.wikimedia.org/T423148 [13:12:25] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1080.eqiad.wmnet with reason: host reimage [13:12:35] FIRING: [3x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:36] done, the floor is all yours :-) [13:12:43] mfossati: thanks [13:12:55] Neriah: shipping your patch [13:13:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12023730 (10cmooney) [13:13:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#12023731 (10Jclark-ctr) [13:13:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new MPC10E-10C line cards on cr1-eqiad and cr2-eqiad slot 0. - https://phabricator.wikimedia.org/T426343#12023732 (10cmooney) >>! In T426343#12016920, @Papaul wrote: > @cmooney I took a look at the steps all look good to me for... [13:14:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299626 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [13:14:17] (03CR) 10Tiziano Fogli: prometheus: use dc label in appservers_red reporting rules (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1302185 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [13:15:00] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:16:01] (03Merged) 10jenkins-bot: Replace wgNewUserMessageOnAutoCreate with wgNewUserMessageOnFirstEdit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1299626 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [13:16:26] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1299626|Replace wgNewUserMessageOnAutoCreate with wgNewUserMessageOnFirstEdit (T426206)]] [13:16:30] T426206: Per global RfC, only welcome users on Wikimedia projects where they created their account or have edited - https://phabricator.wikimedia.org/T426206 [13:18:27] !log dcausse@deploy1003 dcausse, neriah: Backport for [[gerrit:1299626|Replace wgNewUserMessageOnAutoCreate with wgNewUserMessageOnFirstEdit (T426206)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:18:50] checking logs [13:20:01] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1080.eqiad.wmnet with reason: host reimage [13:20:06] (03CR) 10Gergő Tisza: "Thanks! Can be merged any time. 
(Let me know if you want me to be around, but otherwise I think the change is trivial enough that it doesn" [puppet] - 10https://gerrit.wikimedia.org/r/1298383 (https://phabricator.wikimedia.org/T208443) (owner: 10Gergő Tisza) [13:20:22] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [13:20:34] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [13:21:00] !log dcausse@deploy1003 dcausse, neriah: Continuing with deployment [13:21:17] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:21:33] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [13:22:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [13:22:10] FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:22:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:24:17] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:25:02] jouncebot: next [13:25:02] In 0 hour(s) and 34 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1400) 
[13:25:15] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:25:16] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1299626|Replace wgNewUserMessageOnAutoCreate with wgNewUserMessageOnFirstEdit (T426206)]] (duration: 08m 50s) [13:25:22] Neriah: should be live [13:25:23] T426206: Per global RfC, only welcome users on Wikimedia projects where they created their account or have edited - https://phabricator.wikimedia.org/T426206 [13:25:38] atsukoito: your turn :) [13:25:47] lemme do the honors [13:26:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by atsuko@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302197 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [13:27:10] RESOLVED: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:27:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:27:41] (03Merged) 10jenkins-bot: translate: remove CirrusSearch endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302197 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [13:28:06] !log atsuko@deploy1003 Started scap sync-world: Backport for [[gerrit:1302197|translate: remove CirrusSearch endpoints (T425377)]] [13:28:11] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 
[13:28:20] elukey@cumin1003 reimage (PID 198109) is awaiting input [13:29:06] (03CR) 10FNegri: [C:03+1] local CI: force docker arch on linux (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302758 (owner: 10Volans) [13:30:08] !log atsuko@deploy1003 atsuko: Backport for [[gerrit:1302197|translate: remove CirrusSearch endpoints (T425377)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:30:20] (03CR) 10Brouberol: [C:03+2] airflow-fr-tech: configure s3_fr_tech connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302668 (https://phabricator.wikimedia.org/T429048) (owner: 10Brouberol) [13:30:44] (03PS2) 10TChin: [eventstreams] Bump to v0.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276779 (https://phabricator.wikimedia.org/T420257) [13:30:45] (03CR) 10Volans: local CI: force docker arch on linux (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302758 (owner: 10Volans) [13:31:16] (03PS1) 10Arnaudb: ci: monitor a wider variety of network errors [puppet] - 10https://gerrit.wikimedia.org/r/1302829 (https://phabricator.wikimedia.org/T420865) [13:31:38] dcausse: testing debug wiki [13:31:41] (03PS1) 10Muehlenhoff: Disable Debian mirror sync [puppet] - 10https://gerrit.wikimedia.org/r/1302838 (https://phabricator.wikimedia.org/T416707) [13:32:28] (03PS1) 10Arnaudb: ci: add repositories to gitcache [puppet] - 10https://gerrit.wikimedia.org/r/1302834 (https://phabricator.wikimedia.org/T420865) [13:32:46] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:32:46] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1079.eqiad.wmnet with OS trixie [13:33:28] atsukoito: something's not right not getting the same output for e.g. 
https://www.wikifunctions.org/w/index.php?title=Special%3ATranslate&showMessage=Wikifunctions%3AStatus_updates%2F2026-05-15%2F1&group=page-Wikifunctions%3AStatus+updates%2F2026-05-15&language=de&filter=&optional=1&action=translate [13:33:37] (03CR) 10CDanis: [C:03+1] cache::haproxy: remove x-provenance feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1301429 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [13:33:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#12023807 (10elukey) I was able to successfully install Trixie on 1006 and 1008, but for some reason 1007 ended up in the EFI shell when I tried to EFI Boot. [13:33:53] same, I don't get correct output on a random test page https://www.mediawiki.org/w/index.php?title=Special%3ATranslate&group=page-Help%3AExtension%3ATranslate%2FTranslation+memories&action=page&filter=%21translated&language=ja [13:34:16] not seeing much in the logs :/ [13:34:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2153: codfw rack a5 depool for switch maintenance T428020 [13:34:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302838 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [13:34:24] T428020: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020 [13:34:58] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12023810 (10elukey) cloudvirt1077, 1079 and 1080 are running Trixie. The only one missing is 1078. [13:35:29] Special:SearchTranslations seem to work fine tho... 
[13:35:39] (03CR) 10Fabfur: [C:03+2] cache::haproxy: remove x-provenance feature flag [puppet] - 10https://gerrit.wikimedia.org/r/1301429 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [13:35:45] (03PS2) 10Filippo Giunchedi: openstack: deprecate check-cinder-snapshot-leaks [puppet] - 10https://gerrit.wikimedia.org/r/1302749 (https://phabricator.wikimedia.org/T328502) [13:36:00] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] openstack: deprecate check-cinder-snapshot-leaks [puppet] - 10https://gerrit.wikimedia.org/r/1302749 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [13:36:13] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:36:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:36:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1080.eqiad.wmnet with OS trixie [13:36:30] (03PS1) 10Kevin Bazira: ml-services: deploy cope-b-a4b isvc that adds confidence score to responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302846 (https://phabricator.wikimedia.org/T427497) [13:36:35] I think it is rollback then, and I'll introduce the patch to test pure k8s deployment on testwiki [13:36:54] dcausse: did you see any logs? 
[13:37:03] !log atsuko@deploy1003 atsuko: Rolling back deployment [13:37:28] atsukoito: no all is empty not sure what's happening [13:37:44] could be a timeout but I'd expect some errors in the logs [13:37:58] or it's returning nothing [13:38:31] (03CR) 10JHathaway: [C:03+2] profile::postfix::mx: Mark the SMTP port as intentionally open [puppet] - 10https://gerrit.wikimedia.org/r/1283043 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [13:38:32] could be a bug in opensearch2 support as well, hard to tell [13:39:02] (03PS1) 10Atsuko: Revert "translate: remove CirrusSearch endpoints" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302850 [13:39:09] atsukoito: yes let's revert, we'll try to repro crafting the same query and sending it directly to opensearch2 [13:39:22] !log atsuko@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302197|translate: remove CirrusSearch endpoints (T425377)]] (duration: 11m 16s) [13:39:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:39:26] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [13:39:59] dcausse: do I merge the revert via scap or just with +1? [13:40:03] !log cmooney@cumin2002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2154: codfw rack a5 depool for switch maintenance T428020 [13:40:06] meant +2 [13:40:06] T428020: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020 [13:40:07] !log cmooney@cumin2002 START - Cookbook sre.mysql.pool pool db2157: codfw rack a5 depool for switch maintenance T428020 [13:40:16] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1302850 [13:40:18] atsukoito: you should ship the revert via scap [13:40:24] ok [13:40:34] (03CR) 10Atsuko: [C:03+1] "revert" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302850 (owner: 10Atsuko) [13:40:57] thank you dcausse! 
[13:40:58] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: deploy cope-b-a4b isvc that adds confidence score to responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302846 (https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [13:41:05] Neriah: yw! :) [13:41:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by atsuko@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302850 (owner: 10Atsuko) [13:41:16] (03CR) 10Kevin Bazira: [C:03+2] ml-services: deploy cope-b-a4b isvc that adds confidence score to responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302846 (https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [13:41:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:41:20] (03PS1) 10Muehlenhoff: tunnelencabulator: Update bastions in ulsfo/eqsin [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1302853 [13:41:20] (03PS1) 10Muehlenhoff: configs/ssh-client-config: Update bastions in ulsfo/eqsin [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1302854 [13:41:27] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Review of firewall services without srange - https://phabricator.wikimedia.org/T149804#12023839 (10MoritzMuehlenhoff) [13:41:40] (03PS1) 10Fabfur: hiera: remove x_provenance switch for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1302855 (https://phabricator.wikimedia.org/T427068) [13:42:35] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302855 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [13:43:12] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[13:43:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-fr-tech: apply [13:43:28] (03Merged) 10jenkins-bot: ml-services: deploy cope-b-a4b isvc that adds confidence score to responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302846 (https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [13:43:33] (03PS1) 10Muehlenhoff: Remove the black box check for mirrors.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1302858 (https://phabricator.wikimedia.org/T416707) [13:43:45] (03Merged) 10jenkins-bot: Revert "translate: remove CirrusSearch endpoints" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302850 (owner: 10Atsuko) [13:43:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-fr-tech: apply [13:44:07] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:44:08] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] tunnelencabulator: Update bastions in ulsfo/eqsin [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1302853 (owner: 10Muehlenhoff) [13:44:12] !log atsuko@deploy1003 Started scap sync-world: Backport for [[gerrit:1302850|Revert "translate: remove CirrusSearch endpoints"]] [13:44:32] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] configs/ssh-client-config: Update bastions in ulsfo/eqsin [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1302854 (owner: 10Muehlenhoff) [13:45:15] dcausse: if I remember correctly, I don't need to propagate the revert over all the servers, and can abort after debug [13:45:15] (03PS1) 10Elukey: Add sre.hosts.bmc-user-mgmt.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) [13:45:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#12023875 (10cmooney) [13:45:40] (03PS1) 10Muehlenhoff: Bump changelog for 1.0.6 [debs/wmf-laptop] - 
10https://gerrit.wikimedia.org/r/1302860 [13:45:41] (03CR) 10Elukey: "Still need to test it!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1302859 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [13:46:01] atsukoito: I believe you're right, worst case the next deploys will ship the matching images [13:46:01] 06SRE, 10DNS, 07Kubernetes: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12023876 (10Clement_Goubert) >>! In T428573#12023688, @CDanis wrote: >> It might be possible to work around this by creating headless services for these jobs that access the databases -- defini... [13:46:14] !log atsuko@deploy1003 atsuko: Backport for [[gerrit:1302850|Revert "translate: remove CirrusSearch endpoints"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:46:36] (03CR) 10Effie Mouzeli: [C:03+2] aliases: rdb1011 has been decommed [puppet] - 10https://gerrit.wikimedia.org/r/1300748 (owner: 10Effie Mouzeli) [13:46:44] dcausse: Special:Translate has returned [13:46:57] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: remove retired redis hosts [puppet] - 10https://gerrit.wikimedia.org/r/1300761 (https://phabricator.wikimedia.org/T428858) (owner: 10Effie Mouzeli) [13:47:18] atsukoito: you mean targetting debug servers? [13:47:26] yup [13:47:30] ack [13:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:47:36] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:47:47] atsukoito: lemme know if we can continue with other patches [13:47:57] !log atsuko@deploy1003 atsuko: Rolling back deployment [13:48:23] !log atsuko@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302850|Revert "translate: remove CirrusSearch endpoints"]] (duration: 04m 10s) [13:48:30] dcausse: you can proceed [13:48:33] thanks! [13:48:40] nemo-yiannis: still around? [13:48:57] yup [13:49:09] ok, shipping your patches then [13:49:32] thanks [13:49:40] FIRING: [3x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:46] (03CR) 10Volans: [C:03+2] local CI: force docker arch on linux [puppet] - 10https://gerrit.wikimedia.org/r/1302758 (owner: 10Volans) [13:50:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [vendor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302792 (https://phabricator.wikimedia.org/T417530) (owner: 10Jgiannelos) [13:50:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302793 (https://phabricator.wikimedia.org/T429187) (owner: 10Jgiannelos) [13:51:53] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [13:52:18] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [13:52:19] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:52:46] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:53:46] (03CR) 10Tiziano Fogli: "Our Alertmanager configuration groups alerts by alertname, cluster, scope, and team." 
[alerts] - 10https://gerrit.wikimedia.org/r/1278488 (owner: 10Gmodena) [13:54:00] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a10 [vendor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302792 (https://phabricator.wikimedia.org/T417530) (owner: 10Jgiannelos) [13:54:24] (03CR) 10CDanis: [C:03+1] hiera: remove x_provenance switch for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1302855 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [13:54:37] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki-common: remove retired redis servers from list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300753 (owner: 10Effie Mouzeli) [13:54:41] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:54:48] (03CR) 10Fabfur: [C:03+2] hiera: remove x_provenance switch for haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1302855 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [13:55:32] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.24.0-a10 [core] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302793 (https://phabricator.wikimedia.org/T429187) (owner: 10Jgiannelos) [13:55:36] fwiw I would love to still deploy the final change in this window, even if it goes a bit overtime. Is that okay? [13:55:47] Otherwise we will have logspam tonight... [13:56:02] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1302792|Bump wikimedia/parsoid to 0.24.0-a10 (T417530 T428105 T429187)]], [[gerrit:1302793|Bump wikimedia/parsoid to 0.24.0-a10 (T429187)]] [13:56:09] awight: sure, want to self-deploy? [13:56:11] T417530: Parsoid shouldn't wrap wikitext html-ish `` tags in
wrappers - https://phabricator.wikimedia.org/T417530 [13:56:11] T428105: Extension:Chart renders stray text above chart in Parsoid when field title contains "%" - https://phabricator.wikimedia.org/T428105 [13:56:12] T429187: CTT tasks week of 2026-06-12 - https://phabricator.wikimedia.org/T429187 [13:56:40] we can ping testkitchens devs to see if this is fine [13:56:46] dcausse: are the changes live? [13:57:11] nemo-yiannis: not yet, they haven't hit debug servers yet [13:57:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-fr-tech: apply [13:57:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-fr-tech: apply [13:57:59] yes [13:58:01] !log dcausse@deploy1003 jgiannelos, dcausse: Backport for [[gerrit:1302792|Bump wikimedia/parsoid to 0.24.0-a10 (T417530 T428105 T429187)]], [[gerrit:1302793|Bump wikimedia/parsoid to 0.24.0-a10 (T429187)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:58:11] (03Merged) 10jenkins-bot: mediawiki-common: remove retired redis servers from list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300753 (owner: 10Effie Mouzeli) [13:58:23] nemo-yiannis: now you should see the fixes in debug servers [13:58:29] ok [13:58:30] looking [13:58:36] 06SRE, 06Infrastructure-Foundations, 06Traffic: Scaling urldownloaders by adding redundancy and load balancing - https://phabricator.wikimedia.org/T429175#12023981 (10cmooney) Option 3 is fine I think. The only thing I'd say against it is with my network hat on I'm always trying to minimise hops, especially... 
[13:59:31] (03PS1) 10Fabfur: cache::haproxy: using intermediate variable for logging x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1302874 (https://phabricator.wikimedia.org/T427068) [13:59:35] (03PS1) 10WMDE-Fisch: Fix VE core submodule update to 3e79e9934 [extensions/VisualEditor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302872 (https://phabricator.wikimedia.org/T428764) [13:59:56] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [14:00:04] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1400) [14:00:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-feature-counts-change-enrich: apply [14:00:58] the mw backport window is running a bit late [14:02:03] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302874 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [14:02:06] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:02:20] nemo-yiannis: fwiw 1.47.0-wmf.7 is only available on test wikis [14:02:48] not yet on group0 & mediawiki.org for instance [14:02:48] ok, i was testing enwiki and i was confused :P [14:02:55] i think it's OK to roll out [14:03:00] sounds good [14:03:03] it's part of our weekly releases [14:03:15] !log dcausse@deploy1003 jgiannelos, dcausse: Continuing with deployment [14:03:43] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: es2045 down - https://phabricator.wikimedia.org/T429113#12024001 (10Jhancock.wm) 05Open→03Resolved [14:04:11] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v1.4.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302881 (https://phabricator.wikimedia.org/T428802) [14:04:57] (03CR) 10FNegri: [C:03+1] local CI: force docker arch on linux (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302758 (owner: 10Volans) 
[14:06:22] (03CR) 10Filippo Giunchedi: "This is ready for reviewing again" [alerts] - 10https://gerrit.wikimedia.org/r/1302151 (https://phabricator.wikimedia.org/T328502) (owner: 10Filippo Giunchedi) [14:06:45] sfaci: hi! just in case it could possibly conflict with your test kitchen deploys note that the mw backport window is not entirely done, a patch is being shipped and we'd like to ship another one [14:06:50] (03PS3) 10Filippo Giunchedi: team-wmcs: introduce per-namespace neutron conntrack alert [alerts] - 10https://gerrit.wikimedia.org/r/1302151 (https://phabricator.wikimedia.org/T328502) [14:07:14] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:07:31] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302792|Bump wikimedia/parsoid to 0.24.0-a10 (T417530 T428105 T429187)]], [[gerrit:1302793|Bump wikimedia/parsoid to 0.24.0-a10 (T429187)]] (duration: 11m 29s) [14:07:38] T417530: Parsoid shouldn't wrap wikitext html-ish `` tags in
wrappers - https://phabricator.wikimedia.org/T417530 [14:07:39] T428105: Extension:Chart renders stray text above chart in Parsoid when field title contains "%" - https://phabricator.wikimedia.org/T428105 [14:07:39] T429187: CTT tasks week of 2026-06-12 - https://phabricator.wikimedia.org/T429187 [14:07:59] nemo-yiannis: should be live on test wikis and will ride the train this week [14:08:13] awight: want to self-deploy or should I? [14:08:27] thanks dcausse [14:08:48] dcausse: ty, I can self-deploy [14:08:56] sure! [14:10:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302804 (https://phabricator.wikimedia.org/T428620) (owner: 10Seanleong-wmde) [14:10:29] (03PS1) 10Brouberol: airflow-fr-tech: use the fr-tech CA certificate to validate the s3 endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302882 (https://phabricator.wikimedia.org/T429048) [14:11:44] 10SRE-swift-storage, 06Commons: Compressing TIFF files from the Library of Congress - https://phabricator.wikimedia.org/T429264#12024035 (10Yann) I am curious. What is the average size of these files? And how much storage would be saved if compressed? 
[14:12:24] (03CR) 10Astein: [C:03+1] airflow-fr-tech: use the fr-tech CA certificate to validate the s3 endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302882 (https://phabricator.wikimedia.org/T429048) (owner: 10Brouberol) [14:12:31] (03PS2) 10Brouberol: airflow-fr-tech: use the fr-tech CA certificate to validate the s3 endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302882 (https://phabricator.wikimedia.org/T429048) [14:13:14] (03CR) 10Tiziano Fogli: [C:03+2] thanos/compact: avoid constant Puppet changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1273762 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [14:14:14] 10SRE-swift-storage, 06Commons: Compressing TIFF files from the Library of Congress - https://phabricator.wikimedia.org/T429264#12024043 (10Ladsgroup) I'm planning to look that up soon-ish. It's not too hard to measure. I just need to join categorylinks table or templatelinks table with image/file table [14:15:57] (03CR) 10Brouberol: [C:03+2] airflow-fr-tech: use the fr-tech CA certificate to validate the s3 endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302882 (https://phabricator.wikimedia.org/T429048) (owner: 10Brouberol) [14:16:04] (03PS1) 10JavierMonton: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302883 (https://phabricator.wikimedia.org/T429127) [14:20:08] (03CR) 10Brouberol: [C:03+1] stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302883 (https://phabricator.wikimedia.org/T429127) (owner: 10JavierMonton) [14:21:37] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302883 (https://phabricator.wikimedia.org/T429127) (owner: 10JavierMonton) [14:22:33] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on db2247 is CRITICAL: communication: 0 OK : 
controller: 1 Needs Attention : physical_disk: 0 OK : virtual_disk: 1 Dgrd : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [14:22:35] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on db2247 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 0 OK : virtual_disk: 1 Dgrd : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T429348 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [14:22:42] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2247 - https://phabricator.wikimedia.org/T429348 (10ops-monitoring-bot) 03NEW [14:23:54] (03Merged) 10jenkins-bot: stream: mw-page-html-feature-counts-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302883 (https://phabricator.wikimedia.org/T429127) (owner: 10JavierMonton) [14:28:11] !log cmooney@cumin2002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2157: codfw rack a5 depool for switch maintenance T428020 [14:28:16] T428020: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020 [14:28:16] !log cmooney@cumin2002 START - Cookbook sre.mysql.pool pool db2175: codfw rack a5 depool for switch maintenance T428020 [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1430) [14:33:26] gate-and-submit is hanging for some reason https://integration.wikimedia.org/ci/job/quibble-apitests-only-vendor-php83/28053/console [14:33:39] (03CR) 10Awight: "recheck" [extensions/Wikibase] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302804 (https://phabricator.wikimedia.org/T428620) (owner: 10Seanleong-wmde) [14:39:03] (03CR) 10CI reject: [V:04-1] Hotfix for T428620 [extensions/Wikibase] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302804 
(https://phabricator.wikimedia.org/T428620) (owner: 10Seanleong-wmde) [14:39:42] (03CR) 10Awight: [C:03+2] "recheck" [extensions/Wikibase] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302804 (https://phabricator.wikimedia.org/T428620) (owner: 10Seanleong-wmde) [14:41:58] (03CR) 10AikoChou: [C:03+1] ml-services: Update outlink-topic-model image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302812 (https://phabricator.wikimedia.org/T428127) (owner: 10Bartosz Wójtowicz) [14:44:36] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update outlink-topic-model image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302812 (https://phabricator.wikimedia.org/T428127) (owner: 10Bartosz Wójtowicz) [14:46:41] (03PS1) 10Jdlrobson: Guard round function with a supports query [core] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302890 (https://phabricator.wikimedia.org/T424596) [14:46:45] (03Merged) 10jenkins-bot: ml-services: Update outlink-topic-model image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302812 (https://phabricator.wikimedia.org/T428127) (owner: 10Bartosz Wójtowicz) [14:46:48] !log aokoth@deploy1003 Started deploy [phabricator/deployment@73e57ce]: deploy phab [14:47:13] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users level 1 for chudson - https://phabricator.wikimedia.org/T429353 (10CHudson-WMF) 03NEW [14:47:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [core] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302890 (https://phabricator.wikimedia.org/T424596) (owner: 10Jdlrobson) [14:48:08] (03Merged) 10jenkins-bot: Hotfix for T428620 [extensions/Wikibase] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302804 (https://phabricator.wikimedia.org/T428620) (owner: 10Seanleong-wmde) [14:48:57] !log aokoth@deploy1003 Finished deploy 
[phabricator/deployment@73e57ce]: deploy phab (duration: 02m 09s) [14:49:35] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:Switch refresh diagram and wiring - https://phabricator.wikimedia.org/T423724#12024308 (10Papaul) [14:50:56] trying the scap one more time... [14:51:29] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1302804|Hotfix for T428620 (T428620)]] [14:51:34] T428620: TypeError: Wikibase UsageDeduplicator::deduplicateStatementUsages(): Argument must be of type array (Warning: Undefined array key "C") (☀️) - https://phabricator.wikimedia.org/T428620 [14:53:31] !log awight@deploy1003 seanleong-wmde, awight: Backport for [[gerrit:1302804|Hotfix for T428620 (T428620)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:54:42] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update records for frproto1001 (formerly payments1008) - cmooney@cumin1003" [14:54:46] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update records for frproto1001 (formerly payments1008) - cmooney@cumin1003" [14:54:46] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:55:11] (03PS1) 10AOkoth: hiera: promote phab2003 to passive_server [puppet] - 10https://gerrit.wikimedia.org/r/1302894 (https://phabricator.wikimedia.org/T423727) [14:55:26] !log urbanecm@deploy1003 mwscript-k8s job started: foreachwikiindblist growthexperiments purgeUserOptions.php --login-age 1 growthexperiments-tour-help-panel # T429352 [14:55:30] T429352: Remove GrowthExperiments tour properties older than a year - https://phabricator.wikimedia.org/T429352 [14:57:14] !log awight@deploy1003 seanleong-wmde, awight: Continuing with deployment [14:58:33] (03PS2) 10Trueg: 
dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) [14:59:13] (03CR) 10RLazarus: [C:03+1] redirects.dat: Funnel api.w.o to mw.o/wiki/Wikimedia_APIs [puppet] - 10https://gerrit.wikimedia.org/r/1302106 (https://phabricator.wikimedia.org/T418492) (owner: 10Clément Goubert) [15:00:04] jelto, arnoldokoth, mutante, and arnaudb: How many deployers does it take to do SRE Collaboration Services office hours deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1500). [15:01:30] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302804|Hotfix for T428620 (T428620)]] (duration: 10m 00s) [15:01:35] T428620: TypeError: Wikibase UsageDeduplicator::deduplicateStatementUsages(): Argument must be of type array (Warning: Undefined array key "C") (☀️) - https://phabricator.wikimedia.org/T428620 [15:03:45] !log urbanecm@deploy1003 mwscript-k8s job started: foreachwikiindblist growthexperiments purgeUserOptions.php --login-age 1 growthexperiments-tour-homepage-mentorship # T429352 [15:03:49] (03CR) 10JMeybohm: [C:03+1] tls_terminator: Convert size to kB for rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1302772 (https://phabricator.wikimedia.org/T414440) (owner: 10Clément Goubert) [15:03:50] T429352: Remove GrowthExperiments tour properties older than a year - https://phabricator.wikimedia.org/T429352 [15:06:12] !log urbanecm@deploy1003 mwscript-k8s job started: foreachwikiindblist growthexperiments purgeUserOptions.php --login-age 1 growthexperiments-tour-homepage-discovery # T429352 [15:06:58] !log urbanecm@deploy1003 mwscript-k8s job started: foreachwikiindblist growthexperiments purgeUserOptions.php --login-age 1 growthexperiments-tour-homepage-welcome # T429352 [15:08:22] (03CR) 10Scott French: "Thanks, Fabrizio!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1302874 (https://phabricator.wikimedia.org/T427068) (owner: 10Fabfur) [15:16:19] !log cmooney@cumin2002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2175: codfw rack a5 depool for switch maintenance T428020 [15:16:24] !log cmooney@cumin2002 START - Cookbook sre.mysql.pool pool db2176: codfw rack a5 depool for switch maintenance T428020 [15:16:24] T428020: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020 [15:17:24] 06SRE, 10DNS, 07Kubernetes: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12024436 (10JMeybohm) >>! In T428573#12023688, @CDanis wrote: >> It might be possible to work around this by creating headless services for these jobs that access the databases -- definitely wo... [15:18:13] (03PS14) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [15:20:31] (03PS1) 10Muehlenhoff: Fix component config for routinator on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1302899 [15:22:19] (03CR) 10Muehlenhoff: [C:03+2] Fix component config for routinator on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1302899 (owner: 10Muehlenhoff) [15:29:27] (03PS15) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [15:32:58] !log brennen@deploy1003 Started deploy [phabricator/deployment@a640ed9]: test deploy phab2003 - T427286 [15:33:03] T427286: Deploy Phab/Phorge 2026-05-26 - https://phabricator.wikimedia.org/T427286 [15:33:06] (03CR) 10Ottomata: [C:03+1] [eventstreams] Bump to v0.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276779 (https://phabricator.wikimedia.org/T420257) (owner: 10TChin) [15:33:47] !log brennen@deploy1003 Finished deploy [phabricator/deployment@a640ed9]: test deploy phab2003 - T427286 
(duration: 00m 49s) [15:34:08] 06SRE, 10DNS, 07Kubernetes: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12024654 (10CDanis) >>! In T428573#12024436, @JMeybohm wrote: > Not completely. But I can imagine this might come up again in X time because another thing came along that does not have the dumm... [15:36:25] !log dancy@deploy1003 Installing scap version "4.269.0" for 2 host(s) [15:38:17] !log dancy@deploy1003 Installation of scap version "4.269.0" completed for 2 hosts [15:38:45] !log Remove `migrateMentorStatusAwayToCommunityConfiguration` from `updatelog` on all wikis in `growthexperiments.dblist` (T409170) [15:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:50] T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170 [15:39:59] !log installing Tomcat security updates [15:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:44] !log urbanecm@deploy1003 mwscript-k8s job started: GrowthExperiments:migrateMentorStatusAway --wiki=abwiki --dry-run # T409170 [15:50:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/VisualEditor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302320 (owner: 10DLynch) [15:51:18] (03CR) 10Marostegui: [C:03+1] etcd: Ignore test-s4 from dbctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1292300 (https://phabricator.wikimedia.org/T427059) (owner: 10Ladsgroup) [15:54:44] 06SRE, 10homer, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Homer should abort on filter rules applied on non-existent or disabled interfaces - https://phabricator.wikimedia.org/T428886#12024751 (10cmooney) >>! In T428886#12019220, @taavi wrote: > I'm not a huge fan of relying on the exac... 
[15:55:45] (03PS2) 10AOkoth: hiera: promote phab2003 to passive_server [puppet] - 10https://gerrit.wikimedia.org/r/1302894 (https://phabricator.wikimedia.org/T423727) [16:00:05] jhathaway and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:40] fyi we're still working on phab stuff [16:04:27] !log cmooney@cumin2002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2176: codfw rack a5 depool for switch maintenance T428020 [16:04:32] T428020: codfw: rack A5 maintenance - https://phabricator.wikimedia.org/T428020 [16:06:26] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Deploy [16:07:22] !log brennen@deploy1003 Started deploy [phabricator/deployment@a640ed9]: deploy phab2002 - T429350 [16:07:27] T429350: Deploy Phab/Phorge 2026-06-16 - https://phabricator.wikimedia.org/T429350 [16:07:35] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:09] !log brennen@deploy1003 Finished deploy [phabricator/deployment@a640ed9]: deploy phab2002 - T429350 (duration: 00m 47s) [16:08:35] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Deploy [16:08:48] (03PS1) 10Volans: ceph: allow to set client transport encryption [puppet] - 10https://gerrit.wikimedia.org/r/1302904 (https://phabricator.wikimedia.org/T294432) [16:08:50] !log brennen@deploy1003 Started deploy [phabricator/deployment@a640ed9]: deploy phab1004 - T429350 [16:08:50] (03PS1) 10Volans: Cinder backups: 
enable transport encryption part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1302905 (https://phabricator.wikimedia.org/T294432) [16:09:36] !log brennen@deploy1003 Finished deploy [phabricator/deployment@a640ed9]: deploy phab1004 - T429350 (duration: 00m 45s) [16:09:44] (03CR) 10CI reject: [V:04-1] Cinder backups: enable transport encryption part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1302905 (https://phabricator.wikimedia.org/T294432) (owner: 10Volans) [16:09:48] 10SRE-swift-storage, 06Commons: Compressing TIFF files from the Library of Congress - https://phabricator.wikimedia.org/T429264#12024917 (10Ladsgroup) ` MariaDB [commonswiki_p]> select sum(fr_size) from file join filerevision on file_latest = fr_id join page on page_namespace = 6 and file_name = page_title jo... [16:11:50] (03PS2) 10Volans: Cinder backups: enable transport encryption part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1302905 (https://phabricator.wikimedia.org/T294432) [16:11:58] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302904 (https://phabricator.wikimedia.org/T294432) (owner: 10Volans) [16:14:20] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2247 - https://phabricator.wikimedia.org/T429348#12024936 (10Jhancock.wm) a:03Jhancock.wm this definitely falls in the warranty. requested new part from Dell. SR227870004. Will update when drive is replaced. 
[16:19:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:22:35] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:25:09] (03PS23) 10Trueg: dse-k8s-services: WDQS deployment helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297067 (https://phabricator.wikimedia.org/T424338) [16:25:09] (03PS3) 10Trueg: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) [16:25:17] (03CR) 10BCornwall: [C:03+2] admin: Add echukwukere to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/1302270 (https://phabricator.wikimedia.org/T428827) (owner: 10BCornwall) [16:25:17] (03CR) 10BCornwall: [C:03+2] admin: Add mfossati to ml-lab-users [puppet] - 10https://gerrit.wikimedia.org/r/1302272 (https://phabricator.wikimedia.org/T429148) (owner: 10BCornwall) [16:27:47] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1302905 (https://phabricator.wikimedia.org/T294432) (owner: 10Volans) [16:28:29] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team (Q4 FY2025-26), 13Patch-For-Review: Requesting access to ml-lab-users for mfossati - https://phabricator.wikimedia.org/T429148#12025031 (10BCornwall) 05In progress→03Resolved a:03BCornwall Your access has been granted. Please wait up to one ho... 
[16:28:46] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827#12025034 (10BCornwall) 05In progress→03Resolved Access has been granted. Please wait up to one hour for the system changes to propagate. I... [16:32:26] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users level 1 for chudson - https://phabricator.wikimedia.org/T429353#12025054 (10BCornwall) [16:34:59] (03PS1) 10Clare Ming: Update GrowthBook api key for Test Kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302906 (https://phabricator.wikimedia.org/T428985) [16:35:05] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users level 1 for chudson - https://phabricator.wikimedia.org/T429353#12025065 (10BCornwall) @ottomata / @Ahoelzl / @Milimetric Can you approve this as group approvers? @IAckerman-WMF Can you also approve as manager? [16:35:25] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users level 1 for chudson - https://phabricator.wikimedia.org/T429353#12025068 (10BCornwall) 05Open→03In progress p:05Triage→03Medium [16:35:47] (03PS2) 10Clare Ming: Update GrowthBook api key for Test Kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302906 (https://phabricator.wikimedia.org/T428985) [16:35:53] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs: apply [16:36:06] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission rdb2007-rdb2010.codfw.wmnet - https://phabricator.wikimedia.org/T428561#12025086 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:41:49] (03Abandoned) 10Dzahn: contint: switch apache proxying to jenkins to use https [puppet] - 10https://gerrit.wikimedia.org/r/1297216 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:43:41] jouncebot: nowandnext [16:43:41] For the next 0 hour(s) and 16 minute(s): Puppet 
request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1600) [16:43:41] In 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1700) [16:44:03] (03PS1) 10Dreamy Jazz: PublishCaptchaHandler: Only require CAPTCHA for UploadWizard [extensions/UploadWizard] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302908 (https://phabricator.wikimedia.org/T429322) [16:44:14] (03PS1) 10Dreamy Jazz: PublishCaptchaHandler: Only require CAPTCHA for UploadWizard [extensions/UploadWizard] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302909 (https://phabricator.wikimedia.org/T429322) [16:48:43] mutante: Just verifying, are you willing to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1276813 (beta puppet stuff) [16:48:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/scholarly-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:50:04] (03PS16) 10CDanis: fundraising_data_import maintenance script wrapper & timer [puppet] - 10https://gerrit.wikimedia.org/r/1271028 (https://phabricator.wikimedia.org/T416948) [16:50:38] (03PS1) 10Ahmon Dancy: modules/beta/files/wmf-beta-update-databases.py: Keep update.php jobs topped up [puppet] - 10https://gerrit.wikimedia.org/r/1302910 [16:51:46] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-ctrl2006.mgmt:22 - https://phabricator.wikimedia.org/T429283#12025218 (10Jhancock.wm) a:03Jhancock.wm implementation issues caused this trigger. resolved with a status update on netbox. mgmt does in fact ping. gotta wait to close until prom... 
[16:53:46] (03PS1) 10Eevans: data-gateway: deploy v1.0.16 to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302911 (https://phabricator.wikimedia.org/T428218) [16:54:48] FIRING: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:55:16] FIRING: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [16:55:26] 06SRE, 07SRE-Unowned, 13Patch-Needs-Improvement: Some SAL log entries (e.g. switchdc, scap backport) are getting cut off because long lines are being split over IRC - https://phabricator.wikimedia.org/T285709#12025265 (10BCornwall) [16:56:17] (03CR) 10Eevans: [C:03+2] data-gateway: deploy v1.0.16 to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302911 (https://phabricator.wikimedia.org/T428218) (owner: 10Eevans) [16:56:31] (03PS2) 10Ahmon Dancy: modules/beta/files/wmf-beta-update-databases.py: Keep update.php jobs topped up [puppet] - 10https://gerrit.wikimedia.org/r/1302910 [16:58:23] (03Merged) 10jenkins-bot: data-gateway: deploy v1.0.16 to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302911 (https://phabricator.wikimedia.org/T428218) (owner: 10Eevans) [16:58:29] jouncebot: nowandnext [16:58:29] For the next 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1600) [16:58:29] In 0 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1700) [16:59:29] (03CR) 10Volans: "PCC is a noop as expected, being the change totally backward compatible." [puppet] - 10https://gerrit.wikimedia.org/r/1302904 (https://phabricator.wikimedia.org/T294432) (owner: 10Volans) [16:59:29] (03PS1) 10Dreamy Jazz: Revert^2 "hCaptcha: Enable for UploadWizard on all wikis with it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302912 [16:59:40] (03CR) 10Btullis: ceph: allow to set client transport encryption (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302904 (https://phabricator.wikimedia.org/T294432) (owner: 10Volans) [17:00:04] (03CR) 10Volans: "PCC seems fine:" [puppet] - 10https://gerrit.wikimedia.org/r/1302905 (https://phabricator.wikimedia.org/T294432) (owner: 10Volans) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1700) [17:00:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/UploadWizard] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302909 (https://phabricator.wikimedia.org/T429322) (owner: 10Dreamy Jazz) [17:00:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/UploadWizard] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302908 (https://phabricator.wikimedia.org/T429322) (owner: 10Dreamy Jazz) [17:00:25] 06SRE, 06Data-Engineering, 10DNS, 07Kubernetes: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12025310 (10BCornwall) [17:01:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302912 (owner: 10Dreamy Jazz) [17:01:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/UploadWizard] (wmf/1.47.0-wmf.6) - 
10https://gerrit.wikimedia.org/r/1302909 (https://phabricator.wikimedia.org/T429322) (owner: 10Dreamy Jazz) [17:01:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/UploadWizard] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302908 (https://phabricator.wikimedia.org/T429322) (owner: 10Dreamy Jazz) [17:01:55] (03CR) 10Volans: ceph: allow to set client transport encryption (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302904 (https://phabricator.wikimedia.org/T294432) (owner: 10Volans) [17:02:15] (03PS4) 10Daniel Kinzler: rest-gateway: emit 401 if rate limit is 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298031 (https://phabricator.wikimedia.org/T428184) [17:03:10] 06SRE, 07SRE-Unowned, 10DNS, 07Kubernetes: 10.67.28.73 reverse DNS showing 2(SERVFAIL) - https://phabricator.wikimedia.org/T428573#12025334 (10BCornwall) [17:04:16] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8744/co" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [17:07:07] (03Merged) 10jenkins-bot: PublishCaptchaHandler: Only require CAPTCHA for UploadWizard [extensions/UploadWizard] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302909 (https://phabricator.wikimedia.org/T429322) (owner: 10Dreamy Jazz) [17:07:10] (03Merged) 10jenkins-bot: PublishCaptchaHandler: Only require CAPTCHA for UploadWizard [extensions/UploadWizard] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302908 (https://phabricator.wikimedia.org/T429322) (owner: 10Dreamy Jazz) [17:11:07] (03CR) 10CDobbins: "upload:" [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [17:11:08] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for 
[[gerrit:1302912|Revert^2 "hCaptcha: Enable for UploadWizard on all wikis with it"]], [[gerrit:1302909|PublishCaptchaHandler: Only require CAPTCHA for UploadWizard (T429322)]], [[gerrit:1302908|PublishCaptchaHandler: Only require CAPTCHA for UploadWizard (T429322)]] [17:11:13] T429322: hCaptcha UploadWizard: UploadWizard hook fires for all uploads, not just those performed by the upload wizard - https://phabricator.wikimedia.org/T429322 [17:11:54] (03CR) 10Ssingh: "@sbassett@wikimedia.org: This is ready to go from our end, so let us know when we should merge it." [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [17:13:35] (03CR) 10Ssingh: "@bcornwall@wikimedia.org will help merge this, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1298383 (https://phabricator.wikimedia.org/T208443) (owner: 10Gergő Tisza) [17:13:41] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1302858 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [17:20:45] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-logging1007.eqiad.wmnet with OS trixie [17:25:50] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1007.eqiad.wmnet with OS trixie [17:25:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Bliviero - https://phabricator.wikimedia.org/T428815#12025437 (10BCornwall) Please don't be afraid to ask if there's any other access issue :) [17:26:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#12025442 (10elukey) @Jclark-ctr 1007 seems not able to PXE boot, as if it was not properly connected to the network with its main NIC. Could you please double check if ever... 
[17:27:48] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host conf2007.codfw.wmnet with OS trixie [17:29:57] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1302912|Revert^2 "hCaptcha: Enable for UploadWizard on all wikis with it"]], [[gerrit:1302909|PublishCaptchaHandler: Only require CAPTCHA for UploadWizard (T429322)]], [[gerrit:1302908|PublishCaptchaHandler: Only require CAPTCHA for UploadWizard (T429322)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified t [17:29:57] here. [17:30:02] T429322: hCaptcha UploadWizard: UploadWizard hook fires for all uploads, not just those performed by the upload wizard - https://phabricator.wikimedia.org/T429322 [17:30:33] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [17:31:57] (03CR) 10BCornwall: [C:03+2] trafficserver: Add Special:OAuth/approve to multi-DC exemptions [puppet] - 10https://gerrit.wikimedia.org/r/1298383 (https://phabricator.wikimedia.org/T208443) (owner: 10Gergő Tisza) [17:35:22] (03CR) 10BCornwall: [V:03+1 C:03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8745/console" [puppet] - 10https://gerrit.wikimedia.org/r/1298383 (https://phabricator.wikimedia.org/T208443) (owner: 10Gergő Tisza) [17:37:33] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#12025561 (10elukey) I provisioned the host, I see the following in reimage: ` ┌───────────────────────┤ [!!] Partition disks ├────────────────... 
[17:38:02] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [17:38:48] (03PS1) 10Elukey: preseed: fix partman config for the new conf2* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1302921 (https://phabricator.wikimedia.org/T418914) [17:40:33] (03PS1) 10CWilliams: Fix typo zacillo in sre.mysql [cookbooks] - 10https://gerrit.wikimedia.org/r/1302922 (https://phabricator.wikimedia.org/T429382) [17:41:17] (03CR) 10Ladsgroup: [C:03+2] Fix typo zacillo in sre.mysql [cookbooks] - 10https://gerrit.wikimedia.org/r/1302922 (https://phabricator.wikimedia.org/T429382) (owner: 10CWilliams) [17:43:27] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302912|Revert^2 "hCaptcha: Enable for UploadWizard on all wikis with it"]], [[gerrit:1302909|PublishCaptchaHandler: Only require CAPTCHA for UploadWizard (T429322)]], [[gerrit:1302908|PublishCaptchaHandler: Only require CAPTCHA for UploadWizard (T429322)]] (duration: 32m 19s) [17:43:31] T429322: hCaptcha UploadWizard: UploadWizard hook fires for all uploads, not just those performed by the upload wizard - https://phabricator.wikimedia.org/T429322 [17:43:53] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host conf2007.codfw.wmnet with OS trixie [17:44:27] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change mgmt name for frproto1001 - cmooney@cumin1003" [17:44:40] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#12025696 (10elukey) @Jhancock.wm Hi! When you have a moment, could you send to me the BMC passwords? 
[17:46:15] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns5004.wikimedia.org with OS bookworm [17:46:26] (03Merged) 10jenkins-bot: Fix typo zacillo in sre.mysql [cookbooks] - 10https://gerrit.wikimedia.org/r/1302922 (https://phabricator.wikimedia.org/T429382) (owner: 10CWilliams) [17:46:35] (03CR) 10Scott French: [C:03+1] "Ah, good catch! Thank you very much." [puppet] - 10https://gerrit.wikimedia.org/r/1302921 (https://phabricator.wikimedia.org/T418914) (owner: 10Elukey) [17:46:50] !log brett@cumin2002 START - Cookbook sre.hosts.move-vlan for host dns5004 [17:47:02] (03CR) 10Btullis: [C:03+2] Deploy the new version of the ceph-csi plugin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302177 (https://phabricator.wikimedia.org/T428385) (owner: 10Btullis) [17:47:32] cmooney@cumin1003 netbox (PID 235654) is awaiting input [17:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:47:45] (03PS1) 10Lerickson: EventStreamConfig: add stream for WDQS V2 external queries. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302923 (https://phabricator.wikimedia.org/T429380) [17:47:48] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change mgmt name for frproto1001 - cmooney@cumin1003" [17:47:48] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:49:53] brett@cumin2002 reimage (PID 3658819) is awaiting input [17:52:01] (03PS1) 10BCornwall: common: Update dns5004's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1302926 (https://phabricator.wikimedia.org/T428229) [17:52:35] FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:52:48] (03CR) 10Andrea Denisse: [C:03+1] "Topissimo!! +1" [puppet] - 10https://gerrit.wikimedia.org/r/1296529 (owner: 10Ilias Sarantopoulos) [17:53:26] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=dns5004.* [17:54:10] (03CR) 10Ssingh: [C:03+1] common: Update dns5004's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1302926 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [17:54:36] 06SRE, 06Infrastructure-Foundations, 10netops: cr2-esams rpd failure after enabling bgp 'graceful-shutdown' (June 2026) - https://phabricator.wikimedia.org/T429386 (10cmooney) 03NEW p:05Triage→03Low [17:54:57] (03CR) 10BCornwall: [C:03+2] common: Update dns5004's IP address [puppet] - 10https://gerrit.wikimedia.org/r/1302926 (https://phabricator.wikimedia.org/T428229) (owner: 10BCornwall) [17:55:28] (03Merged) 10jenkins-bot: Deploy the new version of the ceph-csi plugin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302177 (https://phabricator.wikimedia.org/T428385) (owner: 10Btullis) [17:56:48] (03CR) 
10Andrea Denisse: [C:03+1] "LGTM, thanks!!" [puppet] - 10https://gerrit.wikimedia.org/r/1302785 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [17:59:43] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [18:00:04] jeena and dduvall: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T1800). Please do the needful. [18:00:08] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for EChukwukere-WMF - https://phabricator.wikimedia.org/T428827#12025832 (10EChukwukere-WMF) @BCornwall .. thanks alot.. I will check this and get back to you [18:00:22] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [18:00:49] !log brett@cumin2002 START - Cookbook sre.dns.netbox [18:01:05] 10ops-eqiad, 06DC-Ops: Relable server WMF5520 - https://phabricator.wikimedia.org/T429388 (10Jgreen) 03NEW [18:02:07] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:03:27] (03PS1) 10Kamila Součková: kubernetes: switch mw-{debug,experimental} to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1302929 (https://phabricator.wikimedia.org/T429030) [18:03:32] (03PS1) 10Kamila Součková: kubernetes: switch MW canaries to bookworm Change MW image flavour for canary releases and misc to bookworm. [puppet] - 10https://gerrit.wikimedia.org/r/1302930 (https://phabricator.wikimedia.org/T429030) [18:03:38] (03PS1) 10Kamila Součková: kubernetes: switch all of MW to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1302931 (https://phabricator.wikimedia.org/T429030) [18:04:30] (03CR) 10CI reject: [V:04-1] kubernetes: switch MW canaries to bookworm Change MW image flavour for canary releases and misc to bookworm. 
[puppet] - 10https://gerrit.wikimedia.org/r/1302930 (https://phabricator.wikimedia.org/T429030) (owner: 10Kamila Součková) [18:05:20] (03CR) 10Kamila Součková: [C:04-1] "-1 until roll out day" [puppet] - 10https://gerrit.wikimedia.org/r/1302929 (https://phabricator.wikimedia.org/T429030) (owner: 10Kamila Součková) [18:05:38] (03CR) 10Kamila Součková: [C:04-1] "-1 until roll out day" [puppet] - 10https://gerrit.wikimedia.org/r/1302930 (https://phabricator.wikimedia.org/T429030) (owner: 10Kamila Součková) [18:06:16] (03CR) 10Kamila Součková: [C:04-1] "DNM: -1 until roll out day" [puppet] - 10https://gerrit.wikimedia.org/r/1302931 (https://phabricator.wikimedia.org/T429030) (owner: 10Kamila Součková) [18:06:50] brett@cumin2002 reimage (PID 3658819) is awaiting input [18:06:58] (03PS2) 10Kamila Součková: kubernetes: switch MW canaries to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1302930 (https://phabricator.wikimedia.org/T429030) [18:07:00] (03PS1) 10TrainBranchBot: group0 to 1.47.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302932 (https://phabricator.wikimedia.org/T423916) [18:07:03] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jhuneidi@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302932 (https://phabricator.wikimedia.org/T423916) (owner: 10TrainBranchBot) [18:08:08] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host dns5004 - brett@cumin2002" [18:08:14] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host dns5004 - brett@cumin2002" [18:08:14] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:08:14] !log brett@cumin2002 START - Cookbook sre.dns.wipe-cache dns5004.wikimedia.org 8.166.102.103.in-addr.arpa 
8.0.0.0.6.6.1.0.2.0.1.0.3.0.1.0.1.0.0.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [18:08:18] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dns5004.wikimedia.org 8.166.102.103.in-addr.arpa 8.0.0.0.6.6.1.0.2.0.1.0.3.0.1.0.1.0.0.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [18:08:19] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns5004 [18:11:01] (03Merged) 10jenkins-bot: group0 to 1.47.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302932 (https://phabricator.wikimedia.org/T423916) (owner: 10TrainBranchBot) [18:11:30] brett@cumin2002 reimage (PID 3658819) is awaiting input [18:12:08] !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns5004 [18:12:08] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host dns5004 [18:12:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:12:50] (03PS1) 10Arlolra: Update definition of html heading to match Parsoid/core [extensions/DiscussionTools] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302934 (https://phabricator.wikimedia.org/T417530) [18:14:40] FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:15:10] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2247 - https://phabricator.wikimedia.org/T429348#12025909 (10Marostegui) Thank you so much! 
[18:16:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:16:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [18:16:38] (03CR) 10C. Scott Ananian: [C:03+1] Update definition of html heading to match Parsoid/core [extensions/DiscussionTools] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302934 (https://phabricator.wikimedia.org/T417530) (owner: 10Arlolra) [18:17:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/DiscussionTools] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302934 (https://phabricator.wikimedia.org/T417530) (owner: 10Arlolra) [18:19:40] FIRING: [2x] JobUnavailable: Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:20:28] (03PS1) 10Jdlrobson: Add wprov parameter to home link [extensions/MobileFrontend] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302935 (https://phabricator.wikimedia.org/T429268) [18:23:38] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.47.0-wmf.7 refs T423916 [18:23:44] T423916: 1.47.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T423916 [18:30:00] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS trixie [18:30:15] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12025947 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host wikikube-ctrl2006.co... [18:33:21] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS trixie [18:33:36] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12025968 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host wikikube-ctrl2006.codfw.... [18:34:45] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS trixie [18:34:58] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12025981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host wikikube-ctrl2006.co... [18:35:56] !log trueg@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs: apply [18:37:09] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026000 (10RobH) When I try to run the reimage, it blows past PXE on the host without trying network boot and moves onto the di... 
[18:37:35] RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:38:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/scholarly-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:39:47] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/data-gateway: apply [18:39:53] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [18:39:57] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [18:40:06] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026007 (10RobH) @Jhancock.wm: >>! In T400661#11226602, @Jhancock.wm wrote: > got the pxe issue fixed. but found a new one. @C... 
[18:40:19] robh@cumin2002 reimage (PID 3669932) is awaiting input [18:40:22] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [18:41:08] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [18:41:22] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/data-gateway: apply [18:41:41] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [18:42:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#12026012 (10Jclark-ctr) @elukey sorry about that it was plugged into wrong port on nokia switch corrected it has link now [18:44:25] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [18:44:36] (03CR) 10Scott French: [C:03+1] kubernetes: switch mw-{debug,experimental} to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1302929 (https://phabricator.wikimedia.org/T429030) (owner: 10Kamila Součková) [18:44:41] (03CR) 10Scott French: [C:03+1] kubernetes: switch MW canaries to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1302930 (https://phabricator.wikimedia.org/T429030) (owner: 10Kamila Součková) [18:44:45] (03CR) 10Scott French: [C:03+1] kubernetes: switch all of MW to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1302931 (https://phabricator.wikimedia.org/T429030) (owner: 10Kamila Součková) [18:45:10] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Repurpose ganeti102[3456] for Zuul migration - https://phabricator.wikimedia.org/T427353#12026015 (10Dzahn) The machines are already in puppet site.pp and partman and can be installed with an OS any time. 
[18:45:44] !log jasmine@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:46:22] jeena: done with train? [18:46:45] yes [18:46:53] * Krinkle looking to squeeze in a config change [18:47:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302274 (https://phabricator.wikimedia.org/T107188) (owner: 10Krinkle) [18:48:16] (03Merged) 10jenkins-bot: Disable ShortUrl on hiwiki, hiwikiversity, maiwiki, knwiki, knwikisource, tcywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302274 (https://phabricator.wikimedia.org/T107188) (owner: 10Krinkle) [18:48:28] (03CR) 10JHathaway: sre.hosts.provision: introduce the wmfroot user (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [18:48:46] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1302274|Disable ShortUrl on hiwiki, hiwikiversity, maiwiki, knwiki, knwikisource, tcywiki (T107188)]] [18:48:50] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188 [18:51:00] PROBLEM - Recursive DNS on 103.102.166.36 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:51:07] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1302274|Disable ShortUrl on hiwiki, hiwikiversity, maiwiki, knwiki, knwikisource, tcywiki (T107188)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:51:11] 06SRE, 06Infrastructure-Foundations, 10netops: Network device tls certs: alerting niggles - https://phabricator.wikimedia.org/T429242#12026032 (10cmooney) So something odd is going on here. If I query thanos only two Nokia devices in eqiad are currently showing a //probe_ssl_earliest_cert_expiry// value: `... 
[18:51:42] 06SRE, 06Infrastructure-Foundations, 10netops: Blackbox probe for TLS cert expriy failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242#12026037 (10cmooney) [18:52:51] 10ops-eqiad, 06SRE, 06DC-Ops: Relable server WMF5520 - https://phabricator.wikimedia.org/T429388#12026042 (10Jclark-ctr) 05Open→03Resolved a:05Jgreen→03Jclark-ctr replaced physical label on server [18:52:55] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:53:54] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#12026046 (10Jclark-ctr) [18:54:34] RECOVERY - Host dse-k8s-worker1009 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [18:55:39] !log krinkle@deploy1003 krinkle: Continuing with deployment [18:56:16] !log jasmine@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:56:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:58:23] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:59:33] RESOLVED: KubernetesCalicoDown: dse-k8s-worker1009.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1009.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:00:01] RESOLVED: [4x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1009:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [19:00:04] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302274|Disable ShortUrl on hiwiki, hiwikiversity, maiwiki, knwiki, knwikisource, tcywiki (T107188)]] (duration: 11m 18s) [19:00:08] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188 [19:01:00] PROBLEM - Recursive DNS on 2001:df2:e500:2:103:102:166:36 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:02:51] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:08:16] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS trixie [19:08:16] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS trixie [19:08:27] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host wikikube-ctrl2006.codfw.... [19:08:30] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026124 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jasmine@cumin2002 for host wikikube-ctrl2006... 
[19:12:35] FIRING: JobUnavailable: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:13:50] jasmine@cumin2002 reimage (PID 3675922) is awaiting input [19:18:15] !log restarting grpc server on eqiad SR-Linux switches to recover from problem of no free threads T429242 [19:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:19] T429242: Blackbox probe for TLS cert expriy failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242 [19:21:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [19:24:13] (03PS1) 10Btullis: Update the storage configuration for the new dse-k8s-wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/1302945 (https://phabricator.wikimedia.org/T423314) [19:27:11] (03CR) 10RLazarus: [C:03+2] "Thanks!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1300941 (owner: 10RLazarus) [19:27:22] !log jasmine@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS trixie [19:27:33] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026264 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jasmine@cumin2002 for host wikikube-ctrl2006.cod... 
[19:29:41] (03PS2) 10Btullis: Update the storage configuration for the new dse-k8s-wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/1302945 (https://phabricator.wikimedia.org/T423314) [19:30:05] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-haproxy rolling restart of HAProxy on A:cp - OpenSSL update () [19:30:41] (03Merged) 10jenkins-bot: cli: argparse fix for Python 3.14 compatibility [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1300941 (owner: 10RLazarus) [19:31:00] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:34:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:34:20] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026308 (10jasmine_) >>! In T406596#12026005, @RobH wrote: > @Jhancock.wm: > >>>! In T400661#11226602, @Jhancock.wm wrote: >>... [19:35:00] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-haproxy rolling restart of HAProxy on A:cp - OpenSSL update () [19:35:06] 06SRE, 06Infrastructure-Foundations, 10netops: Blackbox probe for TLS cert expriy failing on multiple eqiad SR-Linux nodes - https://phabricator.wikimedia.org/T429242#12026310 (10cmooney) 05Open→03Resolved Alight small gap when gnmic had to reconnect but other than that lsw1-c4-eqiad is back working... 
[19:35:17] (03PS3) 10Btullis: Update the storage configuration for the new dse-k8s-wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/1302945 (https://phabricator.wikimedia.org/T423314) [19:38:33] (03CR) 10Btullis: [C:03+2] Update the storage configuration for the new dse-k8s-wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/1302945 (https://phabricator.wikimedia.org/T423314) (owner: 10Btullis) [19:39:12] !log jasmine@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:45:06] (03PS8) 10BCornwall: varnish: Remove reload_vcl_opts function [puppet] - 10https://gerrit.wikimedia.org/r/1298885 [19:45:16] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs1001.eqiad.wmnet with OS bookworm [19:45:19] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs2001.codfw.wmnet with OS bookworm [19:46:24] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:47:27] (03CR) 10CI reject: [V:04-1] varnish: Remove reload_vcl_opts function [puppet] - 10https://gerrit.wikimedia.org/r/1298885 (owner: 10BCornwall) [19:47:49] !log jasmine@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:48:47] (03PS9) 10BCornwall: varnish: Remove reload_vcl_opts function [puppet] - 10https://gerrit.wikimedia.org/r/1298885 [19:54:33] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:55:07] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS trixie [19:55:21] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - 
https://phabricator.wikimedia.org/T406596#12026414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jasmine@cumin2002 for host wikikube-ctrl2006... [19:55:38] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs2001.codfw.wmnet with OS bookworm [19:56:29] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-wdqs1001.eqiad.wmnet with reason: host reimage [19:58:21] (03PS1) 10BCornwall: dns: Increase netbox_dns_snippets clone timeout [puppet] - 10https://gerrit.wikimedia.org/r/1302949 [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T2000). [20:00:05] jdlrobson, kemayo, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] o/ [20:00:14] o/ [20:00:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-wdqs1001.eqiad.wmnet with reason: host reimage [20:00:41] jasmine@cumin2002 reimage (PID 3686909) is awaiting input [20:00:46] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8746/co" [puppet] - 10https://gerrit.wikimedia.org/r/1302949 (owner: 10BCornwall) [20:01:50] (do folks need a deployer or can y'all self-deploy?) [20:01:54] I have one fairly boring patch, and don't mind getting it myself. 
[20:02:09] go for it :) [20:02:21] can i start [20:02:29] I can do my own [20:02:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302320 (owner: 10DLynch) [20:02:40] oh Kemayo you started already? [20:02:58] Oh, sorry, I took TheresNoTime's "go for it" as very me-directed. 😅 [20:03:05] ok np. Ping me when you are done [20:03:17] Will do! [20:09:38] !log jasmine@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS trixie [20:09:51] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026535 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jasmine@cumin2002 for host wikikube-ctrl2006.cod... [20:11:12] (03CR) 10Ssingh: [C:03+1] dns: Increase netbox_dns_snippets clone timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302949 (owner: 10BCornwall) [20:11:26] (03PS2) 10BCornwall: dns: Increase netbox_dns_snippets clone timeout [puppet] - 10https://gerrit.wikimedia.org/r/1302949 [20:12:56] (03PS1) 10DLynch: Update VE core submodule to master (0930c3a9e) [extensions/VisualEditor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302952 (https://phabricator.wikimedia.org/T397501) [20:13:40] (03CR) 10BCornwall: [V:03+2 C:03+2] dns: Increase netbox_dns_snippets clone timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1302949 (owner: 10BCornwall) [20:17:14] (03Merged) 10jenkins-bot: EditChecks: Namespace tracking object for seen/shown/used checks [extensions/VisualEditor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302320 (owner: 10DLynch) [20:17:46] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1302320|EditChecks: Namespace tracking object for seen/shown/used checks]] [20:18:12] !log 
btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [20:19:14] (03PS4) 10Bking: cirrussearch: Add minimal opensearch config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1302280 (https://phabricator.wikimedia.org/T425585) [20:19:41] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1302320|EditChecks: Namespace tracking object for seen/shown/used checks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:20:01] my patch is pretty boring, although it touches Kemayo's code :) [20:20:19] (03CR) 10CI reject: [V:04-1] cirrussearch: Add minimal opensearch config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1302280 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [20:20:57] RECOVERY - Recursive DNS on 103.102.166.36 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:20:57] RECOVERY - Recursive DNS on 2001:df2:e500:2:103:102:166:36 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:21:16] btullis@cumin1003 reimage (PID 248845) is awaiting input [20:22:35] RESOLVED: JobUnavailable: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:23:15] !log kemayo@deploy1003 kemayo: Continuing with deployment [20:25:55] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns5004.*,service=authdns-update [20:26:01] !log brett@dns5004 START - running authdns-update [20:26:22] (03PS2) 10DLynch: Update VE core submodule to master (0930c3a9e) [extensions/VisualEditor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302952 (https://phabricator.wikimedia.org/T406841) [20:26:23] !log 
brett@dns5004 START - running authdns-update [20:26:33] (03PS5) 10Bking: cirrussearch: Add minimal opensearch config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1302280 (https://phabricator.wikimedia.org/T425585) [20:27:37] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302320|EditChecks: Namespace tracking object for seen/shown/used checks]] (duration: 09m 50s) [20:28:02] Kemayo: can i go now? [20:28:05] I have a hard stop in 30m [20:28:07] !log brett@dns5004 FAIL - running authdns-update [20:29:02] !log brett@dns5004 START - running authdns-update [20:29:03] Jdlrobson: Okay, mine's done. [20:29:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302890 (https://phabricator.wikimedia.org/T424596) (owner: 10Jdlrobson) [20:29:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/MobileFrontend] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302935 (https://phabricator.wikimedia.org/T429268) (owner: 10Jdlrobson) [20:30:44] !log brett@dns5004 FAIL - running authdns-update [20:30:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5004.wikimedia.org with OS bookworm [20:31:08] (03PS3) 10DLynch: Update VE core submodule to master (0930c3a9e) [extensions/VisualEditor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302952 (https://phabricator.wikimedia.org/T406841) [20:31:21] (03PS4) 10DLynch: Update VE core submodule to master (0930c3a9e) [extensions/VisualEditor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302952 (https://phabricator.wikimedia.org/T397501) [20:31:30] !log brett@dns1004 START - running authdns-update [20:33:08] !log brett@dns1004 END - running authdns-update [20:33:48] (03PS3) 10DLynch: Update VE core submodule to master (0930c3a9e) [extensions/VisualEditor] (wmf/1.47.0-wmf.6) - 
10https://gerrit.wikimedia.org/r/1302953 (https://phabricator.wikimedia.org/T406841) [20:37:08] (03CR) 10DLynch: [C:04-2] "I'm going to be backporting I134c6e3251c163f19c2aee9c4d9adb2ea63896ff so this will no longer be needed, as it should drag the submodule al" [extensions/VisualEditor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302872 (https://phabricator.wikimedia.org/T428764) (owner: 10WMDE-Fisch) [20:38:11] (As you might guess from all these wikibugs messages, I have developed some more patches I'll want to squash into the end of the window.) [20:38:13] (03PS1) 10Bking: deployment-prep: Update cirrussearch (OpenSearch) config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302956 (https://phabricator.wikimedia.org/T425585) [20:39:57] (03CR) 10Bking: [C:03+2] cirrussearch: Add minimal opensearch config for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/1302280 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [20:40:13] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns5004.* [20:41:15] (03Merged) 10jenkins-bot: Guard round function with a supports query [core] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302890 (https://phabricator.wikimedia.org/T424596) (owner: 10Jdlrobson) [20:41:20] (03Merged) 10jenkins-bot: Add wprov parameter to home link [extensions/MobileFrontend] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302935 (https://phabricator.wikimedia.org/T429268) (owner: 10Jdlrobson) [20:41:48] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1302890|Guard round function with a supports query (T424596)]], [[gerrit:1302935|Add wprov parameter to home link (T429268)]] [20:41:54] T424596: Firefox 115esr doesn't support thumbnail sizes or upright parameter - https://phabricator.wikimedia.org/T424596 [20:41:54] T429268: Prepare web manifest in mobile web for reading list experiments - https://phabricator.wikimedia.org/T429268 [20:43:43] !log
jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1302890|Guard round function with a supports query (T424596)]], [[gerrit:1302935|Add wprov parameter to home link (T429268)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:47:00] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment [20:51:16] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302890|Guard round function with a supports query (T424596)]], [[gerrit:1302935|Add wprov parameter to home link (T429268)]] (duration: 09m 28s) [20:51:22] T424596: Firefox 115esr doesn't support thumbnail sizes or upright parameter - https://phabricator.wikimedia.org/T424596 [20:51:22] T429268: Prepare web manifest in mobile web for reading list experiments - https://phabricator.wikimedia.org/T429268 [20:52:19] cscott: Looks like Jon's one has finished. [20:52:27] yep all done thanks [20:54:05] ok, i'll jump in then. thanks! [20:54:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/DiscussionTools] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302934 (https://phabricator.wikimedia.org/T417530) (owner: 10Arlolra) [20:54:41] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2043.* [21:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260616T2100) [21:00:07] (03CR) 10Santiago Faci: [C:03+2] Update GrowthBook api key for Test Kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302906 (https://phabricator.wikimedia.org/T428985) (owner: 10Clare Ming) [21:07:59] i should have kicked off the merge speculative earlier, jenkins is hosed [21:08:08] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS trixie [21:08:18] (03PS1) 10Urbanecm: linkrecommendation: Bump version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1302968 (https://phabricator.wikimedia.org/T416877) [21:08:25] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host wikikube-ctrl2006.co... [21:08:52] (03CR) 10Urbanecm: [C:03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302968 (https://phabricator.wikimedia.org/T416877) (owner: 10Urbanecm) [21:10:48] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl2006.codfw.wmnet with OS trixie [21:11:01] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026754 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host wikikube-ctrl2006.codfw.... [21:14:55] (03Merged) 10jenkins-bot: Update definition of html heading to match Parsoid/core [extensions/DiscussionTools] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302934 (https://phabricator.wikimedia.org/T417530) (owner: 10Arlolra) [21:15:28] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1302934|Update definition of html heading to match Parsoid/core (T417530 T417531 T428677)]] [21:15:37] T417530: Parsoid shouldn't wrap wikitext html-ish `` tags in
wrappers - https://phabricator.wikimedia.org/T417530 [21:15:37] T417531: Section wrapping should use precise information about HTML-syntax headings - https://phabricator.wikimedia.org/T417531 [21:15:37] T428677: Something is wrong with the rendering of headings on this page (due to new postprocessing of old parser cache entries) - https://phabricator.wikimedia.org/T428677 [21:17:05] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026790 (10RobH) Attempting to reimage and watching the serial console output, it showed no media present for PXE boot check be... [21:17:23] !log cscott@deploy1003 arlolra, cscott: Backport for [[gerrit:1302934|Update definition of html heading to match Parsoid/core (T417530 T417531 T428677)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:17:49] (03Merged) 10jenkins-bot: Update GrowthBook api key for Test Kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302906 (https://phabricator.wikimedia.org/T428985) (owner: 10Clare Ming) [21:18:13] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302968 (https://phabricator.wikimedia.org/T416877) (owner: 10Urbanecm) [21:20:04] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [21:20:21] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host wikikube-ctrl2006.co... 
[21:20:25] (03PS2) 10Bking: deployment-prep: Update cirrussearch (OpenSearch) config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302956 (https://phabricator.wikimedia.org/T425585) [21:20:41] !log urbanecm@deploy1003 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [21:21:12] (03PS3) 10Bking: deployment-prep: Update cirrussearch (OpenSearch) config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302956 (https://phabricator.wikimedia.org/T425585) [21:21:42] !log urbanecm@deploy1003 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [21:23:10] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [21:23:30] FIRING: LibericaDiffFPCheck: Liberica instance lvs5006:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?var-site=eqsin&var-instance=lvs5006 - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [21:23:38] (03CR) 10Ryan Kemper: [C:03+2] dse-k8s: bump opensearch-semantic-search mem quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300945 (https://phabricator.wikimedia.org/T426589) (owner: 10Ryan Kemper) [21:23:47] cscott: testing on the testserver? [21:23:58] yup, working on it [21:24:23] https://phabricator.wikimedia.org/T68637#12015627 didn't come with a prebuilt test case, as i realize belatedly [21:24:41] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [21:24:42] (03CR) 10Dzahn: "I tend to be on the side of "keep it simple" here and would say it's fine to keep doing it directly in the admin UI." 
[puppet] - 10https://gerrit.wikimedia.org/r/1302733 (https://phabricator.wikimedia.org/T425441) (owner: 10Arnaudb) [21:24:50] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [21:24:53] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026818 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host wikikube-ctrl2006.codfw.... [21:25:03] !log urbanecm@deploy1003 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [21:26:50] !log urbanecm@deploy1003 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [21:28:30] RESOLVED: LibericaDiffFPCheck: Liberica instance lvs5006:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?var-site=eqsin&var-instance=lvs5006 - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [21:29:53] !log cscott@deploy1003 arlolra, cscott: Continuing with deployment [21:30:07] (03PS2) 10Clare Ming: Test Kitchen UI: Deploy v1.4.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302881 (https://phabricator.wikimedia.org/T428985) (owner: 10Santiago Faci) [21:30:15] (03CR) 10Dzahn: "I was going to upload a change to do something similar in a different way but before I do that let me ask what the goal is first: Is the " [puppet] - 10https://gerrit.wikimedia.org/r/1178874 (https://phabricator.wikimedia.org/T378028) (owner: 10AOkoth) [21:30:42] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:31:20] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-ctrl2006.mgmt.codfw.wmnet with 
chassis set policy FORCE_RESTART [21:32:25] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:34:10] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302934|Update definition of html heading to match Parsoid/core (T417530 T417531 T428677)]] (duration: 18m 41s) [21:34:17] T417530: Parsoid shouldn't wrap wikitext html-ish `` tags in
wrappers - https://phabricator.wikimedia.org/T417530 [21:34:17] T417531: Section wrapping should use precise information about HTML-syntax headings - https://phabricator.wikimedia.org/T417531 [21:34:18] T428677: Something is wrong with the rendering of headings on this page (due to new postprocessing of old parser cache entries) - https://phabricator.wikimedia.org/T428677 [21:34:27] (03Merged) 10jenkins-bot: dse-k8s: bump opensearch-semantic-search mem quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1300945 (https://phabricator.wikimedia.org/T426589) (owner: 10Ryan Kemper) [21:36:42] huh, it's now back to oem and memory init... [21:36:46] ok... i guess it's not done. [21:37:38] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.4.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302881 (https://phabricator.wikimedia.org/T428985) (owner: 10Santiago Faci) [21:38:59] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:39:45] robh: yeah it did a few iterations of that when I ran it (if I'm not mistaken) [21:39:55] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.4.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302881 (https://phabricator.wikimedia.org/T428985) (owner: 10Santiago Faci) [21:41:48] Anyone going to use the remainder of the Readers window, or can I get my leftover patches in? [21:44:34] i'm done. i think Jdlrobson is done as well. [21:45:00] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS trixie [21:45:15] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026916 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host wikikube-ctrl2006.co...
[21:45:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302953 (https://phabricator.wikimedia.org/T406841) (owner: 10DLynch) [21:45:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302952 (https://phabricator.wikimedia.org/T397501) (owner: 10DLynch) [21:46:08] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:46:20] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:46:24] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:46:43] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [21:46:47] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [21:47:35] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:48:04] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026933 (10RobH) Steps taken: * Confirmed MAC of port 1 (of 2) is indeed connected to switch * successfully ran sudo cookbook s... 
[21:48:26] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl2006.codfw.wmnet with OS trixie [21:48:38] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host wikikube-ctrl2006.codfw.... [21:48:40] !log ryankemper@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [21:49:04] (03CR) 10Cathal Mooney: [C:03+1] Add interface irb.900 to security zone mgmt [homer/public] - 10https://gerrit.wikimedia.org/r/1302337 (https://phabricator.wikimedia.org/T421674) (owner: 10Papaul) [21:49:16] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [21:49:33] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host wikikube-ctrl2006.co... [21:50:29] !log ryankemper@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [21:52:24] !log ryankemper@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[21:52:35] FIRING: [2x] SystemdUnitFailed: database-backups-snapshots.service on cumin2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:53:26] (03Merged) 10jenkins-bot: Update VE core submodule to master (0930c3a9e) [extensions/VisualEditor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302953 (https://phabricator.wikimedia.org/T406841) (owner: 10DLynch) [21:53:26] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12026941 (10RobH) bookworm loader comes up fine, so this is a specific issue with trixie failing to pxe tftp load. [21:57:16] robh@cumin2002 reimage (PID 3712108) is awaiting input [21:58:28] (03Merged) 10jenkins-bot: Update VE core submodule to master (0930c3a9e) [extensions/VisualEditor] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1302952 (https://phabricator.wikimedia.org/T397501) (owner: 10DLynch) [21:58:59] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1302953|Update VE core submodule to master (0930c3a9e) (T406841 T429174 T397501 T424632 T429355)]], [[gerrit:1302952|Update VE core submodule to master (0930c3a9e) (T397501 T424632 T429355)]] [21:59:17] T406841: Detect large IME insertions as pastes - https://phabricator.wikimedia.org/T406841 [21:59:17] T429174: Data model corruption when editing a specific page - https://phabricator.wikimedia.org/T429174 [21:59:18] T397501: [Epic] Fix issues with {{reflist}} or missing references lists and main+details sub-references - https://phabricator.wikimedia.org/T397501 [21:59:18] T424632: Remove unused code from Cite and VE after refactoring - https://phabricator.wikimedia.org/T424632 [21:59:19] T429355: TypeError: can't access property "length", data is null when starting ve on specific
enwiki page - https://phabricator.wikimedia.org/T429355 [22:00:03] 07sre-alert-triage, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 13Patch-For-Review: Alert in need of triage: ResourceQuotaMemoryLimitsWarning - https://phabricator.wikimedia.org/T426589#12026990 (10RKemper) 05Open→03Resolved Deployed the quota change; alert cleared. [22:00:39] (03PS1) 10Santiago Faci: test-kitchen/test-kitchen-next: Fixing GrowthBook flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302973 [22:00:58] (03PS2) 10Santiago Faci: test-kitchen/test-kitchen-next: Fixed typo in the GrowthBook flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302973 [22:01:11] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1302953|Update VE core submodule to master (0930c3a9e) (T406841 T429174 T397501 T424632 T429355)]], [[gerrit:1302952|Update VE core submodule to master (0930c3a9e) (T397501 T424632 T429355)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:01:38] (03CR) 10Clare Ming: [C:03+2] test-kitchen/test-kitchen-next: Fixed typo in the GrowthBook flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302973 (owner: 10Santiago Faci) [22:02:50] !log kemayo@deploy1003 kemayo: Continuing with deployment [22:04:34] (03CR) 10Ryan Kemper: [C:03+1] deployment-prep: Update cirrussearch (OpenSearch) config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1302956 (https://phabricator.wikimedia.org/T425585) (owner: 10Bking) [22:04:58] (03Merged) 10jenkins-bot: test-kitchen/test-kitchen-next: Fixed typo in the GrowthBook flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302973 (owner: 10Santiago Faci) [22:07:10] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1302953|Update VE core submodule to master (0930c3a9e) (T406841 T429174 T397501 T424632 T429355)]], [[gerrit:1302952|Update VE core submodule to master (0930c3a9e) (T397501 T424632 T429355)]] (duration: 08m 11s) [22:07:20] Okay, I'm all done with that. [22:07:21] T406841: Detect large IME insertions as pastes - https://phabricator.wikimedia.org/T406841 [22:07:21] T429174: Data model corruption when editing a specific page - https://phabricator.wikimedia.org/T429174 [22:07:22] T397501: [Epic] Fix issues with {{reflist}} or missing references lists and main+details sub-references - https://phabricator.wikimedia.org/T397501 [22:07:22] T424632: Remove unused code from Cite and VE after refactoring - https://phabricator.wikimedia.org/T424632 [22:07:23] T429355: TypeError: can't access property "length", data is null when starting ve on specific enwiki page - https://phabricator.wikimedia.org/T429355 [22:07:55] (03CR) 10DLynch: [C:04-2] "Backport: done." 
[extensions/VisualEditor] (wmf/1.47.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1302872 (https://phabricator.wikimedia.org/T428764) (owner: 10WMDE-Fisch) [22:08:48] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [22:09:00] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [22:13:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:16:11] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:20:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:22:33] (03PS1) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [22:24:31] (03PS2) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server mismatch [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [22:25:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:27:10] (03CR) 10Dzahn: [C:03+2] "Ok. And it did not mean you have to then ping traffic team members?" 
[dns] - 10https://gerrit.wikimedia.org/r/1302196 (https://phabricator.wikimedia.org/T429189) (owner: 10Dzahn) [22:30:05] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [22:30:16] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12027086 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host wikikube-ctrl2006.codfw.... [22:32:54] (03PS1) 10Santiago Faci: Test Kitchen UI: New config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302982 (https://phabricator.wikimedia.org/T426464) [22:37:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host wikikube-ctrl2006.codfw.wmnet [22:40:18] pt1979@cumin2002 dhcp (PID 3723337) is awaiting input [22:40:45] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: New config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302982 (https://phabricator.wikimedia.org/T426464) (owner: 10Santiago Faci) [22:43:14] (03Merged) 10jenkins-bot: Test Kitchen UI: New config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302982 (https://phabricator.wikimedia.org/T426464) (owner: 10Santiago Faci) [22:49:54] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [22:50:05] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [22:50:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet [22:52:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host wikikube-ctrl2006.codfw.wmnet [22:55:29] pt1979@cumin2002 dhcp (PID 3725299) is awaiting input [22:57:11] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet [22:57:43] 
!log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host wikikube-ctrl2006.codfw.wmnet [23:00:47] pt1979@cumin2002 dhcp (PID 3728203) is awaiting input [23:01:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host wikikube-ctrl2006.codfw.wmnet [23:02:23] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-haproxy (exit_code=0) rolling restart of HAProxy on A:cp - OpenSSL update () [23:03:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS trixie [23:04:18] 06SRE, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware, 07Kubernetes, 13Patch-For-Review: wikikube-ctrl2006 implementation tracking - https://phabricator.wikimedia.org/T406596#12027169 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wikikube-ctrl2006.... [23:07:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 9.607% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:12:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:13:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:38:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime 
for 2:00:00 on wikikube-ctrl2006.codfw.wmnet with reason: host reimage [23:42:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1302987 [23:42:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1302987 (owner: 10TrainBranchBot) [23:44:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2006.codfw.wmnet with reason: host reimage [23:48:38] 06SRE, 06Data-Platform-SRE, 06Data-Engineering (Q1 FS26/27 July 1st - September 30th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994#12027349 (10Ahoelzl) a:03amastilovic [23:50:59] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1302987 (owner: 10TrainBranchBot) [23:54:50] (03PS1) 10RLazarus: test_cli: Update assertEquals to assertEqual [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302988 [23:54:50] (03PS1) 10RLazarus: tox: Bump flake8 to 7.3.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302989 [23:54:50] (03PS1) 10RLazarus: tox: Test up to Python 3.14 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302990 [23:54:51] (03PS1) 10RLazarus: Release 4.0.5 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1302991 [23:58:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn