[00:00:05] Deploy window Abstract Wikipedia off-cadence backend deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T0000) [00:00:08] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp1106 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 43 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:08] RECOVERY - HAProxy HTTPS wikipedia25.org ECDSA on cp1106 is OK: SSL OK - Certificate wikipedia25.org contains all required SANs:Certificate wikipedia25.org (ECDSA) valid until 2026-06-06 06:58:50 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:12] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp1106 is OK: SSL OK - Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2026-05-07 21:41:31 +0000 (expires in 37 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:10:12] (03PS1) 10Cory Massaro: wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-30-195027 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264828 (https://phabricator.wikimedia.org/T413839) [00:12:04] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-30-195027 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264828 (https://phabricator.wikimedia.org/T413839) (owner: 10Cory Massaro) [00:14:16] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-30-195027 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264828 (https://phabricator.wikimedia.org/T413839) (owner: 10Cory Massaro) [00:14:22] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/0 (Transport: cr2-codfw:xe-0/0/1:1 (Arelion, IC-314534 29ms 10Gbps wave) {#11375}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:15:21] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [00:15:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqord and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqord:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:19:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:20:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:23:37] 14SRE-Sprint-Week-Sustainability-March2023, 10SRE-swift-storage, 06Commons, 06Data-Persistence, and 3 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086#11769428 (10Pppery) [00:25:30] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [00:26:27] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [00:27:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.99% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:29:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:32:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:34:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:36:40] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [00:38:09] (03CR) 10Ssingh: [C:03+1] site.pp: Remove deprecated hcaptcha nodes [puppet] - 10https://gerrit.wikimedia.org/r/1264748 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [00:38:27] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [00:39:42] (03CR) 10Ssingh: "I think we can also update modules/profile/data/profile/installserver/preseed.yaml and remove" [puppet] - 10https://gerrit.wikimedia.org/r/1264749 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [00:47:32] (03PS1) 10Krinkle: Remove unused/redundant wgMFNoindexPages=true setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264845 (https://phabricator.wikimedia.org/T255458) [00:48:34] !log apine@deploy1003 helmfile 
[staging] DONE helmfile.d/services/wikifunctions: apply [00:49:32] (03PS1) 10Cory Massaro: Revert "wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-30-195027" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264847 [00:49:38] (03CR) 10Cory Massaro: [C:03+2] Revert "wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-30-195027" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264847 (owner: 10Cory Massaro) [00:52:14] (03Merged) 10jenkins-bot: Revert "wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-30-195027" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264847 (owner: 10Cory Massaro) [00:52:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 19.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:55:15] FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:00:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:01:22] (03PS1) 10KineticPelagic: REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) [01:05:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:07:15] 
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:08:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:08:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:10:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:10:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.46.0-wmf.22 [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264862 (https://phabricator.wikimedia.org/T420480) [01:10:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.46.0-wmf.22 [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264862 (https://phabricator.wikimedia.org/T420480) (owner: 10TrainBranchBot) [01:11:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1264863 [01:11:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1264863 (owner: 10TrainBranchBot) [01:12:30] RESOLVED: 
PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.41% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:19:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:20:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:23:59] (03Merged) 10jenkins-bot: Branch commit for wmf/1.46.0-wmf.22 [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264862 (https://phabricator.wikimedia.org/T420480) (owner: 10TrainBranchBot) [01:24:11] (03Merged) 10jenkins-bot: Branch commit for wmf/next 
[core] (wmf/next) - https://gerrit.wikimedia.org/r/1264863 (owner: TrainBranchBot) [01:24:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [01:26:26] (PS2) KineticPelagic: REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) [01:29:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:29:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [01:29:31] ops-eqiad, SRE, DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11769685 (Papaul) @Jclark-ctr For wikikube-worker 1371 wrong serial number in Netbox it was S497720X5834979 after the 5 is it not a 8 but a B S497720X5B3497... 
[01:49:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:54:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:58:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T0200) [02:01:02] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:08:15] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 07m 13s) [02:08:30] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 69740464 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:08:34] 06SRE, 06serviceops-deprecated, 10Wikimedia-Site-requests, 13Patch-Needs-Improvement: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#11769865 (10Pppery) [02:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:30] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 58168 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:14:22] 06SRE, 06Traffic: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781#11769898 (10Pppery) [02:16:03] 06SRE, 06serviceops-deprecated, 13Patch-Needs-Improvement: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581#11769920 (10Pppery) [02:21:30] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 507030864 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:24:30] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3669800 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:26:43] FIRING: [3x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:31:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:31:43] FIRING: [18x] NodeTextfileStale: Stale textfile for wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:31:48] (03PS1) 10Scott 
French: Revert "Enable $wgTempCategoryCollations for s3 wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264897 (https://phabricator.wikimedia.org/T419274) [02:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:33] FYI, I'm going to be running a backport deployment to revert a mediawiki-config patch shortly. should be doable before mwpresync runs scap stage-train at the top of the hour [02:36:28] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:36:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264897 (https://phabricator.wikimedia.org/T419274) (owner: 10Scott French) [02:37:28] FIRING: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:37:30] (03Merged) 10jenkins-bot: Revert "Enable $wgTempCategoryCollations for s3 wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264897 (https://phabricator.wikimedia.org/T419274) (owner: 10Scott French) [02:38:09] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1264897|Revert "Enable $wgTempCategoryCollations for s3 wikis." 
(T419274)]] [02:38:15] T419274: ICU 72 upgrade: enable remote ICU collation writes - https://phabricator.wikimedia.org/T419274 [02:40:03] !log swfrench@deploy1003 swfrench: Backport for [[gerrit:1264897|Revert "Enable $wgTempCategoryCollations for s3 wikis." (T419274)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [02:41:21] !log swfrench@deploy1003 swfrench: Continuing with sync [02:45:36] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264897|Revert "Enable $wgTempCategoryCollations for s3 wikis." (T419274)]] (duration: 07m 27s) [02:46:01] T419274: ICU 72 upgrade: enable remote ICU collation writes - https://phabricator.wikimedia.org/T419274 [02:47:19] all done on my end. I'll follow up on the above task ^^ [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T0300) [03:01:55] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264913 (https://phabricator.wikimedia.org/T420480) [03:02:16] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264913 (https://phabricator.wikimedia.org/T420480) (owner: 10TrainBranchBot) [03:03:12] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264913 (https://phabricator.wikimedia.org/T420480) (owner: 10TrainBranchBot) [03:03:33] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.46.0-wmf.22 refs T420480 [03:03:46] T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480 [03:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.8% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:11:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:19:01] (03PS1) 10Catrope: Email confirmation banner: Add Test Kitchen A/B gating [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264921 (https://phabricator.wikimedia.org/T421366) [03:19:07] (03PS1) 10Catrope: Add instrumentation for email confirmation lifecycle events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264922 (https://phabricator.wikimedia.org/T420007) [03:41:14] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.46.0-wmf.22 refs T420480 (duration: 37m 41s) [03:41:21] T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T0400) [04:02:31] !log mwpresync@deploy1003 Pruned MediaWiki: 1.46.0-wmf.19 (duration: 02m 29s) [04:19:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:20:39] FIRING: [2x] 
CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:49:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:12] PROBLEM - MariaDB read only s3 on clouddb1022 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:16:12] PROBLEM - MariaDB read only wikireplica-s3 on clouddb1022 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:16:48] PROBLEM - MariaDB Replica Lag: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 24693.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:17:14] RECOVERY - MariaDB read only wikireplica-s3 on clouddb1022 is OK: Version 10.11.16-MariaDB, Uptime 58s, read_only: True, event_scheduler: False, 13575.00 QPS, connection latency: 0.031343s, query latency: 0.000570s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:17:14] RECOVERY - MariaDB read only s3 on clouddb1022 is OK: Version 10.11.16-MariaDB, Uptime 58s, read_only: True, event_scheduler: False, 13613.11 QPS, connection latency: 0.028512s, query latency: 0.001114s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:19:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast6003.wikimedia.org [05:21:48] RECOVERY - MariaDB Replica Lag: s3 on clouddb1023 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:21:54] (03CR) 10Muehlenhoff: "The old VMs also needs to be dropped from conftool-data/node/[eqiad|codfw].yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1264748 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [05:22:48] RECOVERY - MariaDB Replica Lag: s3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:26:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast6003.wikimedia.org [05:28:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3007.wikimedia.org [05:30:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3007.wikimedia.org [05:31:29] (03PS2) 10Ryan Kemper: sre.elasticsearch.rolling-operation: use boottime for reboot operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1235113 (https://phabricator.wikimedia.org/T410577) [05:32:06] (03CR) 10Ryan Kemper: sre.elasticsearch.rolling-operation: use boottime for reboot operations (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1235113 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [05:38:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [05:47:21] (03CR) 10KartikMistry: [C:03+1] "LGTM. Feel free to deploy or let me know if we can go ahead with sometime this week." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1264671 (https://phabricator.wikimedia.org/T335491) (owner: JMeybohm) [05:49:20] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) (owner: Elukey) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T0600) [06:00:05] marostegui, Amir1, and federico3: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T0600). Please do the needful. [06:01:44] (PS1) Muehlenhoff: Remove obsolete alert [alerts] - https://gerrit.wikimedia.org/r/1265191 (https://phabricator.wikimedia.org/T421517) [06:06:48] PROBLEM - MariaDB Replica Lag: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 550.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:18:32] (CR) Slyngshede: [C:+1] Registe rairflow-fr-tech-ops [puppet] - https://gerrit.wikimedia.org/r/1264628 (https://phabricator.wikimedia.org/T421703) (owner: Muehlenhoff) [06:24:36] (CR) Arnaudb: [C:+1] gerrit: prevent crawling patches/archive files [puppet] - https://gerrit.wikimedia.org/r/1264733 (owner: Hashar) [06:26:43] FIRING: [3x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:28:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [06:30:47] (CR) Muehlenhoff: [C:+1] "Looks good, one nit inline" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1249279 (https://phabricator.wikimedia.org/T419419) (owner: Slyngshede) [06:31:43] FIRING: 
[2x] NodeTextfileStale: Stale textfile for wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:31:48] FIRING: [18x] NodeTextfileStale: Stale textfile for wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:36:02] (03CR) 10Muehlenhoff: [C:03+2] Registe rairflow-fr-tech-ops [puppet] - 10https://gerrit.wikimedia.org/r/1264628 (https://phabricator.wikimedia.org/T421703) (owner: 10Muehlenhoff) [06:36:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [06:36:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T419635)', diff saved to https://phabricator.wikimedia.org/P90011 and previous config saved to /var/cache/conftool/dbconfig/20260331-063611-fceratto.json [06:36:17] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [06:36:43] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:37:23] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host titan1001.eqiad.wmnet [06:37:28] FIRING: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:37:38] !log tappof@cumin1003 END 
(FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host titan1001.eqiad.wmnet [06:37:48] RECOVERY - MariaDB Replica Lag: s3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:38:08] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host titan1001.eqiad.wmnet [06:38:12] !log tappof@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host titan1001.eqiad.wmnet [06:38:52] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host titan1001.eqiad.wmnet [06:43:53] (03PS1) 10Muehlenhoff: Bitu: Add approval config for airflow-fr-tech-ops [puppet] - 10https://gerrit.wikimedia.org/r/1265224 (https://phabricator.wikimedia.org/T421703) [06:46:21] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1001.eqiad.wmnet [06:49:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [06:50:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:50:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T419635)', diff saved to https://phabricator.wikimedia.org/P90012 and previous config saved to /var/cache/conftool/dbconfig/20260331-065027-fceratto.json [06:50:33] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [06:53:02] (03PS1) 10Marostegui: clouddb1013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1265230 [06:53:52] (03CR) 10Marostegui: [C:03+2] clouddb1013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1265230 (owner: 10Marostegui) [06:55:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 5:00:00 on 12 hosts with reason: Downgrade to 10.11.13 [06:55:17] !log installing postgresql-15 security updates [06:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:35] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster search_codfw: test reboot boottime check T410577 - ryankemper@cumin2002 - T410577 [06:55:41] T410577: sre.elasticsearch.rolling-operation: Fix reboot --start-datetime logic - https://phabricator.wikimedia.org/T410577 [06:56:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2011.codfw.wmnet [06:58:38] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host titan1002.eqiad.wmnet [06:58:48] (03PS1) 10Marostegui: Revert "clouddb1013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1265234 [07:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:23] (03PS3) 10Slyngshede: CAS 7.3.5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1249279 (https://phabricator.wikimedia.org/T419419) [07:01:25] (03CR) 10Marostegui: [C:03+2] Revert "clouddb1013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1265234 (owner: 10Marostegui) [07:01:34] (03CR) 10Slyngshede: CAS 7.3.5 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1249279 (https://phabricator.wikimedia.org/T419419) (owner: 10Slyngshede) [07:02:30] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1249279 (https://phabricator.wikimedia.org/T419419) (owner: 10Slyngshede) [07:02:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2011.codfw.wmnet [07:06:04] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1002.eqiad.wmnet [07:09:23] (03CR) 10Slyngshede: [V:03+2 C:03+2] CAS 7.3.5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1249279 (https://phabricator.wikimedia.org/T419419) (owner: 10Slyngshede) [07:11:08] (03CR) 10Volans: [C:03+1] "LGTM" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1243135 (owner: 10Muehlenhoff) [07:11:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T419635)', diff saved to https://phabricator.wikimedia.org/P90013 and previous config saved to /var/cache/conftool/dbconfig/20260331-071140-fceratto.json [07:11:46] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:14:53] !log installing mongo-c-driver security updates [07:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:23] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1331.eqiad.wmnet with OS trixie [07:16:51] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1331 [07:16:59] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [07:21:34] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1331 - ayounsi@cumin1003" [07:21:40] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1331 - ayounsi@cumin1003" [07:21:40] !log 
ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:21:40] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1331.eqiad.wmnet 172.48.64.10.in-addr.arpa 2.7.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [07:21:45] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1331.eqiad.wmnet 172.48.64.10.in-addr.arpa 2.7.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [07:21:46] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1331 [07:21:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P90014 and previous config saved to /var/cache/conftool/dbconfig/20260331-072147-fceratto.json [07:22:46] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster search_codfw: test reboot boottime check T410577 - ryankemper@cumin2002 - T410577 [07:22:52] T410577: sre.elasticsearch.rolling-operation: Fix reboot --start-datetime logic - https://phabricator.wikimedia.org/T410577 [07:23:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1331 [07:23:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1331 [07:23:15] !log T410577 ^ cookbook did its job, ctrl+c'd after one host was rebooted. 
new spicerack upgrade confirmed working [07:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:28] (03CR) 10Ryan Kemper: [C:03+2] sre.elasticsearch.rolling-operation: use boottime for reboot operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1235113 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [07:26:03] (03Merged) 10jenkins-bot: sre.elasticsearch.rolling-operation: use boottime for reboot operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1235113 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [07:27:50] (03CR) 10Santiago Faci: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381) (owner: 10Santiago Faci) [07:29:33] (03PS1) 10Marostegui: clouddb1014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1265257 [07:29:51] (03PS1) 10Slyngshede: Debugging config [software/bitu] - 10https://gerrit.wikimedia.org/r/1265258 [07:30:21] (03CR) 10Marostegui: [C:03+2] clouddb1014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1265257 (owner: 10Marostegui) [07:31:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P90015 and previous config saved to /var/cache/conftool/dbconfig/20260331-073155-fceratto.json [07:32:06] (03PS1) 10Marostegui: Revert "clouddb1014: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1265259 [07:32:47] (03CR) 10Marostegui: [C:03+2] Revert "clouddb1014: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1265259 (owner: 10Marostegui) [07:34:20] (03PS1) 10Fabfur: New release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1265263 [07:34:32] (03CR) 10Fabfur: [V:03+2 C:03+2] New release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1265263 (owner: 10Fabfur) [07:34:37] (03CR) 10JMeybohm: "You can do it anytime that 
suits you. The outdated state just surfaced during other work. But since there is no functional change its not " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264671 (https://phabricator.wikimedia.org/T335491) (owner: 10JMeybohm) [07:34:53] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1331.eqiad.wmnet with reason: host reimage [07:35:27] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [07:35:38] !log jayme@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99) [07:35:38] !log fabfur@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "New deploy for MR 152 - fabfur@cumin1003" [07:35:39] !log fabfur@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: New deploy for MR 152 - fabfur@cumin1003 [07:36:24] !log fabfur@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: New deploy for MR 152 - fabfur@cumin1003 [07:36:25] !log fabfur@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "New deploy for MR 152 - fabfur@cumin1003" [07:36:43] !log fabfur@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "New deploy for MR 152 - fabfur@cumin1003" [07:36:44] !log fabfur@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: New deploy for MR 152 - fabfur@cumin1003 [07:36:47] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [07:36:58] !log jayme@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99) [07:37:28] !log fabfur@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: New deploy for MR 152 - 
fabfur@cumin1003 [07:37:29] !log fabfur@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "New deploy for MR 152 - fabfur@cumin1003" [07:38:59] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1331.eqiad.wmnet with reason: host reimage [07:39:17] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:40:30] (03PS1) 10Brouberol: Remove an-tool1007 from site [puppet] - 10https://gerrit.wikimedia.org/r/1265265 (https://phabricator.wikimedia.org/T416127) [07:40:30] (03PS1) 10Brouberol: Remove an-tool1011 from site [puppet] - 10https://gerrit.wikimedia.org/r/1265266 (https://phabricator.wikimedia.org/T416127) [07:41:17] (03CR) 10CI reject: [V:04-1] Remove an-tool1007 from site [puppet] - 10https://gerrit.wikimedia.org/r/1265265 (https://phabricator.wikimedia.org/T416127) (owner: 10Brouberol) [07:42:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T419635)', diff saved to https://phabricator.wikimedia.org/P90016 and previous config saved to /var/cache/conftool/dbconfig/20260331-074202-fceratto.json [07:42:09] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:42:20] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:42:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T419635)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260331-074227-fceratto.json [07:42:58] (03PS2) 10Brouberol: Remove an-tool1011 from site [puppet] - 10https://gerrit.wikimedia.org/r/1265266 (https://phabricator.wikimedia.org/T416127) 
[07:42:58] (03PS1) 10Brouberol: Remove an-tool1007 from site [puppet] - 10https://gerrit.wikimedia.org/r/1265268 (https://phabricator.wikimedia.org/T416127) [07:43:27] (03PS1) 10Marostegui: clouddb1015: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1265269 [07:44:04] (03Abandoned) 10Brouberol: Remove an-tool1007 from site [puppet] - 10https://gerrit.wikimedia.org/r/1265265 (https://phabricator.wikimedia.org/T416127) (owner: 10Brouberol) [07:44:35] (03CR) 10Marostegui: [C:03+2] clouddb1015: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1265269 (owner: 10Marostegui) [07:44:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T419635)', diff saved to https://phabricator.wikimedia.org/P90018 and previous config saved to /var/cache/conftool/dbconfig/20260331-074442-fceratto.json [07:46:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2012.codfw.wmnet [07:46:52] (03PS2) 10Fabfur: hiera: upgrade haproxy to version 3.2 on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) [07:46:52] (03PS3) 10Fabfur: hiera: upgrade haproxy to version magru [puppet] - 10https://gerrit.wikimedia.org/r/1262060 (https://phabricator.wikimedia.org/T421402) [07:46:52] (03PS2) 10Fabfur: hiera: upgrade haproxy to version 3.2 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1262061 (https://phabricator.wikimedia.org/T421402) [07:46:53] (03PS2) 10Fabfur: hiera: upgrade haproxy to version 3.2 on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1262062 (https://phabricator.wikimedia.org/T421402) [07:46:54] (03PS2) 10Fabfur: hiera: upgrade haproxy to version 3.2 on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1262063 (https://phabricator.wikimedia.org/T421402) [07:46:55] (03PS2) 10Fabfur: hiera: upgrade haproxy to version 3.2 on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1262064 (https://phabricator.wikimedia.org/T421402) [07:46:58] 
(03PS2) 10Fabfur: hiera: upgrade haproxy to version 3.2 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1262065 (https://phabricator.wikimedia.org/T421402) [07:49:23] (03PS1) 10Marostegui: Revert "clouddb1015: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1265270 [07:50:47] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11770735 (10MoritzMuehlenhoff) [07:50:49] (03CR) 10Marostegui: [C:03+2] Revert "clouddb1015: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1265270 (owner: 10Marostegui) [07:51:17] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [07:51:43] (03PS1) 10Brouberol: turnilo: remove associated resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1265273 (https://phabricator.wikimedia.org/T416126) [07:51:55] (03PS4) 10Fabfur: hiera: upgrade haproxy to version 3.2 magru [puppet] - 10https://gerrit.wikimedia.org/r/1262060 (https://phabricator.wikimedia.org/T421402) [07:52:17] (03PS2) 10Brouberol: turnilo: remove associated resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1265273 (https://phabricator.wikimedia.org/T416126) [07:52:41] (03PS5) 10Fabfur: hiera: upgrade haproxy to version 3.2 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1262060 (https://phabricator.wikimedia.org/T421402) [07:52:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2012.codfw.wmnet [07:54:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P90019 and previous config saved to /var/cache/conftool/dbconfig/20260331-075450-fceratto.json [07:55:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1331.eqiad.wmnet with OS trixie [07:55:24] PROBLEM - PyBal 
backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:56:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2013.codfw.wmnet [07:56:20] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:57:20] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:57:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:00:05] jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T0800) [08:00:42] morning, train is blocked on T421828 [08:00:43] T421828: PHP Warning: Undefined array key "user_identifier_type" - https://phabricator.wikimedia.org/T421828 [08:00:53] (03PS1) 10Brouberol: analytics/turnilo: remove associated resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1265278 (https://phabricator.wikimedia.org/T416126) [08:01:41] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1332.eqiad.wmnet with OS trixie [08:01:42] (03Abandoned) 10Brouberol: turnilo: remove associated resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1265273 (https://phabricator.wikimedia.org/T416126) (owner: 10Brouberol) [08:02:09] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1332 [08:02:09] (03PS1) 10Marostegui: clouddb1016: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1265284 [08:02:23] (03CR) 
10Joal: [C:03+1] Remove an-tool1007 from site [puppet] - 10https://gerrit.wikimedia.org/r/1265268 (https://phabricator.wikimedia.org/T416127) (owner: 10Brouberol) [08:02:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2013.codfw.wmnet [08:02:32] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1265224 (https://phabricator.wikimedia.org/T421703) (owner: 10Muehlenhoff) [08:02:32] (03CR) 10Joal: [C:03+1] Remove an-tool1011 from site [puppet] - 10https://gerrit.wikimedia.org/r/1265266 (https://phabricator.wikimedia.org/T416127) (owner: 10Brouberol) [08:03:01] (03CR) 10Marostegui: [C:03+2] clouddb1016: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1265284 (owner: 10Marostegui) [08:03:12] PROBLEM - orchestrator process on dborch1002 is CRITICAL: PROCS CRITICAL: 2 processes with regex args orchestrator http https://wikitech.wikimedia.org/wiki/Orchestrator [08:03:53] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [08:04:08] (03CR) 10Joal: [C:03+1] analytics/turnilo: remove associated resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1265278 (https://phabricator.wikimedia.org/T416126) (owner: 10Brouberol) [08:04:12] RECOVERY - orchestrator process on dborch1002 is OK: PROCS OK: 1 process with regex args orchestrator http https://wikitech.wikimedia.org/wiki/Orchestrator [08:04:30] (03CR) 10Brouberol: [C:03+2] Remove an-tool1007 from site [puppet] - 10https://gerrit.wikimedia.org/r/1265268 (https://phabricator.wikimedia.org/T416127) (owner: 10Brouberol) [08:04:34] (03CR) 10Brouberol: [C:03+2] Remove an-tool1011 from site [puppet] - 10https://gerrit.wikimedia.org/r/1265266 (https://phabricator.wikimedia.org/T416127) (owner: 10Brouberol) [08:05:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to Unable to send diff to phaste and previous config saved to 
/var/cache/conftool/dbconfig/20260331-080459-fceratto.json [08:06:36] (03PS1) 10Marostegui: Revert "clouddb1016: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1265316 [08:06:40] !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-tool1007.eqiad.wmnet [08:07:27] (03CR) 10Marostegui: [C:03+2] Revert "clouddb1016: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1265316 (owner: 10Marostegui) [08:07:33] !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-tool1011.eqiad.wmnet [08:07:35] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1332 - ayounsi@cumin1003" [08:07:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1332 - ayounsi@cumin1003" [08:07:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:07:41] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1332.eqiad.wmnet 190.48.64.10.in-addr.arpa 0.9.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:07:45] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1332.eqiad.wmnet 190.48.64.10.in-addr.arpa 0.9.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:07:46] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1332 [08:09:06] (03PS4) 10Arnaudb: gerrit: adjust idleTimeout on Jetty [puppet] - 10https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T421827) [08:09:18] (03PS1) 10Brouberol: trafficserver: remove deprecated references to pivot.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1265317 
(https://phabricator.wikimedia.org/T416126) [08:10:48] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host titan2002.codfw.wmnet [08:11:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1332 [08:11:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1332 [08:12:14] !log brouberol@cumin1003 START - Cookbook sre.dns.netbox [08:14:23] (03PS1) 10Arnaudb: gerrit: update Envoy upstream response timeout [puppet] - 10https://gerrit.wikimedia.org/r/1265322 (https://phabricator.wikimedia.org/T421827) [08:15:16] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T419635)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260331-081511-fceratto.json [08:15:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:15:32] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:15:34] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [08:15:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T419635)', diff saved to https://phabricator.wikimedia.org/P90020 and previous config saved to /var/cache/conftool/dbconfig/20260331-081541-fceratto.json [08:16:09] !log brouberol@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-tool1007.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [08:16:43] !log brouberol@cumin1003 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-tool1007.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [08:16:43] !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:16:44] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-tool1007.eqiad.wmnet [08:16:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T419635)', diff saved to https://phabricator.wikimedia.org/P90021 and previous config saved to /var/cache/conftool/dbconfig/20260331-081651-fceratto.json [08:16:59] !log brouberol@cumin1003 START - Cookbook sre.dns.netbox [08:17:04] (03CR) 10Arnaudb: [C:03+2] gerrit: update Envoy upstream response timeout [puppet] - 10https://gerrit.wikimedia.org/r/1265322 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [08:17:08] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2002.codfw.wmnet [08:18:19] (03CR) 10Joal: [C:03+1] trafficserver: remove deprecated references to pivot.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1265317 (https://phabricator.wikimedia.org/T416126) (owner: 10Brouberol) [08:18:34] (03CR) 10Brouberol: [C:03+2] trafficserver: remove deprecated references to pivot.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1265317 (https://phabricator.wikimedia.org/T416126) (owner: 10Brouberol) [08:18:41] (03CR) 10Brouberol: [C:03+2] analytics/turnilo: remove associated resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1265278 (https://phabricator.wikimedia.org/T416126) (owner: 10Brouberol) [08:19:37] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:19:44] !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:19:45] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-tool1011.eqiad.wmnet [08:20:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:20:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:22:18] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/pop: consolidate the firewall provider declaration at the role level. [puppet] - 10https://gerrit.wikimedia.org/r/1264339 (https://phabricator.wikimedia.org/T419960) (owner: 10Tiziano Fogli) [08:23:00] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1332.eqiad.wmnet with reason: host reimage [08:24:38] !log upgrade spicerack on cumin1003 [08:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:09] (03PS1) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) [08:27:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P90022 and previous config saved to /var/cache/conftool/dbconfig/20260331-082700-fceratto.json [08:29:31] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1332.eqiad.wmnet with reason: host reimage [08:37:07] !log fceratto@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P90023 and previous config saved to /var/cache/conftool/dbconfig/20260331-083707-fceratto.json [08:40:44] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [08:42:55] (03PS16) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [08:44:39] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [08:44:39] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.check-ipip (exit_code=0) [08:45:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1332.eqiad.wmnet with OS trixie [08:47:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T419635)', diff saved to https://phabricator.wikimedia.org/P90024 and previous config saved to /var/cache/conftool/dbconfig/20260331-084714-fceratto.json [08:47:22] (03CR) 10Filippo Giunchedi: [C:03+2] openstack: enable trove-guestagent rabbit transient quorum queues [puppet] - 10https://gerrit.wikimedia.org/r/1264557 (https://phabricator.wikimedia.org/T421054) (owner: 10Filippo Giunchedi) [08:47:24] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:47:34] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [08:47:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T419635)', diff saved to https://phabricator.wikimedia.org/P90025 and previous config saved to /var/cache/conftool/dbconfig/20260331-084742-fceratto.json [08:49:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on 
maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:41] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [08:49:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T419635)', diff saved to https://phabricator.wikimedia.org/P90026 and previous config saved to /var/cache/conftool/dbconfig/20260331-084951-fceratto.json [08:53:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2014.codfw.wmnet [08:56:31] (03CR) 10Elukey: [C:03+1] cli: allow interactive mode with multiple commands [software/cumin] - 10https://gerrit.wikimedia.org/r/1264707 (owner: 10Volans) [08:56:53] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1333.eqiad.wmnet with OS trixie [08:57:21] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1333 [08:57:29] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [08:59:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2014.codfw.wmnet [08:59:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P90027 and previous config saved to /var/cache/conftool/dbconfig/20260331-085958-fceratto.json [09:01:11] (03PS1) 10Giuseppe Lavagetto: Remove body from request patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1265343 [09:01:13] (03CR) 10Volans: [C:03+2] cli: allow interactive mode with multiple commands [software/cumin] - 10https://gerrit.wikimedia.org/r/1264707 (owner: 10Volans) [09:01:21] jouncebot: nowandnext [09:01:21] For the next 0 hour(s) and 58 minute(s): 
MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T0800) [09:01:21] In 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1000) [09:01:22] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Remove body from request patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1265343 (owner: 10Giuseppe Lavagetto) [09:03:08] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "remove body from patterns - oblivian@cumin1003" [09:03:09] ayounsi@cumin1003 reimage (PID 116282) is awaiting input [09:03:10] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: remove body from patterns - oblivian@cumin1003 [09:04:04] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: remove body from patterns - oblivian@cumin1003 [09:04:06] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "remove body from patterns - oblivian@cumin1003" [09:05:01] (03CR) 10Brouberol: [C:03+1] Add andreawest to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1253568 (https://phabricator.wikimedia.org/T420053) (owner: 10Btullis) [09:07:49] !log pfw1-eqiad - add NAT rule - T421750 [09:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:38] (03PS1) 10Giuseppe Lavagetto: Adding post-deploy step [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1265349 [09:09:47] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+1] Adding post-deploy step [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1265349 (owner: 10Giuseppe Lavagetto) [09:10:07] !log fceratto@cumin1003 dbctl 
commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P90029 and previous config saved to /var/cache/conftool/dbconfig/20260331-091007-fceratto.json [09:13:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2001.codfw.wmnet [09:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:40] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1333 - ayounsi@cumin1003" [09:14:46] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1333 - ayounsi@cumin1003" [09:14:46] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:14:46] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1333.eqiad.wmnet 191.48.64.10.in-addr.arpa 1.9.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:14:50] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1333.eqiad.wmnet 191.48.64.10.in-addr.arpa 1.9.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:14:50] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1333 [09:15:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1333 [09:15:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1333 [09:16:25] (03Merged) 10jenkins-bot: cli: allow 
interactive mode with multiple commands [software/cumin] - 10https://gerrit.wikimedia.org/r/1264707 (owner: 10Volans) [09:20:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T419635)', diff saved to https://phabricator.wikimedia.org/P90030 and previous config saved to /var/cache/conftool/dbconfig/20260331-092014-fceratto.json [09:20:21] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:20:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [09:20:33] (03CR) 10Btullis: [C:03+2] Add andreawest to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1253568 (https://phabricator.wikimedia.org/T420053) (owner: 10Btullis) [09:20:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1201 (T419635)', diff saved to https://phabricator.wikimedia.org/P90031 and previous config saved to /var/cache/conftool/dbconfig/20260331-092038-fceratto.json [09:21:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2001.codfw.wmnet [09:21:42] (03CR) 10Elukey: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [09:21:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T419635)', diff saved to https://phabricator.wikimedia.org/P90032 and previous config saved to /var/cache/conftool/dbconfig/20260331-092148-fceratto.json [09:22:29] (03CR) 10Elukey: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [09:22:34] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [09:23:08] (03Restored) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [09:23:53] (03CR) 10Vgutierrez: [C:04-1] "please cleanup cp6001.yaml and cp6009.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:27:22] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1333.eqiad.wmnet with reason: host reimage [09:27:45] (03PS3) 10Fabfur: hiera: upgrade haproxy to version 3.2 on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) [09:28:08] (03CR) 10Fabfur: "ack tnx" [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:28:46] (03PS8) 10Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) [09:30:45] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:31:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P90033 and previous config saved to /var/cache/conftool/dbconfig/20260331-093156-fceratto.json [09:32:20] (03PS1) 10D3r1ck01: Set a JWT cookie for OAuth 1 and OAuth 2 owner-only requests [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265367 (https://phabricator.wikimedia.org/T417833) [09:33:00] (03PS1) 10D3r1ck01: tests: OAuth1 and OAuth2 owner-only JWT support [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265368 (https://phabricator.wikimedia.org/T417833) 
[09:33:56] (03PS1) 10D3r1ck01: tests: Add test for asserting JWT cookie not set for OAuth2 consumers [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265369 (https://phabricator.wikimedia.org/T417833) [09:35:02] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1333.eqiad.wmnet with reason: host reimage [09:38:30] (03PS1) 10Muehlenhoff: mariadb::ferm: Rewrite ferm::rule as firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1265374 (https://phabricator.wikimedia.org/T421705) [09:42:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P90034 and previous config saved to /var/cache/conftool/dbconfig/20260331-094205-fceratto.json [09:43:03] 06SRE-OnFire: Cortobot help command should not spam the main channel - https://phabricator.wikimedia.org/T421858 (10jijiki) 03NEW [09:45:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org [09:45:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1265374 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [09:46:08] (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/1265379 [09:49:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org [09:51:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1333.eqiad.wmnet with OS trixie [09:52:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T419635)', diff saved to https://phabricator.wikimedia.org/P90035 and previous config saved to /var/cache/conftool/dbconfig/20260331-095213-fceratto.json [09:52:16] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker1347.eqiad.wmnet} 
and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [09:52:19] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:52:20] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1347.eqiad.wmnet [09:52:31] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [09:52:55] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1347.eqiad.wmnet [09:53:37] (03PS2) 10Elukey: profile::pki::intermediates: refresh discovery's public key [puppet] - 10https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) [09:53:37] (03PS1) 10Elukey: cfssl::cert: handle the rotation of the intermediate keypair [puppet] - 10https://gerrit.wikimedia.org/r/1265382 (https://phabricator.wikimedia.org/T420993) [09:54:25] (03PS3) 10Elukey: profile::pki::intermediates: refresh discovery's public key [puppet] - 10https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) [09:54:27] (03PS9) 10Btullis: opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) [09:55:20] (03PS2) 10Elukey: cfssl::cert: handle the rotation of the intermediate keypair [puppet] - 10https://gerrit.wikimedia.org/r/1265382 (https://phabricator.wikimedia.org/T420993) [09:55:20] (03PS4) 10Elukey: profile::pki::intermediates: refresh discovery's public key [puppet] - 10https://gerrit.wikimedia.org/r/1264669 (https://phabricator.wikimedia.org/T420993) [09:57:34] (03CR) 10Ladsgroup: [C:03+1] mariadb::ferm: Rewrite ferm::rule as firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1265374 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [09:58:14] !log jayme@cumin1003 START - Cookbook 
sre.k8s.pool-depool-node pool for host wikikube-worker1347.eqiad.wmnet [09:58:15] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1347.eqiad.wmnet [09:58:15] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker1347.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [09:58:21] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1334.eqiad.wmnet with OS trixie [09:58:36] (03CR) 10Hashar: "And this morning I found the timing for those requests in Gerrit Javamelody monitoring. It keeps track of metrics for various kind of requ" [puppet] - 10https://gerrit.wikimedia.org/r/1264733 (owner: 10Hashar) [09:58:48] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1334 [09:59:00] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [09:59:55] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1000) [10:02:53] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1334 - ayounsi@cumin1003" [10:02:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1334 - ayounsi@cumin1003" [10:02:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:02:59] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1334.eqiad.wmnet 192.48.64.10.in-addr.arpa 2.9.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:03:02] 
!log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1334.eqiad.wmnet 192.48.64.10.in-addr.arpa 2.9.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:03:03] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1334 [10:03:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1334 [10:03:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1334 [10:08:00] (03CR) 10Marostegui: [C:03+1] mariadb::ferm: Rewrite ferm::rule as firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1265374 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [10:15:57] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1334.eqiad.wmnet with reason: host reimage [10:17:27] (03CR) 10Arnaudb: [C:03+2] gerrit: prevent crawling patches/archive files [puppet] - 10https://gerrit.wikimedia.org/r/1264733 (owner: 10Hashar) [10:17:57] (03PS2) 10Muehlenhoff: Migrate swift-rsync to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242430 [10:19:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1334.eqiad.wmnet with reason: host reimage [10:21:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [10:21:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T419635)', diff saved to https://phabricator.wikimedia.org/P90036 and previous config saved to /var/cache/conftool/dbconfig/20260331-102112-fceratto.json [10:21:19] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:21:46] (03PS13) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [10:24:41] (03CR) 10Muehlenhoff: [C:03+2] Migrate swift-rsync to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1242430 (owner: 10Muehlenhoff) [10:26:22] (03PS1) 10Atsuko: admin/data: added atsuko to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1265383 (https://phabricator.wikimedia.org/T421860) [10:26:24] 06SRE, 10SRE-Access-Requests: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11771723 (10Reedy) [10:26:43] FIRING: [3x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:27:19] (03CR) 10CI reject: [V:04-1] admin/data: added atsuko to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1265383 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [10:27:19] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [10:27:51] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11771737 (10atsuko) [10:28:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265367 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [10:28:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/OAuth] (wmf/1.46.0-wmf.22) - 
10https://gerrit.wikimedia.org/r/1265368 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [10:28:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265369 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [10:29:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [10:30:14] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11771747 (10BTullis) 05Open→03Resolved I believe that this is all ready to go now. I'll resolve the ticket, but ple... 
[10:31:09] (03CR) 10Ladsgroup: [C:03+1] mailman: disable web posting [puppet] - 10https://gerrit.wikimedia.org/r/1248895 (https://phabricator.wikimedia.org/T386559) (owner: 10JHathaway) [10:31:24] (03CR) 10Vgutierrez: [C:04-1] hiera: upgrade haproxy to version 3.2 on drmrs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [10:31:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:31:43] FIRING: [18x] NodeTextfileStale: Stale textfile for wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:33:04] (03CR) 10Arnaudb: [C:03+2] mailman: disable web posting [puppet] - 10https://gerrit.wikimedia.org/r/1248895 (https://phabricator.wikimedia.org/T386559) (owner: 10JHathaway) [10:33:40] (03PS1) 10Ayounsi: k8s4_in / k8s6_in - add missing policy-result: accept [homer/public] - 10https://gerrit.wikimedia.org/r/1265386 [10:33:55] jouncebot: nowandnext [10:33:55] For the next 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1000) [10:33:55] In 1 hour(s) and 26 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1200) [10:35:01] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: X-spam-score header missing on obvious spam delivered to multiple Mailman3 lists via HyperKitty web ui - https://phabricator.wikimedia.org/T386559#11771777 (10Ladsgroup) I asked Arnaud to merge the patch. If anyone has any obje... 
[10:35:35] (03CR) 10Cathal Mooney: [C:03+1] k8s4_in / k8s6_in - add missing policy-result: accept [homer/public] - 10https://gerrit.wikimedia.org/r/1265386 (owner: 10Ayounsi) [10:36:02] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1334.eqiad.wmnet with OS trixie [10:36:43] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:37:43] FIRING: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:38:02] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1362.eqiad.wmnet with OS trixie [10:38:19] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1362 [10:38:27] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [10:38:31] (03PS2) 10JMeybohm: k8s4_in / k8s6_in - add missing policy-result: accept [homer/public] - 10https://gerrit.wikimedia.org/r/1265386 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [10:39:40] (03CR) 10JMeybohm: [C:03+1] "Thanks! 
❤️" [homer/public] - 10https://gerrit.wikimedia.org/r/1265386 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [10:40:57] (03CR) 10Ayounsi: [C:03+2] k8s4_in / k8s6_in - add missing policy-result: accept [homer/public] - 10https://gerrit.wikimedia.org/r/1265386 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [10:42:21] (03Merged) 10jenkins-bot: k8s4_in / k8s6_in - add missing policy-result: accept [homer/public] - 10https://gerrit.wikimedia.org/r/1265386 (https://phabricator.wikimedia.org/T417817) (owner: 10Ayounsi) [10:42:55] (03CR) 10Btullis: admin/data: added atsuko to ops-limited (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1265383 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [10:43:20] (03PS9) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [10:43:20] (03PS9) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [10:44:07] ayounsi@cumin1003 reimage (PID 222719) is awaiting input [10:44:12] (03PS10) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [10:44:12] (03PS10) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [10:44:31] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:44:36] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:47:02] (03PS2) 10Atsuko: admin/data: added atsuko to DPE SRE groups [puppet] - 10https://gerrit.wikimedia.org/r/1265383 
(https://phabricator.wikimedia.org/T421860) [10:47:12] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11771812 (10BTullis) [10:47:26] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863 (10MoritzMuehlenhoff) 03NEW [10:47:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11771826 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:47:57] (03PS11) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [10:47:57] (03PS11) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [10:48:16] !log rename table global_block_whitelist on s3 and s5 for closed.dblist wikis T420525 [10:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:21] T420525: Drop global_block_whitelist from closed wikis - https://phabricator.wikimedia.org/T420525 [10:48:24] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1362 - ayounsi@cumin1003" [10:48:30] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1362 - ayounsi@cumin1003" [10:48:30] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:48:30] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1362.eqiad.wmnet 134.32.64.10.in-addr.arpa 4.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:48:34] !log 
ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1362.eqiad.wmnet 134.32.64.10.in-addr.arpa 4.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:48:34] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1362 [10:48:43] (03CR) 10Atsuko: "fixed, added the groups as per conversation with the team" [puppet] - 10https://gerrit.wikimedia.org/r/1265383 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [10:49:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1362 [10:49:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1362 [10:50:56] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11771833 (10atsuko) [10:52:36] (03PS4) 10Fabfur: hiera: upgrade haproxy to version 3.2 on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) [10:52:45] (03CR) 10Fabfur: hiera: upgrade haproxy to version 3.2 on drmrs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [10:53:18] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11771846 (10MoritzMuehlenhoff) [10:53:38] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [10:56:43] !log tappof@cumin1003 START - Cookbook sre.hosts.reboot-single for host titan2001.codfw.wmnet [10:58:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm [10:58:09] (03CR) 10Vgutierrez: [C:03+1] hiera: upgrade haproxy to version 3.2 on drmrs [puppet] - 
10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [10:58:35] !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host dse-k8s-worker1003 [11:01:23] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1362.eqiad.wmnet with reason: host reimage [11:01:38] btullis@cumin1003 reimage (PID 243724) is awaiting input [11:01:40] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [11:01:50] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v6.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/1265379 (owner: 10Volans) [11:03:16] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265388 [11:03:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11771917 (10atsuko) [11:03:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11771918 (10BTullis) [11:03:32] (03CR) 10CI reject: [V:04-1] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265388 (owner: 10Muehlenhoff) [11:04:13] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-rule in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:04:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11771920 (10BTullis) [11:04:44] !log tappof@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2001.codfw.wmnet [11:04:47] (03PS3) 10Atsuko: admin/data: added atsuko to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1265383 (https://phabricator.wikimedia.org/T421860) [11:06:22] 06SRE: IP 
Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11771936 (10Alberto) Hi @hnowlan, Regarding the errors: my Plesk logs only show 404 errors when MediaWiki tries to fetch the file metadata from Commons. I don't see explicit 429 errors bec... [11:07:18] btullis@cumin1003 reimage (PID 243724) is awaiting input [11:08:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1362.eqiad.wmnet with reason: host reimage [11:08:55] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11771960 (10BTullis) I approve membership of `analytics-admin` We require: * approval from either @mark or @Kappakayala for membership of `ops` * approval from @thcipriani for mem... [11:09:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job thanos-rule in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:39] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host dse-k8s-worker1003 - btullis@cumin1003" [11:10:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host dse-k8s-worker1003 - btullis@cumin1003" [11:10:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:10:45] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1003.eqiad.wmnet 178.32.64.10.in-addr.arpa 8.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:10:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
dse-k8s-worker1003.eqiad.wmnet 178.32.64.10.in-addr.arpa 8.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:10:49] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1003 [11:12:41] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:12:47] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:13:14] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:13:21] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:16:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1003 [11:16:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host dse-k8s-worker1003 [11:16:52] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v6.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/1265379 (owner: 10Volans) [11:22:53] !log installing gnupg2 security updates [11:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:46] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1362.eqiad.wmnet with OS trixie [11:26:43] (03CR) 10Brouberol: [C:03+1] admin/data: added atsuko to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1265383 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [11:30:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1265383 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [11:32:31] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
dse-k8s-worker1003.eqiad.wmnet with reason: host reimage [11:33:27] (03CR) 10Gmodena: [C:03+1] "This look really neat." [puppet] - 10https://gerrit.wikimedia.org/r/1262510 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [11:33:45] 06SRE, 06ServiceOps new, 07Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? - https://phabricator.wikimedia.org/T421168#11772062 (10LucasWerkmeister) I don’t know how to answer that question beyond the IRC log link that’s already in the task descri... [11:33:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [11:34:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2148 (T419635)', diff saved to https://phabricator.wikimedia.org/P90042 and previous config saved to /var/cache/conftool/dbconfig/20260331-113407-fceratto.json [11:34:14] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:37:10] !log upgrade Envoy on IDM to 1.35.9 T419637 [11:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:16] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [11:38:59] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage [11:39:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:40:25] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11772073 (10brouberol) @rkemper I can't seem to be able to run puppet on this host: ` Error: Could not retrieve catalog from remote server: Error 500 o... 
[11:42:08] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11772078 (10brouberol) Seems like `/dev/sdk` is having some issues: ` brouberol@an-worker1148:~$ sudo dmesg | grep sdk [ 9.359370] sd 0:2:11:0: [sdk... [11:42:12] jouncebot: nowandnext [11:42:12] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [11:42:12] In 0 hour(s) and 17 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1200) [11:42:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264581 (owner: 10Ladsgroup) [11:43:49] (03Merged) 10jenkins-bot: Switch from InterwikiSortingPrepend to the ULS config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264581 (owner: 10Ladsgroup) [11:44:31] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1264581|Switch from InterwikiSortingPrepend to the ULS config]] [11:48:59] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1264581|Switch from InterwikiSortingPrepend to the ULS config]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:49:30] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11772095 (10brouberol) All disks are reported healthy by SMART: ` brouberol@an-worker1148:~$ sudo smart-data-dump --debug 2>&1 | grep healthy ... # HEL... 
[11:50:11] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [11:50:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T419635)', diff saved to https://phabricator.wikimedia.org/P90044 and previous config saved to /var/cache/conftool/dbconfig/20260331-115016-fceratto.json [11:50:22] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:57:40] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11772154 (10brouberol) {F74676065} {F74676138} Seems like all disks are healthy, but one of them isn't online. [11:57:50] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264581|Switch from InterwikiSortingPrepend to the ULS config]] (duration: 13m 19s) [11:58:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm [11:59:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) (owner: 10Ladsgroup) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1200) [12:00:25] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P90046 and previous config saved to /var/cache/conftool/dbconfig/20260331-120024-fceratto.json [12:00:39] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11772181 (10brouberol) Oh and something I overlooked in https://phabricator.wikimedia.org/T411919#11772073: we're back to having the device names and t... 
[12:01:09] (03Merged) 10jenkins-bot: Remove VP8 from transcoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) (owner: 10Ladsgroup) [12:01:31] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1254955|Remove VP8 from transcoding (T413031)]] [12:01:37] T413031: Reduce TimedMediaHandler VP9 transcode resolution steps - https://phabricator.wikimedia.org/T413031 [12:02:12] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11772195 (10brouberol) The fstab seems to be correct though. ` brouberol@an-worker1148:~$ cat /etc/fstab | grep LABEL=hadoop | grep -v '#' LABEL=hadoo... [12:03:13] (03PS1) 10Volans: Upstream release v6.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1265407 [12:03:29] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1254955|Remove VP8 from transcoding (T413031)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:07:53] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [12:10:33] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P90048 and previous config saved to /var/cache/conftool/dbconfig/20260331-121032-fceratto.json [12:12:02] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1254955|Remove VP8 from transcoding (T413031)]] (duration: 10m 31s) [12:12:09] T413031: Reduce TimedMediaHandler VP9 transcode resolution steps - https://phabricator.wikimedia.org/T413031 [12:16:30] (03CR) 10Tiziano Fogli: [C:03+2] titan/memcached: double memcached size [puppet] - 10https://gerrit.wikimedia.org/r/1256395 (https://phabricator.wikimedia.org/T417336) (owner: 10Tiziano Fogli) [12:17:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1023.eqiad.wmnet with reason: Downgrade to 10.11.13 [12:19:37] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:20:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T419635)', diff saved to https://phabricator.wikimedia.org/P90049 and previous config saved to /var/cache/conftool/dbconfig/20260331-122041-fceratto.json [12:20:48] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:20:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:20:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [12:21:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2175 (T419635)', diff saved to https://phabricator.wikimedia.org/P90050 and previous config saved to /var/cache/conftool/dbconfig/20260331-122106-fceratto.json [12:27:06] (03CR) 10Volans: [C:03+2] Upstream release v6.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1265407 (owner: 10Volans) [12:34:50] !log upgrading drmrs to haproxy 3.2 (T421402) [12:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:56] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [12:36:27] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11772348 (10Gehel) Approved from my side. [12:36:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T419635)', diff saved to https://phabricator.wikimedia.org/P90054 and previous config saved to /var/cache/conftool/dbconfig/20260331-123653-fceratto.json [12:36:59] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:37:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11772351 (10atsuko) [12:38:39] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access for atsuko - https://phabricator.wikimedia.org/T421860#11772370 (10atsuko) a:03Kappakayala [12:40:31] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11772393 (10BTullis) [12:40:35] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11772395 (10taavi) `deployment` is already included 
with `ops` membership, it doesn't need to be requested or added separately. [12:41:37] (03CR) 10Daniel Kinzler: [C:03+1] Set a JWT cookie for OAuth 1 and OAuth 2 owner-only requests [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265367 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [12:41:56] (03CR) 10Daniel Kinzler: [C:03+1] tests: OAuth1 and OAuth2 owner-only JWT support [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265368 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [12:42:02] (03CR) 10Daniel Kinzler: [C:03+1] tests: Add test for asserting JWT cookie not set for OAuth2 consumers [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265369 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [12:43:01] (03CR) 10Daniel Kinzler: [C:03+1] Enable JWTs for OAuth1 consumers and OAuth2 owner-only consumers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [12:43:40] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11772427 (10atsuko) [12:43:54] (03Merged) 10jenkins-bot: Upstream release v6.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1265407 (owner: 10Volans) [12:44:03] (03PS6) 10Fabfur: hiera: upgrade haproxy to version 3.2 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1262060 (https://phabricator.wikimedia.org/T421402) [12:45:47] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11772430 (10Ladsgroup) And poster of videos have been broken now. I fixed them in https://gerrit.wikimedia.org/r/c/mediawiki/extensio... 
[12:46:39] (03CR) 10Elukey: "I don't no, please use the one that you think works best :)" [alerts] - 10https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [12:47:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P90055 and previous config saved to /var/cache/conftool/dbconfig/20260331-124701-fceratto.json [12:49:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:15] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs - 3.2 upgrade (T421402) [12:51:19] (03CR) 10Elukey: "I checked the Burrow's docs and there is the possibility of running multiple instances coordinated by zookeeper, that may be a valid alter" [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [12:51:21] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [12:54:07] PROBLEM - Ensure acme-chief-api is running on acmechief2002 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief [12:54:22] uh? 
[12:55:07] RECOVERY - Ensure acme-chief-api is running on acmechief2002 is OK: PROCS OK: 1 process with args /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/acme-chief.ini https://wikitech.wikimedia.org/wiki/Acme-chief [12:55:41] ` Active: active (running) since Thu 2026-03-19 15:33:30 UTC; 1 week 4 days ago` [12:57:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P90058 and previous config saved to /var/cache/conftool/dbconfig/20260331-125709-fceratto.json [12:59:20] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11772552 (10brouberol) I'm going to follow https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Swapping_broken_disk to conf... [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1300). [13:00:05] Raine and xSavitar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:21] o/ [13:00:31] I can self-serve when Raine is done. [13:00:32] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Excempt researcher from hyperkitty monthly export - https://phabricator.wikimedia.org/T385271#11772556 (10A_smart_kitten) >>! In T385271#10530855, @Ladsgroup wrote: > BTW this exists now, we should deploy it? https://gitlab.com/mailman/hyperkitt... [13:00:59] Oh, sorry, I forgot to unschedule it [13:01:07] I'm not going [13:01:17] Raine, okay! So should I go ahead? :) [13:01:26] Yes, go for it [13:01:31] Ack! Thanks! [13:01:37] I need to first fix an issue that popped up [13:01:52] Ack! 
[13:02:36] 06SRE, 07ci-test-error, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Kubernetes: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values - https://phabricator.wikimedia.org/T421362#11772560 (10Gehel) [13:03:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy1003 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265367 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [13:03:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy1003 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265368 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [13:03:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy1003 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265369 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [13:03:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:36] (03Merged) 10jenkins-bot: Set a JWT cookie for OAuth 1 and OAuth 2 owner-only requests [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265367 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [13:04:38] (03Merged) 10jenkins-bot: tests: OAuth1 and OAuth2 owner-only JWT support [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265368 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [13:04:41] (03Merged) 10jenkins-bot: tests: Add test for 
asserting JWT cookie not set for OAuth2 consumers [extensions/OAuth] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265369 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [13:06:01] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11772577 (10Ladsgroup) ugh, I mistook normaliseParams with getSteppedThumbWidth [13:07:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T419635)', diff saved to https://phabricator.wikimedia.org/P90063 and previous config saved to /var/cache/conftool/dbconfig/20260331-130717-fceratto.json [13:07:23] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance [13:07:23] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:07:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2189 (T419635)', diff saved to https://phabricator.wikimedia.org/P90064 and previous config saved to /var/cache/conftool/dbconfig/20260331-130731-fceratto.json [13:08:27] (03CR) 10D3r1ck01: [C:03+2] Enable JWTs for OAuth1 consumers and OAuth2 owner-only consumers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [13:09:33] (03Merged) 10jenkins-bot: Enable JWTs for OAuth1 consumers and OAuth2 owner-only consumers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [13:09:55] (03CR) 10D3r1ck01: "Not sure why it got stuck (after the patches it depended on all got merged), kicked it by hitting +2." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833) (owner: 10D3r1ck01) [13:10:02] !log derick@deploy1003 Started scap sync-world: Backport for [[gerrit:1265367|Set a JWT cookie for OAuth 1 and OAuth 2 owner-only requests (T417833)]], [[gerrit:1265368|tests: OAuth1 and OAuth2 owner-only JWT support (T417833 T415281)]], [[gerrit:1265369|tests: Add test for asserting JWT cookie not set for OAuth2 consumers (T417833 T415281)]], [[gerrit:1260006|Enable JWTs for OAuth1 consumers and OAuth2 owner-only consume [13:10:02] rs (T417833)]] [13:10:09] T417833: Set a JWT cookie for OAuth 1 requests and OAuth 2 owner-only requests - https://phabricator.wikimedia.org/T417833 [13:10:10] T415281: [EPIC] OAuth extension critical workflows (for automated tests enhancement) - https://phabricator.wikimedia.org/T415281 [13:10:22] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕘☕ sudo cumin A:cp-eqiad 'apt install lua5.4-ciderbloom-dbgsym' [13:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:18] (03CR) 10Elukey: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [13:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:16:09] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11772635 (10Kappakayala) approving ops membership for @atsuko [13:17:37] PROBLEM - MariaDB read only s3 on clouddb1023 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only 
[13:17:39] PROBLEM - MariaDB read only wikireplica-s3 on clouddb1023 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:17:55] PROBLEM - MegaRAID on an-worker1148 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:17:58] ACKNOWLEDGEMENT - MegaRAID on an-worker1148 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T421892 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:17:58] PROBLEM - MariaDB Replica IO: s3 on clouddb1023 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:17:58] PROBLEM - MariaDB Replica SQL: s3 on clouddb1023 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:17:58] PROBLEM - MariaDB Replica Lag: s3 on clouddb1023 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:00] PROBLEM - mysqld processes on clouddb1023 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:18:03] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1148 - https://phabricator.wikimedia.org/T421892 (10ops-monitoring-bot) 03NEW [13:21:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T419635)', diff saved to https://phabricator.wikimedia.org/P90065 and previous config saved to /var/cache/conftool/dbconfig/20260331-132121-fceratto.json [13:21:27] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:22:09] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for 
an-worker1148 - https://phabricator.wikimedia.org/T411919#11772677 (10brouberol) Puppet failed with ` Error: Failed to set group to '903': Read-only file system @ apply2files - /var/lib/hadoop/data/d Error: /... [13:23:05] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1363.eqiad.wmnet with OS trixie [13:23:33] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1363 [13:24:22] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:25:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:26:35] ayounsi@cumin1003 reimage (PID 395777) is awaiting input [13:28:37] !log derick@deploy1003 derick, d3r1ck01: Backport for [[gerrit:1265367|Set a JWT cookie for OAuth 1 and OAuth 2 owner-only requests (T417833)]], [[gerrit:1265368|tests: OAuth1 and OAuth2 owner-only JWT support (T417833 T415281)]], [[gerrit:1265369|tests: Add test for asserting JWT cookie not set for OAuth2 consumers (T417833 T415281)]], [[gerrit:1260006|Enable JWTs for OAuth1 consumers and OAuth2 owner-only consumers (T41 [13:28:37] 7833)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:28:45] T417833: Set a JWT cookie for OAuth 1 requests and OAuth 2 owner-only requests - https://phabricator.wikimedia.org/T417833 [13:28:46] T415281: [EPIC] OAuth extension critical workflows (for automated tests enhancement) - https://phabricator.wikimedia.org/T415281 [13:28:58] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11772721 (10brouberol) I'm seeing ` Fault detected on drive 1 in disk drive bay 1. Tue Mar 31 2026 12:56:39 ` in the IDRAC UI, which maps to ~1min aft... [13:30:24] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11772737 (10BTullis) Agreed. I'm happy to decom this server. As per the original description, this drive bay keeps connecting and dropping. It costs us... [13:31:12] (03CR) 10Btullis: [C:03+2] Deploy Videoplay Endpoint to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264680 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [13:31:14] !log derick@deploy1003 derick, d3r1ck01: Continuing with sync [13:31:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P90066 and previous config saved to /var/cache/conftool/dbconfig/20260331-133129-fceratto.json [13:32:53] (03PS2) 10Elukey: opensearch-semantic-search-test: Add to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [13:32:53] (03PS1) 10Elukey: profile::service_proxy::envoy: remove mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468) [13:33:17] (03Merged) 10jenkins-bot: Deploy Videoplay Endpoint to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264680 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) 
[13:34:49] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11772749 (10atsuko) p:05Triage→03Medium a:05Kappakayala→03brouberol [13:38:16] jouncebot: nowandnext [13:38:16] For the next 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1300) [13:38:16] In 0 hour(s) and 21 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1400) [13:38:57] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [13:39:18] xSavitar: Hi, let me know once you're done [13:39:29] Amir1, sure, I'll remind you. [13:39:35] Thanks <3 [13:39:44] (03PS1) 10Brouberol: hadoop/analytics: exclude an-worker1148.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1265421 (https://phabricator.wikimedia.org/T411919) [13:40:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11772807 (10atsuko) a:05brouberol→03None [13:40:25] (03CR) 10Brouberol: [C:03+2] admin/data: added atsuko to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1265383 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [13:41:36] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-build1001.eqiad.wmnet [13:41:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P90067 and previous config saved to /var/cache/conftool/dbconfig/20260331-134137-fceratto.json [13:42:12] !log push pfw policies - T421895 [13:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:43] !log derick@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265367|Set a JWT cookie for OAuth 1 and OAuth 2 owner-only requests 
(T417833)]], [[gerrit:1265368|tests: OAuth1 and OAuth2 owner-only JWT support (T417833 T415281)]], [[gerrit:1265369|tests: Add test for asserting JWT cookie not set for OAuth2 consumers (T417833 T415281)]], [[gerrit:1260006|Enable JWTs for OAuth1 consumers and OAuth2 owner-only consum [13:43:43] ers (T417833)]] (duration: 33m 41s) [13:43:45] (03CR) 10Elukey: opensearch-semantic-search-test: Add to services proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [13:43:51] T417833: Set a JWT cookie for OAuth 1 requests and OAuth 2 owner-only requests - https://phabricator.wikimedia.org/T417833 [13:43:51] T415281: [EPIC] OAuth extension critical workflows (for automated tests enhancement) - https://phabricator.wikimedia.org/T415281 [13:44:15] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs - 3.2 upgrade (T421402) [13:44:21] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [13:44:39] ayounsi@cumin1003 reimage (PID 395777) is awaiting input [13:44:48] thanks [13:45:40] Amir1, I'm done. 
Over to you 🙏🏽 [13:45:50] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1363 - ayounsi@cumin1003" [13:45:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1363 - ayounsi@cumin1003" [13:45:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:45:56] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1363.eqiad.wmnet 135.32.64.10.in-addr.arpa 5.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:46:00] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1363.eqiad.wmnet 135.32.64.10.in-addr.arpa 5.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:46:01] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1363 [13:46:01] Thanks! 
[13:46:18] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1363 [13:46:18] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1363 [13:46:37] (03PS1) 10Ladsgroup: maintenance: Introduce reconcileTables [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265423 (https://phabricator.wikimedia.org/T410145) [13:46:47] (03CR) 10Ladsgroup: [C:03+2] maintenance: Introduce reconcileTables [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265423 (https://phabricator.wikimedia.org/T410145) (owner: 10Ladsgroup) [13:47:21] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-build1001.eqiad.wmnet [13:48:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265423 (https://phabricator.wikimedia.org/T410145) (owner: 10Ladsgroup) [13:48:48] (03CR) 10Btullis: [C:03+1] hadoop/analytics: exclude an-worker1148.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1265421 (https://phabricator.wikimedia.org/T411919) (owner: 10Brouberol) [13:49:38] RECOVERY - MariaDB read only s3 on clouddb1023 is OK: Version 10.11.13-MariaDB, Uptime 30s, read_only: True, event_scheduler: False, 10709.04 QPS, connection latency: 0.026265s, query latency: 0.000601s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:49:42] RECOVERY - MariaDB read only wikireplica-s3 on clouddb1023 is OK: Version 10.11.13-MariaDB, Uptime 33s, read_only: True, event_scheduler: False, 13184.78 QPS, connection latency: 0.025419s, query latency: 0.000523s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:49:54] RECOVERY - MariaDB Replica SQL: s3 on clouddb1023 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:49:54] RECOVERY - MariaDB Replica IO: s3 on clouddb1023 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:50:00] RECOVERY - mysqld processes on clouddb1023 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:50:04] (03PS12) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [13:50:04] (03PS12) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [13:51:46] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T419635)', diff saved to https://phabricator.wikimedia.org/P90068 and previous config saved to /var/cache/conftool/dbconfig/20260331-135145-fceratto.json [13:51:49] (03CR) 10BPirkle: [C:03+1] REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic) [13:51:51] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:51:54] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs - 3.2 upgrade (T421402) [13:51:55] RECOVERY - MariaDB Replica Lag: s3 on clouddb1023 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:51:59] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [13:52:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with 
reason: Maintenance [13:52:03] (03PS13) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [13:52:03] (03PS13) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [13:52:28] (03PS1) 10Filippo Giunchedi: openstack: add lock_path for trove-guestagent [puppet] - 10https://gerrit.wikimedia.org/r/1265425 (https://phabricator.wikimedia.org/T421857) [13:52:38] (03PS1) 10Elukey: Upgrade aux-k8s-codfw to k8s 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1265426 (https://phabricator.wikimedia.org/T414486) [13:52:50] (03CR) 10Brouberol: [C:03+2] hadoop/analytics: exclude an-worker1148.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1265421 (https://phabricator.wikimedia.org/T411919) (owner: 10Brouberol) [13:53:03] (03PS14) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [13:53:03] (03PS14) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [13:53:50] (03PS15) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [13:53:50] (03PS15) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [13:54:52] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [13:55:42] (03PS1) 10Elukey: admin_ng: upgrade aux-k8s-codfw to k8s 1.31 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1265427 (https://phabricator.wikimedia.org/T414486) [13:55:47] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm [13:56:02] (03PS7) 10Fabfur: hiera: upgrade haproxy to version 3.2 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1262060 (https://phabricator.wikimedia.org/T421402) [13:56:15] !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host dse-k8s-worker1004 [13:58:08] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1363.eqiad.wmnet with reason: host reimage [13:59:18] btullis@cumin1003 reimage (PID 424809) is awaiting input [13:59:19] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:00:04] (03Merged) 10jenkins-bot: maintenance: Introduce reconcileTables [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265423 (https://phabricator.wikimedia.org/T410145) (owner: 10Ladsgroup) [14:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1400) [14:00:56] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1265423|maintenance: Introduce reconcileTables (T410145 T408137)]] [14:01:04] T410145: Build a way to duplicate a whole wiki from one section to another - https://phabricator.wikimedia.org/T410145 [14:01:04] T408137: Build a write duplicator system in mediawiki core - https://phabricator.wikimedia.org/T408137 [14:02:03] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [14:03:18] (03PS16) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [14:03:18] (03PS16) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) 
[14:03:18] (03PS1) 10Tiziano Fogli: thanos/compact: assign prometheus instances to compactors [puppet] - 10https://gerrit.wikimedia.org/r/1265429 (https://phabricator.wikimedia.org/T386911) [14:03:39] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1265429 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [14:04:17] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1363.eqiad.wmnet with reason: host reimage [14:05:00] btullis@cumin1003 reimage (PID 424809) is awaiting input [14:05:27] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1265423|maintenance: Introduce reconcileTables (T410145 T408137)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:05:54] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance [14:06:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2204 (T419635)', diff saved to https://phabricator.wikimedia.org/P90070 and previous config saved to /var/cache/conftool/dbconfig/20260331-140602-fceratto.json [14:06:07] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:06:08] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [14:07:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T419635)', diff saved to https://phabricator.wikimedia.org/P90071 and previous config saved to /var/cache/conftool/dbconfig/20260331-140727-fceratto.json [14:10:33] (03CR) 10Herron: "Oh interesting! Would that call for 2x kafkamon per-site and some kind of pool/depool state between them? 
Assuming that's the case, I'm " [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [14:12:08] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [14:12:14] (03PS1) 10Btullis: Update pod security standards for dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) [14:12:31] (03CR) 10Tiziano Fogli: "The previous version did not account for ruler-generated blocks; this one does." [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [14:12:58] (03CR) 10Brouberol: [C:03+2] envoy: remove mw-parsoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1262054 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli) [14:13:14] (03CR) 10CDanis: [C:03+1] envoy: remove mw-parsoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1262054 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli) [14:13:29] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265423|maintenance: Introduce reconcileTables (T410145 T408137)]] (duration: 12m 33s) [14:13:37] T410145: Build a way to duplicate a whole wiki from one section to another - https://phabricator.wikimedia.org/T410145 [14:13:37] T408137: Build a write duplicator system in mediawiki core - https://phabricator.wikimedia.org/T408137 [14:16:10] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host dse-k8s-worker1004 - btullis@cumin1003" [14:16:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host dse-k8s-worker1004 - btullis@cumin1003" [14:16:16] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [14:16:16] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1004.eqiad.wmnet 52.48.64.10.in-addr.arpa 2.5.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:16:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1004.eqiad.wmnet 52.48.64.10.in-addr.arpa 2.5.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:16:20] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1004 [14:16:23] (03PS1) 10Clare Ming: ConfigsFetcher: Increasing the cache version [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265436 (https://phabricator.wikimedia.org/T421828) [14:17:36] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P90072 and previous config saved to /var/cache/conftool/dbconfig/20260331-141735-fceratto.json [14:17:55] (03PS14) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [14:18:21] (03CR) 10Jforrester: REST: Publish ReadingLists v0 module in REST Sandbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic) [14:19:46] (03CR) 10Santiago Faci: [C:03+1] ConfigsFetcher: Increasing the cache version [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265436 (https://phabricator.wikimedia.org/T421828) (owner: 10Clare Ming) [14:20:03] am I ok to deploy a UBN fix to wmf.22? 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TestKitchen/+/1265436 [14:20:09] (03CR) 10Urbanecm: [C:03+1] "LGTM" [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265436 (https://phabricator.wikimedia.org/T421828) (owner: 10Clare Ming) [14:20:29] cjming: i see Amir1 was just deploying few minutes ago [14:20:32] (03CR) 10Eevans: [C:03+2] restbase,aqs: canary Cassandra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1264741 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [14:20:38] otherwise, LGTM [14:20:44] I'm done! [14:20:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1363.eqiad.wmnet with OS trixie [14:20:59] cool - scap backporting UBN fix now then [14:21:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265436 (https://phabricator.wikimedia.org/T421828) (owner: 10Clare Ming) [14:22:37] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1364.eqiad.wmnet with OS trixie [14:22:55] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1364 [14:23:01] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [14:23:46] (03CR) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [14:25:27] (03CR) 10Brouberol: [C:03+1] Update pod security standards for dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) (owner: 10Btullis) [14:26:05] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11773149 (10brouberol) [14:26:44] FIRING: [3x] NodeTextfileStale: Stale 
textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:27:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P90074 and previous config saved to /var/cache/conftool/dbconfig/20260331-142743-fceratto.json [14:27:55] (03Merged) 10jenkins-bot: ConfigsFetcher: Increasing the cache version [extensions/TestKitchen] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265436 (https://phabricator.wikimedia.org/T421828) (owner: 10Clare Ming) [14:28:19] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1265436|ConfigsFetcher: Increasing the cache version (T421828)]] [14:28:24] T421828: PHP Warning: Undefined array key "user_identifier_type" - https://phabricator.wikimedia.org/T421828 [14:28:42] ayounsi@cumin1003 reimage (PID 455107) is awaiting input [14:29:11] (03CR) 10Giuseppe Lavagetto: [C:03+1] Enable $wgTrackMediaRequestProvenance on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260029 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle) [14:29:50] (03CR) 10JMeybohm: [C:03+1] "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1430) [14:30:15] !log cjming@deploy1003 cjming: Backport for [[gerrit:1265436|ConfigsFetcher: Increasing the cache version (T421828)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:30:57] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1364 - ayounsi@cumin1003" [14:31:03] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1364 - ayounsi@cumin1003" [14:31:03] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:31:03] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1364.eqiad.wmnet 147.32.64.10.in-addr.arpa 7.4.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:31:07] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1364.eqiad.wmnet 147.32.64.10.in-addr.arpa 7.4.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:31:08] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1364 [14:31:26] (03CR) 10JavierMonton: [V:03+1 C:03+1] stream: mediawiki.page_edit_type_simple.dev1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261695 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun) [14:31:30] testing [14:31:43] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:31:48] FIRING: [18x] NodeTextfileStale: Stale textfile for wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:32:21] !log cjming@deploy1003 cjming: Continuing with sync 
[14:33:11] (03CR) 10Brouberol: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [14:33:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1004 [14:33:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host dse-k8s-worker1004 [14:33:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1364 [14:33:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1364 [14:36:31] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265436|ConfigsFetcher: Increasing the cache version (T421828)]] (duration: 08m 12s) [14:36:37] T421828: PHP Warning: Undefined array key "user_identifier_type" - https://phabricator.wikimedia.org/T421828 [14:36:43] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:36:58] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:37:17] (03CR) 10JMeybohm: [C:04-1] Update pod security standards for dse-k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) (owner: 10Btullis) [14:37:44] FIRING: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:37:52] !log fceratto@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db2204 (T419635)', diff saved to https://phabricator.wikimedia.org/P90075 and previous config saved to /var/cache/conftool/dbconfig/20260331-143751-fceratto.json [14:37:58] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:38:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2225.codfw.wmnet with reason: Maintenance [14:38:16] (03CR) 10Brouberol: "> I'm a bit inclined to put the config/platforming effort towards burrow on k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [14:38:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2225 (T419635)', diff saved to https://phabricator.wikimedia.org/P90076 and previous config saved to /var/cache/conftool/dbconfig/20260331-143816-fceratto.json [14:38:21] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [14:39:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 104139224 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:40:01] (03PS2) 10Btullis: Update pod security standards for dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) [14:40:53] (03CR) 10Btullis: Update pod security standards for dse-k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) (owner: 10Btullis) [14:41:03] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs - 3.2 upgrade (T421402) [14:41:09] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [14:41:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: 
POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3698352 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:42:39] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:44:00] cjming: ty for the blocker backport [14:44:17] I'm going to deploy the train in the next few minutes if there are not objections [14:44:52] jnuche: thanks for your patience - relieved it stopped the bleeding [14:45:44] (03PS1) 10Effie Mouzeli: restbase: remove mw-parsoid hardcoded listener [puppet] - 10https://gerrit.wikimedia.org/r/1265441 [14:46:04] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1364.eqiad.wmnet with reason: host reimage [14:46:14] (03CR) 10CI reject: [V:04-1] restbase: remove mw-parsoid hardcoded listener [puppet] - 10https://gerrit.wikimedia.org/r/1265441 (owner: 10Effie Mouzeli) [14:46:54] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:47:33] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262060 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [14:48:26] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [14:48:43] (03CR) 10CI reject: [V:04-1] Update pod security standards for dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) (owner: 10Btullis) [14:48:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11773297 (10ayounsi) For the RIPE atlas we will need to decom it and provision a new one on the future sandbox vlan as IPs will change. The good news is that the standard IP ranges we u... 
[14:48:53] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage [14:49:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11773316 (10ayounsi) [14:49:40] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1364.eqiad.wmnet with reason: host reimage [14:50:33] 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11773317 (10BTullis) [14:50:41] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:51:33] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:51:48] (03PS3) 10Btullis: Update pod security standards for dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) [14:51:57] !log creating links tables on x1 for testcommonswiki (T421914) [14:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:03] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [14:52:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T419635)', diff saved to https://phabricator.wikimedia.org/P90078 and previous config saved to /var/cache/conftool/dbconfig/20260331-145233-fceratto.json [14:52:40] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:53:09] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:53:15] (03PS2) 10Effie Mouzeli: restbase: remove mw-parsoid hardcoded listener [puppet] - 10https://gerrit.wikimedia.org/r/1265441 
[14:53:15] ok, train is going ahead [14:53:26] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265442 (https://phabricator.wikimedia.org/T420480) [14:53:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage [14:53:29] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265442 (https://phabricator.wikimedia.org/T420480) (owner: 10TrainBranchBot) [14:53:45] (03CR) 10CI reject: [V:04-1] restbase: remove mw-parsoid hardcoded listener [puppet] - 10https://gerrit.wikimedia.org/r/1265441 (owner: 10Effie Mouzeli) [14:54:35] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265442 (https://phabricator.wikimedia.org/T420480) (owner: 10TrainBranchBot) [14:59:24] (03CR) 10CI reject: [V:04-1] Update pod security standards for dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) (owner: 10Btullis) [15:00:05] jelto, arnoldokoth, mutante, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1500). [15:00:36] 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11773389 (10RobH) Please note 'S480845X3505676' is NOT a valid serial under Supermicro support, but S480845X4915849 is. I would suggest not changing anything... 
[15:00:43] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.22 refs T420480 [15:00:48] T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480 [15:01:00] (03PS1) 10Arnaudb: gerrit: update timeouts for gitiles [puppet] - 10https://gerrit.wikimedia.org/r/1265448 (https://phabricator.wikimedia.org/T421904) [15:02:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P90079 and previous config saved to /var/cache/conftool/dbconfig/20260331-150241-fceratto.json [15:05:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1364.eqiad.wmnet with OS trixie [15:06:38] (03PS1) 10Eevans: restbase: unbreak puppet; address missing mw-parsoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) [15:08:06] (03PS1) 10Ladsgroup: Enable links db split on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265451 (https://phabricator.wikimedia.org/T421914) [15:08:39] (03CR) 10CI reject: [V:04-1] restbase: unbreak puppet; address missing mw-parsoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) (owner: 10Eevans) [15:11:09] (03PS2) 10Ladsgroup: Enable links db split on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265451 (https://phabricator.wikimedia.org/T421914) [15:11:14] (03CR) 10Ladsgroup: [C:03+2] Enable links db split on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265451 (https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup) [15:11:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm [15:11:39] (03PS1) 10Brouberol: fixtures: remove mw-parsoid listener from fixtures after it's been dropped [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1265452 (https://phabricator.wikimedia.org/T420468) [15:11:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265451 (https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup) [15:12:15] (03PS2) 10Eevans: restbase: address missing mw-parsoid listener (unbreak puppet) [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) [15:12:21] (03Merged) 10jenkins-bot: Enable links db split on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265451 (https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup) [15:12:47] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1265451|Enable links db split on testcommonswiki (T421914)]] [15:12:51] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P90080 and previous config saved to /var/cache/conftool/dbconfig/20260331-151250-fceratto.json [15:12:53] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [15:14:16] (03CR) 10CI reject: [V:04-1] restbase: address missing mw-parsoid listener (unbreak puppet) [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) (owner: 10Eevans) [15:14:23] (03CR) 10CDanis: [C:03+1] fixtures: remove mw-parsoid listener from fixtures after it's been dropped [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265452 (https://phabricator.wikimedia.org/T420468) (owner: 10Brouberol) [15:14:35] (03PS1) 10Ayounsi: eqsin routed ganeti: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1265453 (https://phabricator.wikimedia.org/T421863) [15:14:38] (03CR) 10Brouberol: [C:03+2] fixtures: remove mw-parsoid listener from fixtures after it's been dropped [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1265452 (https://phabricator.wikimedia.org/T420468) (owner: 10Brouberol) [15:14:46] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1265451|Enable links db split on testcommonswiki (T421914)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:14:47] (03PS3) 10Eevans: restbase: address missing mw-parsoid listener (unbreak puppet) [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) [15:14:51] (03PS4) 10Btullis: Update pod security standards for dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) [15:17:25] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) (owner: 10Eevans) [15:17:50] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [15:22:02] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265451|Enable links db split on testcommonswiki (T421914)]] (duration: 09m 15s) [15:22:08] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [15:22:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T419635)', diff saved to https://phabricator.wikimedia.org/P90082 and previous config saved to /var/cache/conftool/dbconfig/20260331-152258-fceratto.json [15:23:04] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:23:15] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2226.codfw.wmnet with reason: Maintenance [15:23:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2226 (T419635)', diff saved to https://phabricator.wikimedia.org/P90083 and previous config saved to /var/cache/conftool/dbconfig/20260331-152323-fceratto.json 
[15:24:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:24:40] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1365.eqiad.wmnet with OS trixie [15:24:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T419635)', diff saved to https://phabricator.wikimedia.org/P90084 and previous config saved to /var/cache/conftool/dbconfig/20260331-152448-fceratto.json [15:25:10] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1365 [15:25:15] jouncebot: nowandnext [15:25:15] For the next 0 hour(s) and 34 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1500) [15:25:15] In 0 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1600) [15:25:15] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [15:25:16] (03PS4) 10Eevans: restbase: address missing mw-parsoid listener (unbreak puppet) [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) [15:27:31] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) (owner: 10Eevans) [15:30:39] (03PS1) 10Ayounsi: eqsin: add routed ganeti ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1265456 (https://phabricator.wikimedia.org/T421863) [15:30:58] ayounsi@cumin1003 reimage (PID 521928) is awaiting input [15:33:30] (03CR) 10JMeybohm: [C:03+1] Update pod security standards for dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) (owner: 10Btullis) 
[15:34:49] (03PS8) 10CDanis: Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) [15:34:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P90086 and previous config saved to /var/cache/conftool/dbconfig/20260331-153456-fceratto.json [15:34:57] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [15:36:58] (03CR) 10Muehlenhoff: [C:03+1] "This looks like it will work, but we should doublecheck in Pontoon" [puppet] - 10https://gerrit.wikimedia.org/r/1265382 (https://phabricator.wikimedia.org/T420993) (owner: 10Elukey) [15:38:56] (03CR) 10Tiziano Fogli: "You’re removing the test file for the alerts in puppet-ca.yaml but leaving the file with the rules definition." [alerts] - 10https://gerrit.wikimedia.org/r/1265191 (https://phabricator.wikimedia.org/T421517) (owner: 10Muehlenhoff) [15:39:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:41:55] (03PS5) 10Eevans: restbase: address missing mw-parsoid listener (unbreak puppet) [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) [15:42:36] (03PS9) 10CDanis: deployment_server: fundraising-data-uploader role [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) [15:42:43] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [15:42:43] (03CR) 10CI reject: [V:04-1] restbase: address missing mw-parsoid listener (unbreak puppet) [puppet] - 10https://gerrit.wikimedia.org/r/1265450 
(https://phabricator.wikimedia.org/T420468) (owner: 10Eevans) [15:43:05] (03PS1) 10Muehlenhoff: Record LDAP access for wmf-ldlulisa [puppet] - 10https://gerrit.wikimedia.org/r/1265461 [15:45:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P90087 and previous config saved to /var/cache/conftool/dbconfig/20260331-154504-fceratto.json [15:45:09] (03PS6) 10Eevans: restbase: address missing mw-parsoid listener (unbreak puppet) [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) [15:45:09] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1365 - ayounsi@cumin1003" [15:45:40] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for wmf-ldlulisa [puppet] - 10https://gerrit.wikimedia.org/r/1265461 (owner: 10Muehlenhoff) [15:46:21] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) (owner: 10Eevans) [15:48:14] ayounsi@cumin1003 reimage (PID 521928) is awaiting input [15:48:23] (03CR) 10Tiziano Fogli: "Do we have an idea of how many additional series this change will allow Prometheus to ingest? 
Are there any histograms that could be tuned" [puppet] - 10https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: 10JMeybohm) [15:48:53] (03CR) 10BPirkle: [C:03+1] REST: Publish ReadingLists v0 module in REST Sandbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic) [15:48:58] (03CR) 10Btullis: [C:03+2] Update pod security standards for dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) (owner: 10Btullis) [15:51:04] (03PS10) 10CDanis: deployment_server: fundraising-data-uploader role [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T421751) [15:52:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T421751) (owner: 10CDanis) [15:52:42] (03PS11) 10CDanis: deployment_server: fundraising-data-uploader role [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T421751) [15:53:52] (03PS7) 10Eevans: restbase: address missing mw-parsoid listener (unbreak puppet) [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) [15:54:15] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1265450 (https://phabricator.wikimedia.org/T420468) (owner: 10Eevans) [15:54:19] (03CR) 10Greg Grossmeier: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1265224 (https://phabricator.wikimedia.org/T421703) (owner: 10Muehlenhoff) [15:54:34] (03CR) 10CDanis: [C:03+2] deployment_server: fundraising-data-uploader role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T421751) (owner: 10CDanis) [15:54:38] (03PS1) 10Hashar: Revert "gerrit: forward Gitiles traffic to gerrit-replica" [puppet] - 10https://gerrit.wikimedia.org/r/1265465 (https://phabricator.wikimedia.org/T420595) [15:55:08] (03CR) 10Muehlenhoff: [C:03+2] Bitu: Add approval config for airflow-fr-tech-ops [puppet] - 10https://gerrit.wikimedia.org/r/1265224 (https://phabricator.wikimedia.org/T421703) (owner: 10Muehlenhoff) [15:55:13] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T419635)', diff saved to https://phabricator.wikimedia.org/P90088 and previous config saved to /var/cache/conftool/dbconfig/20260331-155512-fceratto.json [15:55:24] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [15:55:30] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2238.codfw.wmnet with reason: Maintenance [15:55:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2238 (T419635)', diff saved to https://phabricator.wikimedia.org/P90089 and previous config saved to /var/cache/conftool/dbconfig/20260331-155538-fceratto.json [15:55:42] (03PS1) 10Elukey: elasticsearch: fix test for non-utc timezones [software/spicerack] - 10https://gerrit.wikimedia.org/r/1265466 [15:56:58] (03Merged) 10jenkins-bot: Update pod security standards for dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265434 (https://phabricator.wikimedia.org/T419259) (owner: 10Btullis) [15:58:05] (03CR) 10Eevans: [C:03+2] restbase: address missing mw-parsoid listener (unbreak puppet) [puppet] - 10https://gerrit.wikimedia.org/r/1265450 
(https://phabricator.wikimedia.org/T420468) (owner: 10Eevans) [15:58:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by javiermonton@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261695 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun) [15:59:22] (03Merged) 10jenkins-bot: stream: mediawiki.page_edit_type_simple.dev1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261695 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun) [15:59:37] (03PS2) 10Hashar: Revert "gerrit: forward Gitiles traffic to gerrit-replica" [puppet] - 10https://gerrit.wikimedia.org/r/1265465 (https://phabricator.wikimedia.org/T420595) [15:59:43] (03CR) 10Tiziano Fogli: "I'm not sure if a persistent failure could be a critical scenario, but in that case, it might make sense to add a second alert with a long" [alerts] - 10https://gerrit.wikimedia.org/r/1225507 (owner: 10Gehel) [15:59:43] !log javiermonton@deploy1003 Started scap sync-world: Backport for [[gerrit:1261695|stream: mediawiki.page_edit_type_simple.dev1 (T421005)]] [15:59:49] T421005: Update edit-type flink job with new schema - https://phabricator.wikimedia.org/T421005 [16:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1600). [16:00:05] sfaci: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:35] sfaci: o/ looking [16:00:51] oh good, elukey already told you what I was about to :) [16:00:56] (03PS1) 10Isabelle Hurbain-Palatin: Enable legacy post-processing cache for DiscussionTools [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265467 (https://phabricator.wikimedia.org/T376183) [16:00:59] (03PS1) 10Isabelle Hurbain-Palatin: Actually enable parsoid postproc for all wikis (except enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265468 [16:02:04] !log javiermonton@deploy1003 akhatun, javiermonton: Backport for [[gerrit:1261695|stream: mediawiki.page_edit_type_simple.dev1 (T421005)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:02:57] sfaci: my recommendation is to actually not make this change -- I'm pretty sure that as-is, it would lose track of your SLO history (since you'd have a brand-new recording rule under the new name) and I don't think that's what you want [16:03:51] sfaci: normally my advice would be that we can work with observability to make this work without losing any data, *but* since we're about to move everybody to sloth and decommission the pyrra dashboards, how do you feel about just leaving it? [16:04:09] !log javiermonton@deploy1003 akhatun, javiermonton: Continuing with sync [16:04:52] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [16:06:29] RESOLVED: [2x] NodeTextfileStale: Stale textfile for wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:06:31] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[16:06:33] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:06:38] FIRING: [18x] NodeTextfileStale: Stale textfile for wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:07:09] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:07:28] RESOLVED: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:08:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264921 (https://phabricator.wikimedia.org/T421366) (owner: 10Catrope) [16:08:19] rzl: Sorry Ruben! We are dealing right now with an unbreaknow thing I forgot this. I didn't know what you mention. Can I discuss it with the team and let you know later? 
[16:08:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264922 (https://phabricator.wikimedia.org/T420007) (owner: 10Catrope) [16:08:24] !log javiermonton@deploy1003 Finished scap sync-world: Backport for [[gerrit:1261695|stream: mediawiki.page_edit_type_simple.dev1 (T421005)]] (duration: 08m 40s) [16:08:30] T421005: Update edit-type flink job with new schema - https://phabricator.wikimedia.org/T421005 [16:08:55] sfaci: sure :) I'll comment that on the patch instead [16:08:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 888580128 and 64 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:09:02] puppet window's complete, in that case [16:09:06] That's cool. Thank you very much Ruben [16:09:13] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T419635)', diff saved to https://phabricator.wikimedia.org/P90090 and previous config saved to /var/cache/conftool/dbconfig/20260331-160939-fceratto.json [16:09:46] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:10:20] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1365 - ayounsi@cumin1003" [16:10:20] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[16:10:20] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1365.eqiad.wmnet 206.48.64.10.in-addr.arpa 6.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:10:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1365.eqiad.wmnet 206.48.64.10.in-addr.arpa 6.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:10:25] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1365 [16:10:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1365 [16:10:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1365 [16:10:51] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [16:10:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2660664 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:11:13] (03PS2) 10AKhatun: stream: mediawiki.page_edit_type_simple.dev1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261706 (https://phabricator.wikimedia.org/T421005) [16:11:29] FIRING: [18x] NodeTextfileStale: Stale textfile for wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:11:29] FIRING: [6x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:11:46] !log rzl@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/mw-experimental: apply [16:12:02] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [16:12:50] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [16:13:15] (03CR) 10JavierMonton: [V:03+1 C:03+1] stream: mediawiki.page_edit_type_simple.dev1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261706 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun) [16:14:34] (03CR) 10AKhatun: [C:03+2] stream: mediawiki.page_edit_type_simple.dev1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261706 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun) [16:14:50] (03PS1) 10Elukey: First pass of ruff check --fix [software/spicerack] - 10https://gerrit.wikimedia.org/r/1265476 (https://phabricator.wikimedia.org/T420475) [16:14:58] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [16:15:00] !log eevans@cumin1003 START - Cookbook sre.misc-clusters.roll-restart-restbase rolling restart_daemons on A:restbase [16:16:37] (03Merged) 10jenkins-bot: stream: mediawiki.page_edit_type_simple.dev1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261706 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun) [16:19:14] (03PS1) 10Btullis: Update the container used for the dumps toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265478 (https://phabricator.wikimedia.org/T398436) [16:19:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P90091 and previous config saved to /var/cache/conftool/dbconfig/20260331-161947-fceratto.json [16:21:08] FIRING: [6x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:21:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11774108 (10Jgreen) passwords have been reset to the frack mgmt password [16:22:06] (03CR) 10CI reject: [V:04-1] First pass of ruff check --fix [software/spicerack] - 10https://gerrit.wikimedia.org/r/1265476 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [16:22:07] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [16:22:25] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [16:22:39] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1365.eqiad.wmnet with reason: host reimage [16:24:05] rzl: Could you help us with something? We are dealing with an unbreaknow thing on test wiki and we think the issue is somehow related to the cache and a maintenance job that fills that cache, whose version we have increased. It seems it's empty now. We were wondering whether we could run that maintenance job manually by passing "--wiki testwiki" (instead of aawiki) to make testwiki fill its cache [16:24:23] (03PS1) 10Jasmine: Add Kubernetes POD IP reverse range delegations for wikikube-ctrl200[4-5] [dns] - 10https://gerrit.wikimedia.org/r/1265480 (https://phabricator.wikimedia.org/T390861) [16:25:56] the maintenance job is configured to run with "--wiki aawiki" which is in group 2 and we were wondering whether that's why we have an empty cache right now. 
aawiki is still running a previous version of the code we are trying to deploy [16:26:04] (03PS1) 10LorenMora: Legal Footer Link Deploys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265482 (https://phabricator.wikimedia.org/T420348) [16:26:08] RESOLVED: [22x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:27:07] sfaci: I'm just about to go into meetings for a bit, so I can't start digging with you right now unfortunately. https://wikitech.wikimedia.org/wiki/Maintenance_scripts has the instructions on how to start a maintenance script by hand, if that's what you're asking for? [16:27:26] Yes, that's the thing we want to do [16:27:32] (03PS4) 10Trueg: wdqs-queryhammer: Deployment fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) [16:27:34] and also to understand how the --wiki parameter works [16:28:19] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp1114.eqiad.wmnet [reason: trixie reimaging] [16:29:15] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1365.eqiad.wmnet with reason: host reimage [16:29:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P90092 and previous config saved to /var/cache/conftool/dbconfig/20260331-162956-fceratto.json [16:31:40] !log eevans@cumin1003 END (PASS) - Cookbook sre.misc-clusters.roll-restart-restbase (exit_code=0) rolling restart_daemons on A:restbase [16:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:32] sfaci: did you get what you needed from https://wikitech.wikimedia.org/wiki/Maintenance_scripts? [16:37:19] we are wondering what's the effect of the --wiki parameter for a maintenance job. I think that's not explained there, right? [16:37:56] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs1010.eqiad.wmnet: Upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [16:38:02] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [16:39:42] sfaci: I might be misunderstanding the question. `--wiki` is a standard parameter of mediawiki maintenance scripts in general, rather than anything specific to the mwscript-k8s tool. could you point me to the job you're debugging? [16:39:51] (03CR) 10Vgutierrez: [C:03+2] sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [16:40:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T419635)', diff saved to https://phabricator.wikimedia.org/P90094 and previous config saved to /var/cache/conftool/dbconfig/20260331-164004-fceratto.json [16:40:12] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [16:42:53] (03CR) 10CDanis: [C:03+1] Revert "gerrit: forward Gitiles traffic to gerrit-replica" [puppet] - 10https://gerrit.wikimedia.org/r/1265465 (https://phabricator.wikimedia.org/T420595) (owner: 10Hashar) [16:43:47] (03CR) 10Ssingh: [C:03+2] Revert "gerrit: forward Gitiles traffic to gerrit-replica" [puppet] - 10https://gerrit.wikimedia.org/r/1265465 (https://phabricator.wikimedia.org/T420595) (owner: 10Hashar) [16:44:08] (03PS1) 10Muehlenhoff: bitu: Remove inactive approver [puppet] - 10https://gerrit.wikimedia.org/r/1265490 
[16:45:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1365.eqiad.wmnet with OS trixie [16:45:41] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs1010.eqiad.wmnet: Upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [16:45:47] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [16:46:35] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11774335 (10Jgreen) Password has been set to frack mgmt password for all three. [16:47:34] swfrench-wmf: that one: https://gerrit.wikimedia.org/g/operations/puppet/+/4ee785f4dc0b6955c13614039b8dff102cd224c1/modules/profile/manifests/mediawiki/maintenance/testkitchen.pp [16:48:15] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs2001.codfw.wmnet: Upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [16:51:44] !log installing Bind security updates (client-side tools and libs) [16:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:48] sfaci: alright, so if you need to run that script "as" testwiki (i.e., in order to get the mediawiki release version that testwiki maps to), you should be able to use something like `mwscript-k8s -f -- extensions/TestKitchen/maintenance/UpdateConfigs.php --wiki=testwiki` (that assumes you want to follow the output produced by the job) [16:52:19] (the parenthetical there refers to the `-f`) [16:53:25] however, I see that your periodic job runs every minute with aawiki ... would that just overwrite the result? [16:53:46] the new code that is running on testwiki increased the cache version to 2. Other wikis in Group 2 are still running the old one. Do you know whether something can be broken by running the maintenance job on testwiki? Will the group 2 wikis still use v1 keys after that? 
[16:54:23] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:54:29] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:55:57] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs2001.codfw.wmnet: Upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [16:56:02] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS trixie [16:56:05] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [16:58:05] !log joal@deploy1003 Started deploy [analytics/refinery@8d91f24] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8d91f242] [16:58:23] sfaci: so, I'm afraid that's very much an implementation-dependent question. if running the script using the code currently live in testwiki (.22) persists data that's not backward compatible (with .21), then presumably yes, that would cause problems you would need to solve some other way. e.g. backport compatibility fixes to .21 and then introduce the new config. [16:58:58] eevans@cumin1003 roll-restart (PID 610325) is awaiting input [16:59:17] in fact, if you backport this to .21, then you wouldn't need a special one-off maintenance script run, as the aawiki run would write the "new" data [16:59:20] will the .21 code use the v2 keys from the cache? 
[16:59:48] sfaci: alas, I do not know and you would need to consult with the authors of the code [16:59:59] !log joal@deploy1003 Finished deploy [analytics/refinery@8d91f24] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8d91f242] (duration: 01m 53s) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T1700) [17:00:23] !log joal@deploy1003 Started deploy [analytics/refinery@8d91f24]: Regular analytics weekly train [analytics/refinery@8d91f242] [17:00:26] we are the authors, just wondering how the maintenance job and the cache works [17:01:19] I mean, if there are v1 and v2 keys in a cache, any code will use the new ones (v2)? [17:01:28] (03PS3) 10LorenMora: Transition reading list experiment to instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259251 (https://phabricator.wikimedia.org/T421939) [17:02:26] ah, got it - maintenance scripts _are_ mediawiki, so there shouldn't really be a significant difference from how you would expect production mediawiki to behave in general [17:03:20] so, is this T421828? 
[17:03:21] T421828: PHP Warning: Undefined array key "user_identifier_type" - https://phabricator.wikimedia.org/T421828 [17:04:38] swfrench-wmf: yes, that's the issue [17:04:53] we need to "reload" the cache in the new code without breaking the old one [17:04:57] that's the thing we are trying to do [17:05:32] at least it's something we think can fix the thing [17:06:59] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:07:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:07:10] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:07:15] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:07:31] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:07:35] so, just to recap: .22 introduces new fields to the cached config, but the periodic job that writes the config runs group2 code (so still on .21). thus, testwiki (group0) is failing to read the new fields it expects. is that correct? 
[17:07:36] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [17:08:11] !log joal@deploy1003 Finished deploy [analytics/refinery@8d91f24]: Regular analytics weekly train [analytics/refinery@8d91f242] (duration: 07m 47s) [17:08:26] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1110.eqiad.wmnet with OS trixie [17:08:27] !log joal@deploy1003 Started deploy [analytics/refinery@8d91f24] (thin): Regular analytics weekly train THIN [analytics/refinery@8d91f242] [17:08:44] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS trixie [17:09:29] (03CR) 10Btullis: [C:03+2] Update the container used for the dumps toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265478 (https://phabricator.wikimedia.org/T398436) (owner: 10Btullis) [17:10:24] !log joal@deploy1003 Finished deploy [analytics/refinery@8d91f24] (thin): Regular analytics weekly train THIN [analytics/refinery@8d91f242] (duration: 01m 56s) [17:10:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1003.eqiad.wmnet [17:11:32] (03Merged) 10jenkins-bot: Update the container used for the dumps toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265478 (https://phabricator.wikimedia.org/T398436) (owner: 10Btullis) [17:12:24] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [17:12:52] what we think right now is that testwiki is fetching nothing because the cache for v2 is empty because the maintenance job is running with --wiki aawiki which is in group 2 [17:13:10] that's why we want to run the maintenance job for testwiki, to populate v2 cache there [17:13:28] and check if we can see experiment configs there [17:13:31] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [17:13:40] FIRING: 
SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:17] (03PS3) 10BCornwall: site.pp/conftool: Remove deprecated hcaptcha nodes [puppet] - 10https://gerrit.wikimedia.org/r/1264748 (https://phabricator.wikimedia.org/T411097) [17:15:17] (03PS3) 10BCornwall: Remove deprecated hcaptcha role [puppet] - 10https://gerrit.wikimedia.org/r/1264749 (https://phabricator.wikimedia.org/T411097) [17:16:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet [17:18:30] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[1031,2034]*: Upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [17:18:35] sfaci: ah, okay - thanks for providing these details. so, [0] was cherrypicked into .22, so testwiki should indeed only be *reading* from the VERSION = 2. where does the cache key for *writes* by your maintenance script get created? 
[17:18:35] [0] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TestKitchen/+/1265436 [17:18:37] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [17:18:55] swfrench-wmf: yes [17:20:47] (03PS1) 10BCornwall: Remove proxoid dns records [dns] - 10https://gerrit.wikimedia.org/r/1265502 (https://phabricator.wikimedia.org/T411097) [17:21:07] ah, never mind - I now see ConfigsFetcher is also the *writer* [17:21:17] (03PS2) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265388 [17:21:20] was going to ask how you keep these key structures in sync :) [17:21:39] Yes, ConfigsFetcher is the piece that keeps the cache in sync [17:24:29] 06SRE, 10SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11774509 (10hnowlan) [17:25:00] 06SRE, 10SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11774511 (10hnowlan) Thanks for the request @MPostoronca-WMF - I can't see you on the contact list, could you please tag your manager on this ticket for approval of this request? [17:25:11] swfrench-wmf: according to that scenario, do you think we can manually run the maintenance job for testwiki to try to fix the thing?
[17:25:20] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage [17:29:25] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage [17:30:14] (03CR) 10Ssingh: [C:03+1] Remove deprecated hcaptcha role [puppet] - 10https://gerrit.wikimedia.org/r/1264749 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [17:30:56] sfaci: so, if your cherrypick of the version bump has been backported to .22, and indeed if testwiki is failing to read a config at all, then that's consistent with that being true, running the maintenance script for testwiki _should_ have it write the config to the VERSION = 2 global key. [17:32:06] (03PS1) 10BCornwall: service: Set proxoid to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1265505 (https://phabricator.wikimedia.org/T411097) [17:32:09] (03PS1) 10BCornwall: service: rm hcaptcha_proxy, set to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1265506 (https://phabricator.wikimedia.org/T411097) [17:32:09] (03CR) 10Ssingh: [C:03+1] Remove proxoid dns records [dns] - 10https://gerrit.wikimedia.org/r/1265502 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [17:32:49] (03PS1) 10AKhatun: stream: mediawiki.page_edit_type_simple.dev1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265507 (https://phabricator.wikimedia.org/T421005) [17:34:12] (03CR) 10BCornwall: [C:03+2] Remove proxoid dns records [dns] - 10https://gerrit.wikimedia.org/r/1265502 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [17:34:21] swfrench-wmf: ok! Thanks! and what about the code that is running on wmf.21 with the v1 cache? will running the maintenance job in testwiki be a problem for those wikis? 
[17:34:55] !log brett@dns1006 START - running authdns-update [17:35:14] (03CR) 10AKhatun: [C:03+2] stream: mediawiki.page_edit_type_simple.dev1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265507 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun) [17:35:54] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase[1031,2034]*: Upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [17:36:00] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [17:36:32] !log brett@dns1006 END - running authdns-update [17:36:49] swfrench-wmf: by the way, in the case we have to backport something pretty soon after running the job, are you ok with us deploying a thing to mediawiki-config? [17:36:58] (03Merged) 10jenkins-bot: stream: mediawiki.page_edit_type_simple.dev1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265507 (https://phabricator.wikimedia.org/T421005) (owner: 10AKhatun) [17:37:44] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [17:37:50] !log akhatun@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [17:39:05] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on hcaptcha[1001-1002,2001-2002].wikimedia.org with reason: Decommissioning [17:39:47] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1004.eqiad.wmnet [17:39:55] sfaci: so, the `--wiki` parameter passed to the maintenance script is what selects which code is used, the same was as that selection works for an HTTP request. if .22 has been updated to write to VERSION = 2, then running the script with --wiki=testwiki should only update specifically that key (not VERSION = 1). 
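The versioned-key behaviour described in the message above can be sketched roughly as follows. This is a minimal illustration in Python, not the TestKitchen extension's PHP code: `CACHE`, `WIKI_VERSION`, `cache_key`, `run_maintenance_script`, `read_configs`, and the key layout are all hypothetical names chosen for the example. It only assumes what the conversation states: the wiki passed via `--wiki` selects which code runs, and that code's VERSION constant decides which global key is read or written.

```python
# Hypothetical model: each wiki runs the code of its deployment group, and the
# cache key embeds the VERSION constant of whichever code is running.
CACHE = {}

# Assumed mapping for illustration only: testwiki is on the new code
# (VERSION = 2), aawiki is still on the old code (VERSION = 1).
WIKI_VERSION = {"testwiki": 2, "aawiki": 1}


def cache_key(wiki):
    """Build the global key using the VERSION of the code serving this wiki."""
    return f"experiment-configs:v{WIKI_VERSION[wiki]}"


def run_maintenance_script(wiki, configs):
    """Writer: stores configs under the key for the code selected by --wiki."""
    CACHE[cache_key(wiki)] = configs


def read_configs(wiki):
    """Reader: looks up the key for the code serving this wiki."""
    return CACHE.get(cache_key(wiki))


# The periodic job runs with --wiki=aawiki, so only the v1 key is written.
run_maintenance_script("aawiki", {"fields": "old"})
before = read_configs("testwiki")  # the v2 key is still empty here

# Running the script once with --wiki=testwiki fills the v2 key only.
run_maintenance_script("testwiki", {"fields": "new"})
after = read_configs("testwiki")
old_wiki_view = read_configs("aawiki")  # the v1 key is untouched
```

Under these assumptions, wikis still on the old code keep reading their own key, which matches the point that a manual run for testwiki should not affect them.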
[17:40:10] *the same way [17:42:46] (this is for the same reason that running the maintenance script every minute with --wiki=aawiki *doesn't* update the VERSION = 2 key) [17:44:04] swfrench-wmf: Cool! Thanks for all this! [17:46:13] are we ok to deploy some config changes if needed in the next hour? [17:46:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet [17:46:48] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[1031,2034]*,aqs[1010,2001]*: Actually upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [17:46:54] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [17:48:21] sfaci: this is an example showing that VERSION is indeed 2 in a maintanance script invoked in testwiki https://www.irccloud.com/pastebin/WMZbX1AY/ [17:48:44] no objections to config patches on my end (I don't plan to use the rest of the infra window) [17:49:45] (and running that same example in aawiki shows VERSION = 1, in contrast) [17:51:25] (03CR) 10BCornwall: [C:03+1] Add Kubernetes POD IP reverse range delegations for wikikube-ctrl200[4-5] [dns] - 10https://gerrit.wikimedia.org/r/1265480 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [17:52:04] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1110.eqiad.wmnet with OS trixie [17:52:28] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1110.* [17:53:27] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS trixie [17:56:39] (03CR) 10Jdlrobson: "Which is the right category?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238787 (https://phabricator.wikimedia.org/T414852) (owner: 10Jdlrobson) [17:59:14] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 216302 [18:00:38] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 216302 [18:02:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 514948568 and 46 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:04:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3660688 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:06:41] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:07:33] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS trixie [18:07:52] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS trixie [18:08:34] (03CR) 10Ssingh: [C:03+1] service: rm hcaptcha_proxy, set to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1265506 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [18:08:56] (03CR) 10Ssingh: [C:03+1] service: Set proxoid to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1265505 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [18:10:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2003.codfw.wmnet [18:11:37] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:12:40] (03CR) 10BCornwall: [C:03+2] service: Set proxoid to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1265505 
(https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [18:14:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:15:47] (03CR) 10Cathal Mooney: [C:03+1] eqsin: add routed ganeti ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1265456 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [18:16:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2003.codfw.wmnet [18:17:01] (03CR) 10Cathal Mooney: [C:03+1] eqsin routed ganeti: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1265453 (https://phabricator.wikimedia.org/T421863) (owner: 10Ayounsi) [18:19:42] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase[1031,2034]*,aqs[1010,2001]*: Actually upgrade Cassandra to 4.1.11 — T418417 - eevans@cumin1003 [18:19:48] T418417: Upgrade Cassandra clusters to 4.1.11 - https://phabricator.wikimedia.org/T418417 [18:19:48] (03CR) 10BCornwall: [C:03+2] service: rm hcaptcha_proxy, set to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1265506 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [18:20:19] (03PS1) 10Snwachukwu: Media-analytics Image version change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265523 (https://phabricator.wikimedia.org/T415202) [18:20:31] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:20:35] PROBLEM - 
PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:21:14] swfrench-wmf: Thank you very much for your help! We ran the maintenance job according to your instructions and things are being fixed. It worked! [18:21:35] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:21:41] LVS errors are expected, I'm removing the old proxoid services [18:22:00] er, *upcoming* errors are expected, not the previous wdqs ones :) [18:22:11] sfaci: awesome, that's great to hear! :) [18:22:31] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:23:08] brett WDQS errors are always expected ;P [18:23:26] ha, fair [18:24:12] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs (T411097) [18:24:21] T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097 [18:24:42] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs (T411097) [18:26:21] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs (T411097) [18:26:34] ryankemper just fixed up some auto-remediation for WDQS which helps a lot, but we may need to do some turnilo/requestctl if this keeps up [18:26:41] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Power Supply - Status - 
issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T420948#11774886 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:26:44] FIRING: [3x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:26:52] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs (T411097) [18:28:02] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs (T411097) [18:28:42] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs (T411097) [18:29:53] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs (T411097) [18:29:59] T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097 [18:30:40] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs (T411097) [18:30:43] (03CR) 10Ottomata: [C:03+2] Media-analytics Image version change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265523 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [18:31:29] RESOLVED: [9x] NodeTextfileStale: Stale textfile for wdqs1012:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:31:29] RESOLVED: [3x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:31:38] RESOLVED: [3x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:32:41] (03Merged) 10jenkins-bot: Media-analytics Image version change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265523 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [18:33:35] !log sudo -i cumin 'A:lvs-secondary-codfw or A:lvs-low-traffic-codfw' 'ipvsadm --delete-service --tcp-service 10.2.1.12:4260' - T411097 [18:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:44] !log sudo -i cumin 'A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad' 'ipvsadm --delete-service --tcp-service 10.2.2.12:4260' - T411097 [18:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:37] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS trixie [18:36:05] (03PS4) 10Scott French: hieradata: disable and remove unused image-suggestion listener [puppet] - 10https://gerrit.wikimedia.org/r/1178657 (https://phabricator.wikimedia.org/T368096) [18:36:07] (03PS3) 10Scott French: service: move image-suggestion to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1198575 (https://phabricator.wikimedia.org/T368096) [18:37:34] (03PS1) 10Clare Ming: Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T408186) [18:38:39] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [18:38:51] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:38:51] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:40:11] (03CR) 10Santiago Faci: [C:03+1] Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T408186) (owner: 10Clare Ming) [18:40:17] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS trixie [18:40:47] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:46:05] (03CR) 10BCornwall: [C:03+2] site.pp/conftool: Remove deprecated hcaptcha nodes [puppet] - 10https://gerrit.wikimedia.org/r/1264748 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [18:46:09] (03CR) 10BCornwall: [C:03+2] Remove deprecated hcaptcha role [puppet] - 10https://gerrit.wikimedia.org/r/1264749 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [18:49:53] (03PS4) 10BCornwall: site.pp/conftool: Remove deprecated hcaptcha nodes [puppet] - 10https://gerrit.wikimedia.org/r/1264748 (https://phabricator.wikimedia.org/T411097) [18:49:53] (03PS4) 10BCornwall: Remove deprecated hcaptcha role [puppet] - 10https://gerrit.wikimedia.org/r/1264749 (https://phabricator.wikimedia.org/T411097) [18:50:27] (03CR) 10Dr0ptp4kt: [C:03+1] Update the Test Kitchen maintenance script to target testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T408186) (owner: 10Clare Ming) [18:51:57] PROBLEM - MariaDB Replica Lag: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 659.75 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica 
[18:52:46] (03CR) 10BCornwall: [C:03+2] site.pp/conftool: Remove deprecated hcaptcha nodes [puppet] - 10https://gerrit.wikimedia.org/r/1264748 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [18:52:49] (03CR) 10BCornwall: [C:03+2] Remove deprecated hcaptcha role [puppet] - 10https://gerrit.wikimedia.org/r/1264749 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [18:55:26] (03PS1) 10BCornwall: Revert "Remove deprecated hcaptcha role" [puppet] - 10https://gerrit.wikimedia.org/r/1265531 [18:55:29] (03PS1) 10BCornwall: Revert "site.pp/conftool: Remove deprecated hcaptcha nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1265532 [18:56:27] (03CR) 10BCornwall: [C:03+2] Revert "site.pp/conftool: Remove deprecated hcaptcha nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1265532 (owner: 10BCornwall) [18:57:03] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage [18:57:33] (03CR) 10BCornwall: [C:03+2] Revert "Remove deprecated hcaptcha role" [puppet] - 10https://gerrit.wikimedia.org/r/1265531 (owner: 10BCornwall) [18:57:38] (03PS1) 10Papaul: Add BGP sessions from mr1-eqiad to cr1/2.eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1265533 (https://phabricator.wikimedia.org/T421238) [19:02:06] rzl: Hi again! Sorry for being a bit distracted before. We were dealing with a bug that blocked the train and so on. Now that everything is fixed I have read your previous comment carefully and also see the comment that elukey posted in the corresponding SLO change. 
I assume that the dashboard mentioned there is what you mean about moving to a new platform, so I fully agree with you on leaving the current patch unchanged [19:02:18] it doesn't make sense to change something that we are going to stop using pretty soon [19:02:37] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs (T411097) [19:02:37] We'll abandon the patch then [19:02:41] Thank you very much! [19:02:43] T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097 [19:03:11] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:03:25] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs (T411097) [19:03:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage [19:03:38] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs (T411097) [19:04:11] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:04:25] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw and A:lvs (T411097) [19:04:40] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs (T411097) [19:05:21] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:05:42] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on
A:lvs-secondary-eqiad and A:lvs (T411097) [19:05:51] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs (T411097) [19:06:31] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:07:17] (03PS1) 10BCornwall: Revert^2 "site.pp/conftool: Remove deprecated hcaptcha nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1265536 [19:07:19] (03PS1) 10BCornwall: Revert^2 "Remove deprecated hcaptcha role" [puppet] - 10https://gerrit.wikimedia.org/r/1265537 [19:07:29] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad and A:lvs (T411097) [19:08:22] (03CR) 10BCornwall: [C:03+2] Revert^2 "site.pp/conftool: Remove deprecated hcaptcha nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1265536 (owner: 10BCornwall) [19:09:46] (03CR) 10BCornwall: [C:03+2] Revert^2 "Remove deprecated hcaptcha role" [puppet] - 10https://gerrit.wikimedia.org/r/1265537 (owner: 10BCornwall) [19:11:02] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:11:07] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:12:09] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:12:14] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:13:31] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:13:36] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:13:41] (03PS1) 10BCornwall: service: Remove 
remainder of proxoid definitions [puppet] - 10https://gerrit.wikimedia.org/r/1265541 (https://phabricator.wikimedia.org/T411097) [19:14:45] (03CR) 10Ssingh: [C:03+1] service: Remove remainder of proxoid definitions [puppet] - 10https://gerrit.wikimedia.org/r/1265541 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [19:14:53] (03CR) 10BCornwall: [C:03+2] service: Remove remainder of proxoid definitions [puppet] - 10https://gerrit.wikimedia.org/r/1265541 (https://phabricator.wikimedia.org/T411097) (owner: 10BCornwall) [19:15:04] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:15:09] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:16:57] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [19:17:09] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [19:17:26] (03CR) 10Cathal Mooney: [C:03+1] "LGTM. The naming of the bgp group "sw_mr" is kind of unfortunate given this is on a CR. 
But as it won't be there for too long I think pr" [homer/public] - 10https://gerrit.wikimedia.org/r/1265533 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul) [19:17:41] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_proxoid.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:19:45] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:20:11] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:21:09] !log brett@cumin2002 START - Cookbook sre.hosts.decommission for hosts hcaptcha[1001-1002,2001-2002].wikimedia.org [19:22:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_proxoid.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:23:30] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [19:23:56] !log dancy@deploy1003 Started scap sync-world: (no justification provided) [19:24:27] (03CR) 10Cathal Mooney: "Actually I need to take back my vote. 
The config will need "local-address " in the 'Management' bgp group on the CR, and lik" [homer/public] - 10https://gerrit.wikimedia.org/r/1265533 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul) [19:25:33] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin depooling P{lvs6002.drmrs.wmnet} and A:liberica [19:26:01] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs6002.drmrs.wmnet} and A:liberica [19:26:19] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1112.eqiad.wmnet with OS trixie [19:26:30] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1112.* [19:26:47] (03CR) 10Cathal Mooney: [C:03+1] "It's good, it will work please go ahead and ignore me!" [homer/public] - 10https://gerrit.wikimedia.org/r/1265533 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul) [19:27:12] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS trixie [19:29:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:29:51] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1148 - https://phabricator.wikimedia.org/T421892#11775080 (10VRiley-WMF) a:03VRiley-WMF [19:30:29] !log dancy@deploy1003 Finished scap sync-world: (no justification provided) (duration: 06m 59s) [19:30:46] !log brett@cumin2002 START - Cookbook sre.dns.netbox [19:32:49] (03CR) 10Bking: [C:03+2] opensearch-cluster: Terminate TLS on the ingress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) (owner: 10Btullis) [19:34:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: 
cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:35:10] !log brett@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha[1001-1002,2001-2002].wikimedia.org decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [19:36:49] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6002.drmrs.wmnet} and A:liberica [19:36:50] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.loadbalancer.admin (exit_code=1) rebooting P{lvs6002.drmrs.wmnet} and A:liberica [19:37:23] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6002.drmrs.wmnet} and A:liberica [19:37:24] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.loadbalancer.admin (exit_code=1) rebooting P{lvs6002.drmrs.wmnet} and A:liberica [19:37:41] RESOLVED: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_proxoid.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:38:15] brett@cumin2002 decommission (PID 3406297) is awaiting input [19:38:26] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin pooling P{lvs6002.drmrs.wmnet} and A:liberica [19:38:37] !log brett@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha[1001-1002,2001-2002].wikimedia.org decommissioned, removing all IPs except the asset tag one - brett@cumin2002" [19:38:37] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:38:38] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha[1001-1002,2001-2002].wikimedia.org [19:38:51] 
06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11775112 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brett@cumin2002 for hosts: `hcaptcha[1001-1002,2001-2002].wikimedia.org` - hca... [19:38:54] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs6002.drmrs.wmnet} and A:liberica [19:39:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:42:59] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6002.drmrs.wmnet} and A:liberica [19:43:20] 10ops-eqiad, 06SRE, 06DC-Ops: Discrepancy for wikikube-worker[1360-1372] - https://phabricator.wikimedia.org/T421442#11775120 (10VRiley-WMF) Updated MAC addresses for wikikube-worker1370, 1371, and 1372. [19:44:43] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage [19:46:23] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs6002.drmrs.wmnet} and A:liberica [19:48:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage [19:49:45] 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11775138 (10VRiley-WMF) I fear the problem will still be there. As it turns out, it's going to be a bit of a mystery due to the fact that iDrac isn't showing anything wrong at... 
[19:50:32] 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11775140 (10VRiley-WMF) I'll be checking in with dell again, and talking to the other engineers as well about this. [19:58:58] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6003.drmrs.wmnet} and A:liberica [19:58:59] !log cdobbins@cumin2002 END (ERROR) - Cookbook sre.loadbalancer.admin (exit_code=97) rebooting P{lvs6003.drmrs.wmnet} and A:liberica [19:59:49] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6001.drmrs.wmnet} and A:liberica [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T2000). Please do the needful. [20:00:05] AaronSchulz and manfredi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] (03PS1) 10Bking: opensearch-ipoid: pin to chart version 0.0.17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265553 (https://phabricator.wikimedia.org/T419289) [20:00:19] Hey I am here [20:00:40] * AaronSchulz is around [20:01:41] (03PS1) 10Snwachukwu: Update Media-analytics helmfile.d global-staging to use cassandra Staging Hosts. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1265555 (https://phabricator.wikimedia.org/T415202) [20:03:15] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs6001.drmrs.wmnet} and A:liberica [20:04:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261732 (https://phabricator.wikimedia.org/T419429) (owner: 10Aaron Schulz) [20:04:45] * AaronSchulz goes [20:05:23] (03Merged) 10jenkins-bot: Move all analytics API sandbox entries to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261732 (https://phabricator.wikimedia.org/T419429) (owner: 10Aaron Schulz) [20:05:52] !log aaron@deploy1003 Started scap sync-world: Backport for [[gerrit:1261732|Move all analytics API sandbox entries to testwiki (T419429)]] [20:05:57] Hi, could someone take a look at my scheduled deploy in this window? It’s ready to go when you are. Thanks! [20:05:59] T419429: [SPIKE?] Create an API module for the Analytics API - https://phabricator.wikimedia.org/T419429 [20:07:00] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 485134176 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:07:51] !log aaron@deploy1003 aaron: Backport for [[gerrit:1261732|Move all analytics API sandbox entries to testwiki (T419429)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:08:48] !log aaron@deploy1003 aaron: Continuing with sync [20:08:56] manfredi: do those two patches require a certain order? 
[20:09:00] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 128744 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:09:08] yes [20:09:18] 1264922 first [20:10:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11775177 (10VRiley-WMF) 05Open→03Resolved [20:12:28] AaronSchulz: are you handling the deploy for this window, or should I check with someone else? [20:12:58] !log aaron@deploy1003 Finished scap sync-world: Backport for [[gerrit:1261732|Move all analytics API sandbox entries to testwiki (T419429)]] (duration: 07m 05s) [20:13:04] T419429: [SPIKE?] Create an API module for the Analytics API - https://phabricator.wikimedia.org/T419429 [20:13:04] (03CR) 10Scardenasmolinar: "I just have one question before I can +1 this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264631 (https://phabricator.wikimedia.org/T421415) (owner: 10Kgraessle) [20:13:18] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1114.eqiad.wmnet with OS trixie [20:13:48] manfredi: I can do it if no regular is around [20:14:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:14:46] i'm also here if needed now [20:14:54] Thanks [20:15:02] AaronSchulz: do you want to do the honors or should i? [20:15:14] urbanecm: oh, you can go then :) [20:15:23] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1114.* [20:15:23] * AaronSchulz needs to walk his dog soon [20:15:45] manfredi: double checking, 1264922 needs to go _first_? 
[20:15:48] 06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11775191 (10BCornwall) The LVS service has been remove, the hosts, decommissioned, and the hcaptcha_proxy module removed from puppet. I'm not sure that any... [20:15:52] (aka, it is not enough to do both at the same time) [20:16:06] Yes 1264922 first [20:16:19] (03CR) 10Urbanecm: [C:03+2] Add instrumentation for email confirmation lifecycle events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264922 (https://phabricator.wikimedia.org/T420007) (owner: 10Catrope) [20:17:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264922 (https://phabricator.wikimedia.org/T420007) (owner: 10Catrope) [20:19:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:20:42] (03Merged) 10jenkins-bot: Add instrumentation for email confirmation lifecycle events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264922 (https://phabricator.wikimedia.org/T420007) (owner: 10Catrope) [20:21:09] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1264922|Add instrumentation for email confirmation lifecycle events (T420007)]] [20:21:16] T420007: Measurement plan: Email confirmation banner instrumentation - https://phabricator.wikimedia.org/T420007 [20:23:06] !log urbanecm@deploy1003 urbanecm, catrope: Backport for [[gerrit:1264922|Add instrumentation for email confirmation lifecycle events (T420007)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:23:16] manfredi: can you test? [20:23:25] (03CR) 10Urbanecm: [C:03+2] Email confirmation banner: Add Test Kitchen A/B gating [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264921 (https://phabricator.wikimedia.org/T421366) (owner: 10Catrope) [20:23:38] yes, give me a bit [20:23:59] sure [20:26:52] Everything's good [20:29:43] perf [20:29:46] !log urbanecm@deploy1003 urbanecm, catrope: Continuing with sync [20:31:19] manfredi: CI fails for the second patch [20:31:38] let me have a look [20:33:56] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264922|Add instrumentation for email confirmation lifecycle events (T420007)]] (duration: 12m 46s) [20:34:05] T420007: Measurement plan: Email confirmation banner instrumentation - https://phabricator.wikimedia.org/T420007 [20:34:25] (03CR) 10CI reject: [V:04-1] Email confirmation banner: Add Test Kitchen A/B gating [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264921 (https://phabricator.wikimedia.org/T421366) (owner: 10Catrope) [20:34:32] urbanecm: This looks like an npm issue in CI rather than something by my patch. 
I’ll retry to confirm if it’s flaky [20:34:48] (03CR) 10Urbanecm: Email confirmation banner: Add Test Kitchen A/B gating [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264921 (https://phabricator.wikimedia.org/T421366) (owner: 10Catrope) [20:34:53] (03CR) 10Urbanecm: [C:03+2] Email confirmation banner: Add Test Kitchen A/B gating [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264921 (https://phabricator.wikimedia.org/T421366) (owner: 10Catrope) [20:35:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:35:26] ehh...that doesn't seem good [20:40:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:44:45] (03Merged) 10jenkins-bot: Email confirmation banner: Add Test Kitchen A/B gating [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1264921 (https://phabricator.wikimedia.org/T421366) (owner: 10Catrope) [20:46:52] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [20:47:08] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [20:47:21] (03CR) 10Ottomata: [C:03+1] Update Media-analytics helmfile.d global-staging to use cassandra Staging Hosts. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1265555 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [20:48:08] (03PS5) 10Jdlrobson: Enable parser survey for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238787 (https://phabricator.wikimedia.org/T414852) [20:49:31] (03PS1) 10Eevans: cassandra: pin dev package to 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1265567 (https://phabricator.wikimedia.org/T418417) [20:50:21] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1265567 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [20:52:16] (03CR) 10Eevans: [C:03+2] cassandra: pin dev package to 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1265567 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [20:53:32] urbanecm: it went through [20:58:40] (03CR) 10C. Scott Ananian: [C:03+1] "works for me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238787 (https://phabricator.wikimedia.org/T414852) (owner: 10Jdlrobson) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260331T2100) [21:03:59] !log dancy@deploy1003 Installing scap version "4.243.0" for 2 host(s) [21:05:51] !log dancy@deploy1003 Installation of scap version "4.243.0" completed for 2 hosts [21:10:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:13:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:17] I 
know I'm late, but can I go ahead and use the web team deploy window? [21:15:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:16:02] Jdlrobson: check what dancy is doing with scap first I think [21:16:18] I'm done. have at it! [21:16:22] Reedy: ack. Thanks dancy ! [21:16:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238787 (https://phabricator.wikimedia.org/T414852) (owner: 10Jdlrobson) [21:17:21] (03Merged) 10jenkins-bot: Enable parser survey for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1238787 (https://phabricator.wikimedia.org/T414852) (owner: 10Jdlrobson) [21:17:49] https://www.irccloud.com/pastebin/opY3pPWt/ [21:17:51] ^ dancy something left over? [21:18:07] taking a look [21:18:11] (it won't let me see the diff) [21:18:30] It doesn't show the diff in the notification. You have to click through to the job log [21:18:32] (console) [21:19:16] got it.. looks like something to do with resources/src/mediawiki.emailConfirmationBanner/abTest.js [21:19:32] patch from Mmartorana [21:20:07] manfredi: still around? [21:20:23] yes [21:20:56] (03CR) 10Bking: query_service: Add Prometheus metrics to deadlock remediation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1262510 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [21:21:04] just catching up with the backport window but it seems like your patch is still on mediawiki-staging. Was urbanecm doing that deploy or you? 
[21:21:23] (03CR) 10Ryan Kemper: "Will address the other comments in subsequent patch after I see if this iteration of the code works" [puppet] - 10https://gerrit.wikimedia.org/r/1262510 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [21:21:40] (03CR) 10Bking: [C:03+1] query_service: Add Prometheus metrics to deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1262510 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [21:22:01] it looks like https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1264921 wasn't deployed (Do you want it deployed?) [21:22:43] dancy: i'm also not sure if spiderpig allows me to clean up this sort of thing ? [21:22:46] Jdlrobson: btw, it looks like that change touches l10n, so it will probably be a slow deployment. [21:23:03] Jdlrobson: please hold [21:23:06] urbanecm: ack [21:24:13] (03CR) 10Ryan Kemper: [C:03+2] query_service: Add Prometheus metrics to deadlock remediation [puppet] - 10https://gerrit.wikimedia.org/r/1262510 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [21:24:22] (03CR) 10Bking: [C:03+1] P:opensearch::cirrus::test: Convert port to an integer [puppet] - 10https://gerrit.wikimedia.org/r/1260719 (owner: 10Majavah) [21:24:56] manfredi: you still here? [21:25:04] yes [21:25:15] !log urbanecm@deploy1003 Started scap sync-world: Email confirmation banner: Add Test Kitchen A/B gating (T421366) [21:25:20] T421366: Test Kitchen Experiment setup to measure the impact of the banner - https://phabricator.wikimedia.org/T421366 [21:27:02] Jdlrobson: apologies, i got distracted by an irl emergency [21:27:06] but i'm syncing now [21:29:16] urbanecm: no worries. Hope everything is okay now. [21:29:23] yep yep [21:29:44] urbanecm: will your sync also sync my change? [21:29:54] Jdlrobson: no, i am syncing this change manually [21:30:37] okay. Not sure how spider pig will handle this.. 
I don't see it in the interface any more, but its merged : https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1238787 [21:31:16] (03PS1) 10Eevans: admin: add FIDO key for eevans (spare) [puppet] - 10https://gerrit.wikimedia.org/r/1265580 [21:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 12.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:34:19] urbanecm: let me know when you are done and I'll try a few things (but may have some follow up questions :-)) [21:34:26] Jdlrobson: sure sure [21:36:00] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 156839480 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:36:41] manfredi: image is still building :-/ [21:37:00] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 17272 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:37:36] (03CR) 10Kamila Součková: [C:03+1] service: move image-suggestion to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1198575 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [21:37:43] (03CR) 10Kamila Součková: [C:03+1] hieradata: disable and remove 
unused image-suggestion listener [puppet] - 10https://gerrit.wikimedia.org/r/1178657 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [21:38:45] image build completed [21:38:54] so it should be on testwikis soon [21:42:15] (03PS1) 10Bking: opensearch-ipoid: pin to chart version 0.0.17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265553 (https://phabricator.wikimedia.org/T419289) [21:42:44] (03CR) 10Ryan Kemper: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265553 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking) [21:43:05] (03CR) 10Bking: [C:03+2] opensearch-ipoid: pin to chart version 0.0.17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265553 (https://phabricator.wikimedia.org/T419289) (owner: 10Bking) [21:45:42] (03PS2) 10LorenMora: Legal Footer Link Deploys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265482 (https://phabricator.wikimedia.org/T420348) [21:46:48] hmm...scap sync-world doesn't stop on testservers? [21:46:51] dancy, do you know? [21:46:59] Not unless you tell it to [21:47:04] scap backport tells it to [21:47:20] makes sense [21:47:28] It's always save to control-C it if you want to stop it early [21:47:34] manfredi: okay, please take a look and if we have a problem, tell me [21:47:44] `--pause-after-testserver-sync` is the flag [21:47:49] ty [21:48:24] urbanecm: is it on mwdebug on test wiki ? [21:48:28] manfredi: yes [21:51:02] it does not seem to be working [21:51:12] in a bad way? [21:51:15] (as in, should i abort?) [21:51:22] mm more like no op [21:51:36] okay. then i'll let it finish probably [21:51:54] do you think its not active yet ? [21:52:00] manfredi: it should be on testwikis [21:52:17] (but it's rolling out to prod, as i didn't know the flag da.ncy mentioned) [21:52:19] im testing it on https://test.wikipedia.org/ [21:52:24] manfredi: with x-wikimedia-debug? [21:52:51] yeah with mwdebug extension [21:52:56] interesting... 
[21:53:20] the switch in the extension is surely on? which backend do you have selected? [21:53:36] k8s-mwdebug [21:53:40] also, if it is on, please try switching it off and on again [21:53:47] i noticed the extension stopping its routing for a while [21:54:17] yeah it keeps to switch off [21:54:48] wait its working [21:56:48] !log urbanecm@deploy1003 Finished scap sync-world: Email confirmation banner: Add Test Kitchen A/B gating (T421366) (duration: 31m 33s) [21:56:53] T421366: Test Kitchen Experiment setup to measure the impact of the banner - https://phabricator.wikimedia.org/T421366 [21:56:59] now it should be global [21:57:18] urbanecm: is it on prod now ? [21:57:22] yes [21:57:27] oh yeah it makes sense [21:57:30] (i accidentally invoked scap in a way that doesn't stop) [21:57:30] it is working [21:57:33] perf [21:58:01] yeah but the extension wasn't working for the first patch so i couldn't test it properly. Alright I'll find a way tomorrow [21:58:05] Thanks a lot [21:58:10] np, sorry for the delay [21:58:11] Jdlrobson: in that case, over to you? [22:01:29] urbanecm: looking now [22:01:35] ping me if you have qs [22:02:08] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1238787|Enable parser survey for all wikis (T414852)]] [22:02:14] T414852: Run a survey to understand why existing logged in users might be opting out of Parsoid - https://phabricator.wikimedia.org/T414852 [22:02:18] cool looks like I can just pick off where i left off :) [22:02:56] yep yep [22:04:11] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1238787|Enable parser survey for all wikis (T414852)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:04:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:04:48] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421970 (10phaultfinder) 03NEW [22:06:40] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [22:13:34] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1238787|Enable parser survey for all wikis (T414852)]] (duration: 11m 25s) [22:13:40] T414852: Run a survey to understand why existing logged in users might be opting out of Parsoid - https://phabricator.wikimedia.org/T414852 [22:16:15] (all done) [22:18:16] Jdlrobson: perf, so taking floor back [22:24:26] !log urbanecm@deploy1003 Started scap sync-world: T420154 [22:29:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:31:45] (03Abandoned) 10Santiago Faci: Test Kitchen SLOs: Renaming slos because of the Test Kitchen renaming [puppet] - 10https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381) (owner: 10Santiago Faci) [22:34:22] (03CR) 10Ryan Kemper: [C:03+1] elasticsearch: fix test for non-utc timezones [software/spicerack] - 10https://gerrit.wikimedia.org/r/1265466 (owner: 10Elukey) [22:44:05] !log urbanecm@deploy1003 urbanecm: T420154 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:45:26] !log urbanecm@deploy1003 urbanecm: Continuing with sync [22:48:34] (03CR) 10Bartosz Dziewoński: rest gateway: rate limiting for InstantCommons (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1263878 (owner: 10Daniel Kinzler) [22:51:12] (03CR) 10Jdlrobson: [C:03+1] Legal Footer Link Deploys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265482 (https://phabricator.wikimedia.org/T420348) (owner: 10LorenMora) [22:58:57] !log urbanecm@deploy1003 Finished scap sync-world: T420154 (duration: 34m 31s) [22:59:02] (03CR) 10Bartosz Dziewoński: rest gateway: add second Lua filter for header handling (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [23:02:43] (03CR) 10Bartosz Dziewoński: [C:03+1] rest gateway: add support for centralauthtoken (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [23:04:34] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:06:16] jouncebot: next [23:06:16] In 6 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T0600) [23:08:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260029 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle) [23:10:19] (03Merged) 10jenkins-bot: Enable $wgTrackMediaRequestProvenance on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260029 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle) [23:10:45] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1260029|Enable 
$wgTrackMediaRequestProvenance on group0 wikis (T414338)]] [23:10:52] T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338 [23:12:45] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1260029|Enable $wgTrackMediaRequestProvenance on group0 wikis (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:23:30] RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [23:38:07] (03PS1) 10Ladsgroup: LinksUpdate: Consolidate links virtual domains [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265623 (https://phabricator.wikimedia.org/T421914) [23:38:20] (03PS1) 10Ladsgroup: LinksUpdate: Consolidate links virtual domains [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265624 (https://phabricator.wikimedia.org/T421914) [23:39:06] Krinkle: please ping me once you're done, I have a couple of backports [23:39:17] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:41:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1265625 [23:41:57] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1265625 (owner: 10TrainBranchBot) [23:43:57] !log krinkle@deploy1003 krinkle: Continuing with sync [23:44:00] Amir1: ack [23:44:12] I always find more bugs.. 
[23:44:17] but not new ones fortunately [23:51:06] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1260029|Enable $wgTrackMediaRequestProvenance on group0 wikis (T414338)]] (duration: 40m 21s) [23:51:12] T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338 [23:52:10] Amir1: all yours [23:53:03] Thanks! [23:53:07] (03CR) 10Ladsgroup: [C:03+2] LinksUpdate: Consolidate links virtual domains [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265623 (https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup) [23:53:15] (03CR) 10Ladsgroup: [C:03+2] LinksUpdate: Consolidate links virtual domains [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265624 (https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup) [23:53:21] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1265625 (owner: 10TrainBranchBot) [23:55:34] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1015.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:55:36] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:57:34] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:57:36] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:57:56] (03Merged) 10jenkins-bot: LinksUpdate: Consolidate links virtual domains [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265623 
(https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup) [23:58:47] (03Merged) 10jenkins-bot: LinksUpdate: Consolidate links virtual domains [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265624 (https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup)