[00:00:33] (03PS1) 10Ladsgroup: Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) [00:00:50] (03PS1) 10Ladsgroup: util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265630 (https://phabricator.wikimedia.org/T414805) [00:00:57] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1265623|LinksUpdate: Consolidate links virtual domains (T421914)]], [[gerrit:1265624|LinksUpdate: Consolidate links virtual domains (T421914)]] [00:00:57] (03CR) 10CI reject: [V:04-1] Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:01:03] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [00:01:14] (03PS1) 10Ladsgroup: util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265631 (https://phabricator.wikimedia.org/T414805) [00:01:25] (03PS1) 10Ladsgroup: Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265632 (https://phabricator.wikimedia.org/T414805) [00:03:02] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1265623|LinksUpdate: Consolidate links virtual domains (T421914)]], [[gerrit:1265624|LinksUpdate: Consolidate links virtual domains (T421914)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[00:03:36] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [00:07:47] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265623|LinksUpdate: Consolidate links virtual domains (T421914)]], [[gerrit:1265624|LinksUpdate: Consolidate links virtual domains (T421914)]] (duration: 06m 50s) [00:07:50] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914 [00:09:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:09:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:10:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:10:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:12:01] (03CR) 10Ladsgroup: [C:03+2] util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265631 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:12:06] 
(03CR) 10Ladsgroup: [C:03+2] util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265630 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:15:03] (03CR) 10Ladsgroup: [C:03+2] Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:15:07] (03CR) 10Ladsgroup: [C:03+2] Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265632 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:15:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:16:16] (03CR) 10Ladsgroup: [C:04-1] "requires rebase since it's checked out minified 😞 For later then" [extensions/Popups] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:17:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:22:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.24% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:23:18] (03Merged) 10jenkins-bot: util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265631 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:23:27] (03Merged) 10jenkins-bot: util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265630 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:24:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:24:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:25:31] (03Merged) 10jenkins-bot: Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1265632 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:25:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:26:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 
21.05% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:27:18] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1265631|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265630|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265632|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]] [00:27:25] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [00:27:26] T411013: Popups should use standard thumbnail sizes - https://phabricator.wikimedia.org/T411013 [00:27:26] T421589: Page Previews uses low quality thumbnails - https://phabricator.wikimedia.org/T421589 [00:29:11] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1265631|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265630|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265632|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[00:32:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:35:49] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [00:37:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:39:59] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265631|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265630|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265632|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]] (duration: 12m 40s) [00:40:05] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [00:40:06] T411013: Popups should use standard thumbnail sizes - https://phabricator.wikimedia.org/T411013 [00:40:06] T421589: Page Previews uses low quality thumbnails - https://phabricator.wikimedia.org/T421589 [00:42:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 927.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:48:36] (03CR) 10Ladsgroup: [C:03+2] Pass whether the image is svg to 
adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:49:58] (03Merged) 10jenkins-bot: Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [00:52:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 811.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:53:30] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1265629|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]] [00:53:37] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [00:53:38] T411013: Popups should use standard thumbnail sizes - https://phabricator.wikimedia.org/T411013 [00:53:38] T421589: Page Previews uses low quality thumbnails - https://phabricator.wikimedia.org/T421589 [00:54:17] RESOLVED: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:55:27] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1265629|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[00:56:26] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [00:56:30] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [00:57:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:57:52] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [01:02:05] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265629|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]] (duration: 08m 35s) [01:02:12] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [01:02:12] T411013: Popups should use standard thumbnail sizes - https://phabricator.wikimedia.org/T411013 [01:02:12] T421589: Page Previews uses low quality thumbnails - https://phabricator.wikimedia.org/T421589 [01:02:15] RESOLVED: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:04:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 971.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:08:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:09:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 802.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:11:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1265654 [01:11:54] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1265654 (owner: 10TrainBranchBot) [01:12:38] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [01:12:43] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [01:12:57] PROBLEM - MariaDB Replica IO: s3 on clouddb1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:12:57] PROBLEM - MariaDB Replica SQL: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:13:15] RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:21:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.1% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:23:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext releases routed via main (k8s) 1.6s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:25:47] (03PS1) 10Krinkle: robots.php: Change Beta Cluster override from prepend to replace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265672 [01:28:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext releases routed via main (k8s) 1.6s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:28:47] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1265654 (owner: 10TrainBranchBot) [01:48:33] (03PS1) 10Dr0ptp4kt: Edit modules/admin/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1265675 [01:51:33] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [01:51:36] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [01:54:01] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - 
https://phabricator.wikimedia.org/T420053#11776109 (10AWesterinen) 05Resolved→03Open I believe that the problem is my two different accounts (I am unsure how I e... [01:58:35] (03PS2) 10Dr0ptp4kt: Edit modules/admin/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1265675 [02:00:22] (03PS3) 10Dr0ptp4kt: Update deployment key for dr0ptp4kt [puppet] - 10https://gerrit.wikimedia.org/r/1265675 [02:00:48] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:05:14] (03CR) 10Ottomata: [C:03+2] Update Media-analytics helmfile.d global-staging to use cassandra Staging Hosts. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265555 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [02:07:11] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 23s) [02:07:46] (03Merged) 10jenkins-bot: Update Media-analytics helmfile.d global-staging to use cassandra Staging Hosts. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265555 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [02:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:38] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [02:24:43] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [02:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:50:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:55:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:09:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:14:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:19:16] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11776183 (10Papaul) @SLyngshede-WMF thank you very much. [03:25:45] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11776198 (10Papaul) [04:24:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:29:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:14:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:17:19] (03CR) 10ArielGlenn: [C:03+1] "Nice cleanup, one typo noted." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: 10Bartosz Dziewoński) [05:17:57] RECOVERY - MariaDB Replica IO: s3 on clouddb1022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:17:57] RECOVERY - MariaDB Replica SQL: s3 on clouddb1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:26:17] !log Drop global_block_whitelist on closed wikis T420525 [05:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:20] T420525: Drop global_block_whitelist from closed wikis - https://phabricator.wikimedia.org/T420525 [05:30:57] RECOVERY - MariaDB Replica Lag: s3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:33:07] !log Drop empty ores_classification and ores_model on closed wikis T420093 [05:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:10] T420093: Drop ORES tables from wikis without ORES - https://phabricator.wikimedia.org/T420093 [05:52:26] (03PS1) 101F616EMO: arbcom_zhwiki: Enable SecurePoll without PII rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) [05:56:03] !log Drop empty tables cusi_case, cusi_user, and cusi_signal on wikis not listed at checkuser-suggested-investigations.dblist T421353 [05:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:06] T421353: Drop cusi_case, cusi_signal, and cusi_user tables from wikis where they are unused - https://phabricator.wikimedia.org/T421353 [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T0600) [06:14:59] (03CR) 10Muehlenhoff: 
[C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1265580 (owner: 10Eevans) [06:17:26] (03PS1) 10Marostegui: clouddb1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1265990 [06:17:42] (03PS3) 10Muehlenhoff: ncredir: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1250517 [06:22:02] (03CR) 10Marostegui: [C:03+2] clouddb1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1265990 (owner: 10Marostegui) [06:23:24] ayounsi@cumin1003 reimage (PID 699330) is awaiting input [06:29:56] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1366.eqiad.wmnet with OS trixie [06:30:13] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1366 [06:30:55] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [06:34:22] (03CR) 10ArielGlenn: [C:03+1] "Looks fine, though I am a bit out of the loop on the precedence of the various classes any more." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1263878 (owner: 10Daniel Kinzler) [06:34:47] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1366 - ayounsi@cumin1003" [06:34:57] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1366 - ayounsi@cumin1003" [06:34:57] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:34:57] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1366.eqiad.wmnet 200.48.64.10.in-addr.arpa 0.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [06:35:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1366.eqiad.wmnet 200.48.64.10.in-addr.arpa 0.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [06:35:02] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1366 [06:37:12] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [06:37:18] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [06:38:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1366 [06:38:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1366 [06:45:00] (03CR) 10Ayounsi: [C:03+1] Add BGP sessions from mr1-eqiad to cr1/2.eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1265533 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul) [06:50:16] !log ayounsi@cumin1003 START - Cookbook 
sre.hosts.downtime for 2:00:00 on wikikube-worker1366.eqiad.wmnet with reason: host reimage [06:52:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [06:54:10] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1366.eqiad.wmnet with reason: host reimage [07:00:04] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T0700). Please do the needful. [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:07:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:08:04] (03PS1) 10Kevin Bazira: ml-services: increase parallel prefilling and concurrent decoding to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266023 (https://phabricator.wikimedia.org/T418350) [07:09:01] (03CR) 10Elukey: [C:03+2] elasticsearch: fix test for non-utc timezones [software/spicerack] - 10https://gerrit.wikimedia.org/r/1265466 (owner: 10Elukey) [07:09:08] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:09:11] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:10:35] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1366.eqiad.wmnet with OS trixie [07:14:47] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:14:51] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:15:18] 
(03Abandoned) 10Elukey: First pass of ruff check --fix [software/spicerack] - 10https://gerrit.wikimedia.org/r/1265476 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [07:15:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1011.eqiad.wmnet [07:17:08] (03PS1) 10Brouberol: analytics/hadoop: allow fr-tech-users/admins to submi/manage jobs from the production queue [puppet] - 10https://gerrit.wikimedia.org/r/1266033 (https://phabricator.wikimedia.org/T417213) [07:19:18] (03CR) 10CI reject: [V:04-1] analytics/hadoop: allow fr-tech-users/admins to submi/manage jobs from the production queue [puppet] - 10https://gerrit.wikimedia.org/r/1266033 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [07:20:46] (03PS2) 10Brouberol: analytics/hadoop: allow fr-tech-users/admins to submi/manage YARN jobs [puppet] - 10https://gerrit.wikimedia.org/r/1266033 (https://phabricator.wikimedia.org/T417213) [07:22:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1011.eqiad.wmnet [07:22:55] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:24:11] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:24:14] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:26:12] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:26:14] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:26:51] !log installing postgresql security updates [07:26:52] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:56] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:27:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:27:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1012.eqiad.wmnet [07:31:09] (03PS2) 10Arnaudb: gerrit: update timeouts for gitiles [puppet] - 10https://gerrit.wikimedia.org/r/1265448 (https://phabricator.wikimedia.org/T421904) [07:32:46] (03PS1) 10Arnaudb: gerrit: increase packetGitWindowSize [puppet] - 10https://gerrit.wikimedia.org/r/1266044 (https://phabricator.wikimedia.org/T421904) [07:34:01] (03CR) 10Slyngshede: [C:03+1] bitu: Remove inactive approver [puppet] - 10https://gerrit.wikimedia.org/r/1265490 (owner: 10Muehlenhoff) [07:34:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1012.eqiad.wmnet [07:34:58] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1367.eqiad.wmnet with OS trixie [07:35:26] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1367 [07:35:37] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [07:36:01] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:36:05] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [07:39:18] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1367 - ayounsi@cumin1003" [07:39:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records 
for host wikikube-worker1367 - ayounsi@cumin1003" [07:39:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:39:23] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1367.eqiad.wmnet 201.48.64.10.in-addr.arpa 1.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [07:39:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1367.eqiad.wmnet 201.48.64.10.in-addr.arpa 1.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [07:39:28] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1367 [07:40:03] (03CR) 10Muehlenhoff: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: 10Dzahn) [07:40:05] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 140623440 and 36 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:40:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1367 [07:40:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1367 [07:41:05] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3656 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:44:46] 06SRE, 10SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11776454 (10MPostoronca-WMF) Hi @OKryva-WMF, could you please approve this request? 
Thank you [07:46:36] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [07:46:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:47:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90130 and previous config saved to /var/cache/conftool/dbconfig/20260401-074704-fceratto.json [07:47:08] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [07:48:27] (CR) JMeybohm: "`envoy_cluster_update*` is 5 series, `envoy_dns*` is 6." [puppet] - https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: JMeybohm) [07:49:13] (CR) JMeybohm: "If that happens to be too much, we can probably get away with just the `envoy_cluster_update` ones." 
[puppet] - 10https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: 10JMeybohm) [07:49:29] (03PS5) 10Daniel Kinzler: rest gateway: add support for centralauthtoken [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280) [07:51:15] (03PS2) 10Arnaudb: gerrit: increase packedGitWindowSize [puppet] - 10https://gerrit.wikimedia.org/r/1266044 (https://phabricator.wikimedia.org/T421904) [07:52:36] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1367.eqiad.wmnet with reason: host reimage [07:54:10] (03PS2) 10ArielGlenn: rest-gateway: add values for auth-newuser rate limiting class for feature patch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) [07:56:42] (03PS4) 10Daniel Kinzler: rest-gateway: Refactor request classification for readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: 10Bartosz Dziewoński) [07:56:43] (03PS3) 10Daniel Kinzler: rest gateway: rate limiting for InstantCommons [deployment-charts] - 10https://gerrit.wikimedia.org/r/1263878 [07:57:02] (03CR) 10Daniel Kinzler: rest-gateway: Refactor request classification for readability (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: 10Bartosz Dziewoński) [07:57:41] (03CR) 10Daniel Kinzler: rest gateway: rate limiting for InstantCommons (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1263878 (owner: 10Daniel Kinzler) [07:57:43] (03CR) 10Ozge: [C:03+1] ml-services: increase parallel prefilling and concurrent decoding to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266023 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [07:58:41] (03CR) 10Kevin Bazira: [C:03+2] ml-services: increase parallel 
prefilling and concurrent decoding to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266023 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [07:59:11] (03PS1) 10Tiziano Fogli: thanos/store: add a scrape target for the ruler instance [puppet] - 10https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) [07:59:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1367.eqiad.wmnet with reason: host reimage [07:59:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1262055 (owner: 10Elukey) [08:00:04] jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T0800) [08:00:27] hi, the train will be rolling out soon [08:00:41] (03Merged) 10jenkins-bot: ml-services: increase parallel prefilling and concurrent decoding to improve gpt isvc performance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266023 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [08:01:00] (03PS2) 10Tiziano Fogli: thanos/store: add a scrape target for the ruler instance [puppet] - 10https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) [08:01:04] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1088981560 and 90 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:01:36] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [08:01:48] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11776495 (10MatthewVernon) [08:03:04] RECOVERY - Postgres Replication Lag 
on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 137336 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:03:04] (03CR) 10CI reject: [V:04-1] thanos/store: add a scrape target for the ruler instance [puppet] - 10https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [08:03:13] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [08:04:18] (03PS3) 10Tiziano Fogli: thanos/store: add a scrape target for the ruler instance [puppet] - 10https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) [08:05:34] (03CR) 10Elukey: [C:03+2] profile::base::certificates: rename Puppet Internal CA's path [puppet] - 10https://gerrit.wikimedia.org/r/1262055 (owner: 10Elukey) [08:05:39] 10SRE-swift-storage, 10Ceph, 06Data-Persistence: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11776499 (10MatthewVernon) [08:06:21] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266075 (https://phabricator.wikimedia.org/T420480) [08:06:24] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266075 (https://phabricator.wikimedia.org/T420480) (owner: 10TrainBranchBot) [08:06:45] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90132 and previous config saved to /var/cache/conftool/dbconfig/20260401-080644-fceratto.json [08:06:48] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:07:00] !log upgrading Envoy on the Puppet servers to 1.35.9 T419637 T410975 [08:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:07:04] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [08:07:04] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [08:07:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:07:18] (Merged) jenkins-bot: group1 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266075 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot) [08:10:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1013.eqiad.wmnet [08:11:25] SRE, Infrastructure-Foundations, netops: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11776508 (MLechvien-WMF) [08:12:58] (PS6) Daniel Kinzler: rest gateway: add second Lua filter for header handling [deployment-charts] - https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) [08:13:13] (CR) Daniel Kinzler: rest gateway: add second Lua filter for header handling (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) (owner: Daniel Kinzler) [08:13:19] SRE-swift-storage, Ceph, Data-Persistence: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11776514 (jcrespo) > we're now asking service owners to re-image their existing baremetal servers We don't reimage backups hosts.... 
[08:14:18] (PS1) TrainBranchBot: group0 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266136 (https://phabricator.wikimedia.org/T420480) [08:14:20] (CR) TrainBranchBot: [C:+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266136 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot) [08:14:42] (PS1) MVernon: swift: drain 3 eqiad backends for reimage to per-rack VLAN [puppet] - https://gerrit.wikimedia.org/r/1266138 (https://phabricator.wikimedia.org/T421719) [08:15:13] (Merged) jenkins-bot: group0 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266136 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot) [08:16:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1367.eqiad.wmnet with OS trixie [08:16:21] SRE, SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11776521 (OKryva-WMF) Approve. 
[08:16:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P90134 and previous config saved to /var/cache/conftool/dbconfig/20260401-081652-fceratto.json [08:17:13] (03PS3) 10Arnaudb: gerrit: update timeouts for gitiles [puppet] - 10https://gerrit.wikimedia.org/r/1265448 (https://phabricator.wikimedia.org/T421904) [08:17:26] (03PS3) 10Arnaudb: gerrit: increase packedGitWindowSize [puppet] - 10https://gerrit.wikimedia.org/r/1266044 (https://phabricator.wikimedia.org/T421904) [08:18:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1013.eqiad.wmnet [08:20:53] (03CR) 10Jcrespo: [C:03+1] swift: drain 3 eqiad backends for reimage to per-rack VLAN [puppet] - 10https://gerrit.wikimedia.org/r/1266138 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [08:21:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1014.eqiad.wmnet [08:21:35] (03CR) 10MVernon: [C:03+2] swift: drain 3 eqiad backends for reimage to per-rack VLAN [puppet] - 10https://gerrit.wikimedia.org/r/1266138 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [08:21:38] (03PS3) 10Daniel Kinzler: rest-gateway: add values for auth-newuser rate limiting class for feature patch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [08:21:41] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.22 refs T420480 [08:21:43] T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480 [08:22:32] (03PS4) 10Daniel Kinzler: rest-gateway: add values for auth-newuser rate limiting class for feature patch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [08:23:02] (03CR) 10Daniel Kinzler: rest-gateway: add values for auth-newuser rate 
limiting class for feature patch (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [08:23:56] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1368.eqiad.wmnet with OS trixie [08:24:23] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1368 [08:24:37] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:25:18] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [08:27:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P90135 and previous config saved to /var/cache/conftool/dbconfig/20260401-082701-fceratto.json [08:27:26] ayounsi@cumin1003 reimage (PID 744305) is awaiting input [08:28:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1014.eqiad.wmnet [08:28:42] (03PS6) 10Daniel Kinzler: rest gateway: add support for centralauthtoken [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280) [08:29:11] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1266033 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [08:29:35] (03CR) 10Brouberol: [C:03+2] analytics/hadoop: allow fr-tech-users/admins to submi/manage YARN jobs [puppet] - 10https://gerrit.wikimedia.org/r/1266033 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [08:30:24] (03PS1) 10Jcrespo: installserver: Treat any attempt to reimage backup hosts as an error [puppet] - 10https://gerrit.wikimedia.org/r/1266148 (https://phabricator.wikimedia.org/T420506) [08:30:41] (03PS2) 10Jcrespo: installserver: Treat any attempt to reimage backup hosts as an error [puppet] - 10https://gerrit.wikimedia.org/r/1266148 (https://phabricator.wikimedia.org/T420506) [08:31:53] (03PS3) 10Jcrespo: installserver: Treat any attempt to reimage backup hosts as an error [puppet] - 10https://gerrit.wikimedia.org/r/1266148 (https://phabricator.wikimedia.org/T420506) [08:36:24] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986 (10MatthewVernon) 03NEW [08:36:35] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11776589 (10MatthewVernon) p:05Triage→03High [08:37:09] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90136 and previous config saved to /var/cache/conftool/dbconfig/20260401-083709-fceratto.json [08:37:13] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:37:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [08:37:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T419635)', diff saved to https://phabricator.wikimedia.org/P90137 and previous config saved to 
/var/cache/conftool/dbconfig/20260401-083733-fceratto.json [08:38:09] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [08:38:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:40:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T419635)', diff saved to https://phabricator.wikimedia.org/P90138 and previous config saved to /var/cache/conftool/dbconfig/20260401-084047-fceratto.json [08:41:17] 06SRE, 07SRE-Unowned, 07Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11776614 (10MLechvien-WMF) [08:42:27] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [08:42:30] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [08:43:45] ayounsi@cumin1003 reimage (PID 744305) is awaiting input [08:43:46] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-druid1003.eqiad.wmnet with OS bookworm [08:44:07] !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host an-druid1003 [08:44:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host an-druid1003 [08:44:17] !log installing Apache security updates [08:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:48] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [08:45:51] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [08:49:14] PSA: Train currently blocked at: T421988 [08:49:14] T421988: Failing deployment 
checks: URLs in Location header exepcted to be absolute, but relative found - https://phabricator.wikimedia.org/T421988 [08:49:23] 06SRE, 06Infrastructure-Foundations, 07ci-test-error, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Kubernetes: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values - https://phabricator.wikimedia.org/T421362#11776667 (10MLechvien-WMF) Routing this to #infrastructure-foundations as... [08:50:09] (03CR) 10MVernon: [C:03+1] installserver: Treat any attempt to reimage backup hosts as an error [puppet] - 10https://gerrit.wikimedia.org/r/1266148 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [08:50:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P90139 and previous config saved to /var/cache/conftool/dbconfig/20260401-085053-fceratto.json [08:52:25] 10SRE-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11776677 (10BTullis) [08:52:44] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1368 - ayounsi@cumin1003" [08:52:49] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1368 - ayounsi@cumin1003" [08:52:49] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:52:50] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1368.eqiad.wmnet 202.48.64.10.in-addr.arpa 2.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:52:55] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
wikikube-worker1368.eqiad.wmnet 202.48.64.10.in-addr.arpa 2.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:52:55] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1368 [08:53:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1368 [08:53:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1368 [08:54:25] (03PS4) 10Tiziano Fogli: thanos/store: add a scrape target for the ruler instance [puppet] - 10https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) [08:54:31] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [08:54:33] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [08:57:11] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [08:57:54] !log mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 5 skin (T406724) [08:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:58] T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724 [09:00:11] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1003.eqiad.wmnet with reason: host reimage [09:01:02] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P90141 and previous config saved to /var/cache/conftool/dbconfig/20260401-090101-fceratto.json [09:03:39] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST events) on k8s-dse@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-dse&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:03:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1003.eqiad.wmnet with reason: host reimage [09:05:33] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1368.eqiad.wmnet with reason: host reimage [09:05:50] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:05:54] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:08:04] (03CR) 10Muehlenhoff: [C:03+2] bitu: Remove inactive approver [puppet] - 10https://gerrit.wikimedia.org/r/1265490 (owner: 10Muehlenhoff) [09:09:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1368.eqiad.wmnet with reason: host reimage [09:11:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T419635)', diff saved to https://phabricator.wikimedia.org/P90142 and previous config saved to /var/cache/conftool/dbconfig/20260401-091109-fceratto.json [09:11:13] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:11:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [09:11:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90143 and previous config saved to /var/cache/conftool/dbconfig/20260401-091134-fceratto.json [09:14:13] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:14:16] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:20:54] 10ops-eqiad, 06DC-Ops: Inbound errors on interface cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T421989 (10phaultfinder) 03NEW [09:21:49] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:21:52] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:22:05] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11776765 (10Aklapper) Phabricator itself has no influence on other systems. Per https://phabricator.wikimedia.org/p/AWester... [09:23:27] (03CR) 10Btullis: [C:03+1] "Looks good." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: Trueg) [09:26:35] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1368.eqiad.wmnet with OS trixie [09:27:19] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1369.eqiad.wmnet with OS trixie [09:27:47] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1369 [09:27:56] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [09:28:56] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90146 and previous config saved to /var/cache/conftool/dbconfig/20260401-092855-fceratto.json [09:28:59] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:29:25] SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11776818 (hnowlan) Is the UA you've provided the one used by your InstantCommons? The internal recommendation is to use the latest maintenance release of InstantCommons as older versions... 
[09:31:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1003.eqiad.wmnet with OS bookworm [09:32:03] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1369 - ayounsi@cumin1003" [09:32:05] (03Abandoned) 10Arnaudb: gerrit: increase packedGitWindowSize [puppet] - 10https://gerrit.wikimedia.org/r/1266044 (https://phabricator.wikimedia.org/T421904) (owner: 10Arnaudb) [09:32:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1369 - ayounsi@cumin1003" [09:32:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:32:09] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1369.eqiad.wmnet 203.48.64.10.in-addr.arpa 3.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:32:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1369.eqiad.wmnet 203.48.64.10.in-addr.arpa 3.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:32:14] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1369 [09:32:36] 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11776844 (10hnowlan) Upon reviewing our logs, every 429 for Urbipedia that I see is for the user agent `QuickInstantCommons/1.5 MediaWiki/1.39.5; Urbipedia` - addressing this UA will most l... 
[09:32:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1369 [09:32:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1369 [09:32:54] (03CR) 10Jcrespo: [C:03+2] installserver: Treat any attempt to reimage backup hosts as an error [puppet] - 10https://gerrit.wikimedia.org/r/1266148 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [09:33:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:39:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P90147 and previous config saved to /var/cache/conftool/dbconfig/20260401-093903-fceratto.json [09:41:01] (03PS1) 10Hnowlan: admin: add mpostoronca to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1266170 (https://phabricator.wikimedia.org/T421471) [09:42:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [09:42:34] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265388 (owner: 10Muehlenhoff) [09:43:39] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST events) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-dse&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:44:25] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1369.eqiad.wmnet with reason: host reimage [09:45:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/1266170 (https://phabricator.wikimedia.org/T421471) (owner: 10Hnowlan) [09:45:22] (03PS1) 10Marostegui: Revert "clouddb1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1266177 [09:45:56] (03CR) 10Hnowlan: [C:03+2] admin: add mpostoronca to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1266170 (https://phabricator.wikimedia.org/T421471) (owner: 10Hnowlan) [09:47:26] (03CR) 10Marostegui: [C:03+2] Revert "clouddb1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1266177 (owner: 10Marostegui) [09:47:35] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11776929 (10hnowlan) 05Open→03In progress Your access has been added - the change should be live within the next 30 or so minutes. [09:49:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P90148 and previous config saved to /var/cache/conftool/dbconfig/20260401-094912-fceratto.json [09:50:28] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1369.eqiad.wmnet with reason: host reimage [09:50:54] (03PS1) 10Arnaudb: gerrit: update upstream_response_timeout for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1266181 (https://phabricator.wikimedia.org/T421827) [09:51:17] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:53:04] (03PS1) 10JMeybohm: CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 [09:53:20] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [09:54:02] (03CR) 10CI reject: [V:04-1] CI: Send User-Agent when 
fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 (owner: 10JMeybohm) [09:54:14] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [09:54:42] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11776945 (10MoritzMuehlenhoff) [09:54:56] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:55:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:55:39] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:55:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:57:39] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-druid1004.eqiad.wmnet with OS bookworm [09:58:03] !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host an-druid1004 [09:58:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host an-druid1004 [09:58:06] (03PS1) 10Muehlenhoff: Failover irc.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1266187 [09:59:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90149 and previous config saved to /var/cache/conftool/dbconfig/20260401-095920-fceratto.json [09:59:23] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [09:59:37] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [09:59:44] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1188 (T419635)', diff saved to 
https://phabricator.wikimedia.org/P90150 and previous config saved to /var/cache/conftool/dbconfig/20260401-095943-fceratto.json [10:00:05] dusen and effie: May I have your attention please! MediaWiki infrastructure (UTC mid-day). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1000) [10:03:21] (03PS1) 10Jforrester: MemcachedWrapper: Hash key when longer than 250 characters [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190 [10:03:59] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:04:02] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:06:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:06:45] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:06:49] (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable tensor_parallel_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266195 (https://phabricator.wikimedia.org/T418350) [10:06:56] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1369.eqiad.wmnet with OS trixie [10:06:58] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [10:08:07] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:08:10] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [10:08:10] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:09:07] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: Refactor request classification for readability 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: 10Bartosz Dziewoński) [10:09:12] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: rate limiting for InstantCommons [deployment-charts] - 10https://gerrit.wikimedia.org/r/1263878 (owner: 10Daniel Kinzler) [10:09:16] (03CR) 10Ladsgroup: [C:03+1] "Unpopular opinion: We should shut our IRC service down 😄" [dns] - 10https://gerrit.wikimedia.org/r/1266187 (owner: 10Muehlenhoff) [10:09:16] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: add second Lua filter for header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [10:09:20] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [10:10:06] (03PS5) 10Daniel Kinzler: rest-gateway: add values for new rate limiting class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [10:10:18] (03PS6) 10Daniel Kinzler: rest-gateway: add values for new rate limiting classes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [10:10:22] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: add values for new rate limiting classes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [10:10:30] (03CR) 10Trueg: [C:03+2] wdqs-queryhammer: Deployment fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [10:10:51] (03CR) 10Muehlenhoff: "Noted, but not in scope for the current reboot :-)" [dns] - 10https://gerrit.wikimedia.org/r/1266187 (owner: 10Muehlenhoff) [10:10:54] (03CR) 10Muehlenhoff: [C:03+2] Failover irc.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1266187 (owner: 10Muehlenhoff) [10:11:16] (03CR) 10Ozge: 
[C:03+1] ml-services: update gpt isvc image to one that supports configurable tensor_parallel_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266195 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [10:11:26] (03Merged) 10jenkins-bot: rest-gateway: Refactor request classification for readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: 10Bartosz Dziewoński) [10:11:30] !log jmm@dns1004 START - running authdns-update [10:11:38] (03Merged) 10jenkins-bot: rest gateway: rate limiting for InstantCommons [deployment-charts] - 10https://gerrit.wikimedia.org/r/1263878 (owner: 10Daniel Kinzler) [10:11:53] (03Merged) 10jenkins-bot: rest gateway: add second Lua filter for header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [10:12:23] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [10:12:35] (03Merged) 10jenkins-bot: rest-gateway: add values for new rate limiting classes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [10:12:45] (03Merged) 10jenkins-bot: wdqs-queryhammer: Deployment fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [10:13:08] !log jmm@dns1004 END - running authdns-update [10:13:53] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1004.eqiad.wmnet with reason: host reimage [10:16:43] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264257 (owner: 10PipelineBot) [10:17:59] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90151 and previous config saved to 
/var/cache/conftool/dbconfig/20260401-101758-fceratto.json [10:18:02] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:18:02] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:18:06] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:18:40] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264257 (owner: 10PipelineBot) [10:19:01] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1370.eqiad.wmnet with OS trixie [10:19:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1004.eqiad.wmnet with reason: host reimage [10:19:32] (03PS7) 10Daniel Kinzler: rest gateway: add support for centralauthtoken [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280) [10:19:40] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1370 [10:19:47] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [10:20:42] 06SRE, 06ServiceOps new, 07Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? - https://phabricator.wikimedia.org/T421168#11777087 (10MLechvien-WMF) 05Open→03Declined We discussed with the team and don't see a link between the DC switchover a... 
[10:21:54] (03PS1) 10Daniel Kinzler: rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266204 [10:22:06] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266204 (owner: 10Daniel Kinzler) [10:24:22] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1370 - ayounsi@cumin1003" [10:24:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1370 - ayounsi@cumin1003" [10:24:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:24:28] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1370.eqiad.wmnet 204.48.64.10.in-addr.arpa 4.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:24:31] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1370.eqiad.wmnet 204.48.64.10.in-addr.arpa 4.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:24:32] !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1370 [10:24:37] (03PS2) 10Majavah: P:opensearch::cirrus::test: Convert port to an integer [puppet] - 10https://gerrit.wikimedia.org/r/1260719 [10:24:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1370 [10:24:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1370 [10:26:32] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:26:36] !log javiermonton@deploy1003 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:27:34] (03PS8) 10Daniel Kinzler: rest gateway: add support for centralauthtoken [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280) [10:28:06] (03CR) 10Majavah: [C:03+2] P:opensearch::cirrus::test: Convert port to an integer [puppet] - 10https://gerrit.wikimedia.org/r/1260719 (owner: 10Majavah) [10:28:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P90152 and previous config saved to /var/cache/conftool/dbconfig/20260401-102807-fceratto.json [10:29:26] ...this is taking a long time to merge... [10:29:30] (03PS2) 10Majavah: nftables: Fix issues around virtual resource dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1260721 [10:29:30] (03PS14) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [10:29:30] (03PS14) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [10:29:31] (03PS1) 10Majavah: P:base: Make nftables::set resources always defined [puppet] - 10https://gerrit.wikimedia.org/r/1266205 [10:29:41] i'm still not seeing the new chart on the deployment host [10:31:05] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that supports configurable tensor_parallel_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266195 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [10:31:22] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [10:32:59] Raine: looks like Zuul is stuck, it has been 10 minutes and https://integration.wikimedia.org/zuul/?#q=1266204 says "queued"... 
[10:33:35] fascinating [10:33:41] (that's code for "wtf") [10:33:55] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11777176 (10Marostegui) For db* related hosts (including pc*, es* and dbproxy*) will be tricky as this also requires changi... [10:34:32] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1266210 [10:35:39] Raine: it also says: "Queue lengths: 0 events, 0 results." [10:35:46] I'll try and re-trigger. [10:36:21] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266204 (owner: 10Daniel Kinzler) [10:36:22] zuul can get jammed up sometimes iirc, you might just need to ping in releng to get someone to look [10:36:38] yeah, exactly, ping #wikimedia-releng [10:36:48] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1370.eqiad.wmnet with reason: host reimage [10:38:12] *sigh* [10:38:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P90154 and previous config saved to /var/cache/conftool/dbconfig/20260401-103816-fceratto.json [10:38:25] I actually need to get this done including testing in the next 60 minutes... [10:38:42] or we have to revert the four patches [10:38:46] (03Merged) 10jenkins-bot: rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266204 (owner: 10Daniel Kinzler) [10:39:30] (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that supports configurable tensor_parallel_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266195 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [10:39:51] oh! oh! it went through! 
[10:40:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1370.eqiad.wmnet with reason: host reimage [10:40:45] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:40:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1003.wikimedia.org [10:41:00] (03CR) 10Hnowlan: [C:03+1] Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1266210 (owner: 10Muehlenhoff) [10:41:59] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:44:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1003.wikimedia.org [10:46:35] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:47:06] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:47:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1004.eqiad.wmnet with OS bookworm [10:47:34] !log installing libpng1.6 security updates on Trixie/Bookworm [10:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:35] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:47:39] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:48:19] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:48:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90156 and previous config saved to /var/cache/conftool/dbconfig/20260401-104823-fceratto.json [10:48:27] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:48:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [10:48:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T419635)', diff saved to https://phabricator.wikimedia.org/P90157 and previous config saved to /var/cache/conftool/dbconfig/20260401-104847-fceratto.json [10:49:23] (03CR) 10CI reject: [V:04-1] MemcachedWrapper: Hash key when longer than 250 characters [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190 (owner: 10Jforrester) [10:49:51] (03PS1) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) [10:51:01] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T419635)', diff saved to https://phabricator.wikimedia.org/P90158 and previous config saved to /var/cache/conftool/dbconfig/20260401-105059-fceratto.json [10:51:05] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 70775336 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:52:05] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3782224 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:52:34] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1266210 (owner: 10Muehlenhoff) [10:55:39] (03PS1) 10Blake: mw-web: downsize for multi-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974) [10:56:02] (03PS3) 10Majavah: nftables: Fix issues around virtual resource dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1260721 [10:56:02] (03PS2) 10Majavah: P:base: Make nftables::set resources always defined [puppet] - 10https://gerrit.wikimedia.org/r/1266205 [10:56:02] (03PS15) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [10:56:03] (03PS15) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [10:56:36] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260721 (owner: 10Majavah) [10:57:00] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
wikikube-worker1370.eqiad.wmnet with OS trixie [10:57:00] (03PS2) 10Blake: mw-web: downsize for multi-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974) [10:57:11] (03PS3) 10Blake: mw-web: downsize for multi-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974) [10:58:02] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:58:45] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:58:54] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: add support for centralauthtoken (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [11:00:05] mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1100). 
[11:01:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P90159 and previous config saved to /var/cache/conftool/dbconfig/20260401-110109-fceratto.json [11:01:12] (03Merged) 10jenkins-bot: rest gateway: add support for centralauthtoken [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [11:05:51] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:06:30] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:07:24] (03PS1) 10Muehlenhoff: Update Cumin alias for contint to also cover the spun-off Trixie role [puppet] - 10https://gerrit.wikimedia.org/r/1266215 [11:09:46] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:10:08] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:11:18] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P90160 and previous config saved to /var/cache/conftool/dbconfig/20260401-111117-fceratto.json [11:11:44] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:12:12] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:14:34] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777268 (10BTullis) OK, thanks for all of the input so far.... 
[11:15:46] (03PS1) 10Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 [11:16:35] !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:17:14] !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:17:58] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:18:31] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:21:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T419635)', diff saved to https://phabricator.wikimedia.org/P90161 and previous config saved to /var/cache/conftool/dbconfig/20260401-112125-fceratto.json [11:21:29] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:21:42] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [11:22:51] (03PS2) 10JMeybohm: CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 [11:23:07] (03CR) 10CI reject: [V:04-1] (WIP) fix fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 (owner: 10Effie Mouzeli) [11:23:45] (03CR) 10CI reject: [V:04-1] CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 (owner: 10JMeybohm) [11:27:30] !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:27:52] !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:27:56] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11777326 (10MoritzMuehlenhoff) [11:28:09] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 
'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777327 (10BTullis) [11:29:39] (03CR) 10Dpogorzelski: [C:03+1] ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [11:30:31] (03PS2) 10Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 [11:30:58] (03CR) 10Ilias Sarantopoulos: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [11:32:40] (03PS3) 10Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 [11:33:05] !log installing tomcat10 security updates [11:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:49] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance [11:35:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1229 (T419635)', diff saved to https://phabricator.wikimedia.org/P90162 and previous config saved to /var/cache/conftool/dbconfig/20260401-113556-fceratto.json [11:36:00] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:41:24] (03CR) 10CI reject: [V:04-1] (WIP) fix fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 (owner: 10Effie Mouzeli) [11:42:58] (03PS1) 10Jforrester: Extend queue processing times for abstract fragments [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266219 (https://phabricator.wikimedia.org/T421581) [11:44:41] (03CR) 10Jforrester: "recheck" [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190 (owner: 
10Jforrester) [11:47:04] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2026-03-25-072715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264221 (owner: 10KartikMistry) [11:48:24] !log upgrading Envoy on the idp-test servers to 1.35.9 T419637 T410975 [11:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:31] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [11:48:31] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [11:49:09] (03PS1) 10Kosta Harlan: Revert "SuggestedInvestigations: Import session into signal matching job" [extensions/CheckUser] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266222 (https://phabricator.wikimedia.org/T421062) [11:49:21] (03Merged) 10jenkins-bot: Update cxserver to 2026-03-25-072715-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264221 (owner: 10KartikMistry) [11:49:25] (03PS4) 10Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 [11:51:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T419635)', diff saved to https://phabricator.wikimedia.org/P90164 and previous config saved to /var/cache/conftool/dbconfig/20260401-115114-fceratto.json [11:51:17] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [11:55:52] (03CR) 10Ilias Sarantopoulos: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. 
(032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [11:56:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:57:58] (03PS2) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) [11:58:18] (03PS5) 10Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 [11:58:30] FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [11:58:31] Deploying cxserver.. [11:59:23] (03PS3) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) [12:00:07] (03PS6) 10Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 [12:01:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:01:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P90166 and previous config saved to /var/cache/conftool/dbconfig/20260401-120122-fceratto.json [12:02:10] (03PS1) 10Kosta Harlan: hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266226 (https://phabricator.wikimedia.org/T421678) [12:02:19] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [12:02:20] (03PS1) 10Kosta Harlan: hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266227 (https://phabricator.wikimedia.org/T421678) [12:02:25] (03PS1) 10Gkyziridis: EventStreamConfig: Add rr-multilingual prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) [12:02:53] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:03:10] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777443 (10BTullis) [12:06:26] (03CR) 10CI reject: [V:04-1] hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 
10https://gerrit.wikimedia.org/r/1266226 (https://phabricator.wikimedia.org/T421678) (owner: 10Kosta Harlan) [12:07:03] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011 (10MatthewVernon) 03NEW [12:07:10] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11777457 (10MatthewVernon) p:05Triage→03High [12:07:12] (03CR) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:09:10] (03PS2) 10Kosta Harlan: Revert "SuggestedInvestigations: Import session into signal matching job" [extensions/CheckUser] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266223 (https://phabricator.wikimedia.org/T421062) [12:09:28] (03CR) 10AikoChou: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:11:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P90167 and previous config saved to /var/cache/conftool/dbconfig/20260401-121130-fceratto.json [12:11:32] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [12:11:33] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:12:05] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:12:54] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:13:26] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:13:29] (03CR) 10AikoChou: [C:03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:15:00] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777492 (10BTullis) In terms of manager approvals, @HShaikh... [12:15:18] jouncebot: nowandnext [12:15:18] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [12:15:18] In 0 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1300) [12:15:38] I'd like to start on a few MW backports now, unless there's an objection [12:15:56] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266229 [12:17:00] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet [12:17:06] !log Updated cxserver to 2026-03-25-072715-production [12:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:06] (03PS7) 10Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 [12:21:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T419635)', diff saved to https://phabricator.wikimedia.org/P90168 and previous config saved to /var/cache/conftool/dbconfig/20260401-122138-fceratto.json [12:21:41] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:21:55] ok, I will get started [12:21:56] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance [12:22:04] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1233 (T419635)', diff saved to 
https://phabricator.wikimedia.org/P90169 and previous config saved to /var/cache/conftool/dbconfig/20260401-122203-fceratto.json [12:22:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266223 (https://phabricator.wikimedia.org/T421062) (owner: 10Kosta Harlan) [12:22:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266222 (https://phabricator.wikimedia.org/T421062) (owner: 10Kosta Harlan) [12:24:37] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:25:02] (03Merged) 10jenkins-bot: Revert "SuggestedInvestigations: Import session into signal matching job" [extensions/CheckUser] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266223 (https://phabricator.wikimedia.org/T421062) (owner: 10Kosta Harlan) [12:25:16] (03Merged) 10jenkins-bot: Revert "SuggestedInvestigations: Import session into signal matching job" [extensions/CheckUser] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266222 (https://phabricator.wikimedia.org/T421062) (owner: 10Kosta Harlan) [12:25:59] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1266223|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]], [[gerrit:1266222|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]] [12:28:03] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1266223|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]], [[gerrit:1266222|Revert "SuggestedInvestigations: Import session into signal matching job" 
(T421062)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:28:28] (03PS1) 10Hashar: gerrit: replace ProxyTimeout by ProxyPass ttl [puppet] - 10https://gerrit.wikimedia.org/r/1266231 (https://phabricator.wikimedia.org/T421904) [12:29:20] !log kharlan@deploy1003 kharlan: Continuing with sync [12:29:49] (03CR) 10Kosta Harlan: "recheck" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266226 (https://phabricator.wikimedia.org/T421678) (owner: 10Kosta Harlan) [12:30:47] (03PS8) 10Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 [12:31:22] (03PS8) 10Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - 10https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892) [12:33:34] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266223|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]], [[gerrit:1266222|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]] (duration: 07m 34s) [12:34:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266227 (https://phabricator.wikimedia.org/T421678) (owner: 10Kosta Harlan) [12:34:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266226 (https://phabricator.wikimedia.org/T421678) (owner: 10Kosta Harlan) [12:34:24] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777592 (10BTullis) In the meantime, we will need an SSH key... 
[12:36:16] (03PS1) 10Bartosz Dziewoński: Update 'Location:' header tests for MediaWiki changes [puppet] - 10https://gerrit.wikimedia.org/r/1266232 (https://phabricator.wikimedia.org/T421988) [12:36:40] (03CR) 10Jforrester: REST: Publish ReadingLists v0 module in REST Sandbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic) [12:37:29] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T419635)', diff saved to https://phabricator.wikimedia.org/P90170 and previous config saved to /var/cache/conftool/dbconfig/20260401-123728-fceratto.json [12:37:32] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:37:56] jnuche: hi. sorry for breaking the tests, i had no idea we're testing this. do we need to do anything special to deploy this? given that there's now a mutual dependency between the puppet patch i just wrote and the train D: [12:39:04] jnuche: i could submit a separate patch to remove these test cases first, then we roll out the train, then we add them back with corrections. let me know if that would be useful [12:40:06] MatmaRex: no worries, thanks for looking into it. Presumably once your puppet patch gets merged it will be eventually applied to the deploy server? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1266232 [12:40:18] at which point we should be able to deploy the train [12:40:46] I'll be done backporting my patches in ~10-15 minutes btw [12:40:58] sure, that's fine by me if that works [12:41:16] kostajh: ack, thx [12:41:26] jnuche: my concern is that deploying the puppet patch will result in test failures too, until we also deploy the train. you're saying that's okay? 
[12:43:04] MatmaRex: maybe I'm missing something, but my understanding is that the workflow is: 1) We merge your patch 2) `puppet run` runs on the box every 30m and applies the test changes 3) Train can now continue [12:43:05] (03CR) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:43:19] the deployment tooling is not involved in that workflow [12:43:32] we can always modify the tests on disk by hand if we don't want to wait for puppet to run [12:44:11] jnuche: okay, cool. that makes sense to me, i'm just not very familiar with the workflow. please ship it at your leisure :) [12:44:56] (03PS9) 10Effie Mouzeli: Update fixtures and remove mw-parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 (https://phabricator.wikimedia.org/T420468) [12:45:34] MatmaRex: well, now we need someone with +2 for the puppet repo :D [12:46:19] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777636 (10BTullis) I will manually add @RThomas-WMF to the... 
[12:46:53] (03Merged) 10jenkins-bot: hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266227 (https://phabricator.wikimedia.org/T421678) (owner: 10Kosta Harlan) [12:46:55] (03Merged) 10jenkins-bot: hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266226 (https://phabricator.wikimedia.org/T421678) (owner: 10Kosta Harlan) [12:47:10] 10SRE-Access-Requests: Yubikey-SSH-FIDO for Tiziano Fogli (tappof / BACKUP) - https://phabricator.wikimedia.org/T422020 (10tappof) 03NEW [12:47:26] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1266227|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]], [[gerrit:1266226|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]] [12:47:28] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: Add rr-multilingual prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:47:29] T421678: hCaptcha: Retry SiteVerify API requests when http error occurs - https://phabricator.wikimedia.org/T421678 [12:47:37] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P90171 and previous config saved to /var/cache/conftool/dbconfig/20260401-124736-fceratto.json [12:48:10] (03CR) 10Majavah: [C:03+2] Update 'Location:' header tests for MediaWiki changes [puppet] - 10https://gerrit.wikimedia.org/r/1266232 (https://phabricator.wikimedia.org/T421988) (owner: 10Bartosz Dziewoński) [12:48:34] (03PS1) 10Tiziano Fogli: ssh: FIDO Backup key for Tiziano Fogli [puppet] - 10https://gerrit.wikimedia.org/r/1266234 (https://phabricator.wikimedia.org/T422020) [12:48:49] (03CR) 10Gkyziridis: "Lets wait for Ottomata to review it as well, and then I 
will schedule a deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [12:49:00] (03PS2) 10Hashar: gerrit: replace ProxyTimeout by ProxyPass ttl [puppet] - 10https://gerrit.wikimedia.org/r/1266231 (https://phabricator.wikimedia.org/T246763) [12:49:24] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1266227|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]], [[gerrit:1266226|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:49:41] taavi just mered the patch, thanks a lot [12:49:52] s/mered/merged/ [12:50:01] thanks [12:51:02] (03PS1) 10Btullis: Record LDAP membership of the wmf group for renilthomas [puppet] - 10https://gerrit.wikimedia.org/r/1266235 (https://phabricator.wikimedia.org/T421214) [12:51:05] (03CR) 10Brouberol: [C:03+1] "Confirmed key out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1265675 (owner: 10Dr0ptp4kt) [12:51:11] (03CR) 10Brouberol: [C:03+2] Update deployment key for dr0ptp4kt [puppet] - 10https://gerrit.wikimedia.org/r/1265675 (owner: 10Dr0ptp4kt) [12:52:30] (03PS1) 10Muehlenhoff: use_linux612_on_bookworm: Bump kernel to 6.12.74 [puppet] - 10https://gerrit.wikimedia.org/r/1266236 [12:52:34] !log kharlan@deploy1003 kharlan: Continuing with sync [12:53:07] (03PS1) 10Daniel Kinzler: rest gateway: defined authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) [12:53:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1266235 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis) [12:53:21] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' 
Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777722 (10BTullis) [12:53:42] jnuche: MatmaRex: and deployed [12:53:47] (03CR) 10Btullis: [C:03+2] Record LDAP membership of the wmf group for renilthomas [puppet] - 10https://gerrit.wikimedia.org/r/1266235 (https://phabricator.wikimedia.org/T421214) (owner: 10Btullis) [12:54:28] taavi: thanks once more! [12:54:47] I can see the changes on the disk [12:54:51] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777742 (10BTullis) [12:55:08] kostajh: please ping me once you're done with your backports [12:55:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1004.wikimedia.org [12:56:31] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1266236 (owner: 10Muehlenhoff) [12:56:36] jnuche: will do [12:56:41] (03CR) 10Muehlenhoff: "This is the kernel running on dse-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/1266236 (owner: 10Muehlenhoff) [12:56:47] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266227|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]], [[gerrit:1266226|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]] (duration: 09m 21s) [12:56:49] T421678: hCaptcha: Retry SiteVerify API requests when http error occurs - https://phabricator.wikimedia.org/T421678 [12:56:53] jnuche: done [12:57:04] kostajh: ty [12:57:23] jouncebot: nowandnext [12:57:23] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [12:57:23] In 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1300) [12:57:45] !log fceratto@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P90173 and previous config saved to /var/cache/conftool/dbconfig/20260401-125744-fceratto.json [12:57:47] alright, train rolling out again [12:57:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1266234 (https://phabricator.wikimedia.org/T422020) (owner: 10Tiziano Fogli) [12:58:04] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266240 (https://phabricator.wikimedia.org/T420480) [12:58:07] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266240 (https://phabricator.wikimedia.org/T420480) (owner: 10TrainBranchBot) [12:58:26] (03CR) 10Tiziano Fogli: [C:03+2] ssh: FIDO Backup key for Tiziano Fogli [puppet] - 10https://gerrit.wikimedia.org/r/1266234 (https://phabricator.wikimedia.org/T422020) (owner: 10Tiziano Fogli) [12:58:30] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [12:58:45] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777753 (10KMontalva-WMF) Thanks @BTul... [12:59:11] (03CR) 10BPirkle: [C:03+1] REST: Publish ReadingLists v0 module in REST Sandbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic) [12:59:53] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266240 (https://phabricator.wikimedia.org/T420480) (owner: 10TrainBranchBot) [13:00:01] (03CR) 10Btullis: [C:03+1] "Thanks. 
Yep, we will schedule a rolling reboot of both clusters." [puppet] - 10https://gerrit.wikimedia.org/r/1266236 (owner: 10Muehlenhoff) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1004.wikimedia.org [13:00:16] (03CR) 10Muehlenhoff: [C:03+2] use_linux612_on_bookworm: Bump kernel to 6.12.74 [puppet] - 10https://gerrit.wikimedia.org/r/1266236 (owner: 10Muehlenhoff) [13:01:09] (03CR) 10AikoChou: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [13:02:25] FIRING: SystemdUnitFailed: prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:02:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:04:36] (03CR) 10Bartosz Dziewoński: [C:03+1] rest gateway: defined authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [13:05:18] !log jmm@cumin2002 
START - Cookbook sre.hosts.reboot-single for host urldownloader2004.wikimedia.org [13:06:34] (03PS1) 10Muehlenhoff: Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/1266242 [13:06:39] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.22 refs T420480 [13:06:42] T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480 [13:06:47] o/ [13:06:55] * Lucas_WMDE also sees nothing to deploy [13:07:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:07:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:07:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:07:54] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T419635)', diff saved to https://phabricator.wikimedia.org/P90174 and previous config saved to /var/cache/conftool/dbconfig/20260401-130753-fceratto.json [13:07:57] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:07:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [13:09:39] (03CR) 10Arnaudb: [C:03+2] gerrit: replace ProxyTimeout by ProxyPass ttl [puppet] - 10https://gerrit.wikimedia.org/r/1266231 (https://phabricator.wikimedia.org/T246763) (owner: 10Hashar) [13:09:46] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host urldownloader2004.wikimedia.org [13:11:26] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1262060 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [13:12:25] FIRING: [2x] SystemdUnitFailed: prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [13:16:19] (03PS1) 10Muehlenhoff: Use use_linux612_on_bookworm for ml-lab role [puppet] - 10https://gerrit.wikimedia.org/r/1266244 [13:19:52] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru - 3.2 upgrade (T421402) [13:19:53] (03PS1) 10Eevans: cassandra_dev: add media_analytics role & grants [puppet] - 10https://gerrit.wikimedia.org/r/1266247 (https://phabricator.wikimedia.org/T420008) [13:19:56] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [13:20:55] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru - 3.2 upgrade (T421402) [13:21:03] !log upgrading magru to haproxy 3.2 (T421402) [13:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:22] (03CR) 10Eevans: [C:03+2] cassandra_dev: add media_analytics role & grants [puppet] - 10https://gerrit.wikimedia.org/r/1266247 (https://phabricator.wikimedia.org/T420008) (owner: 10Eevans) [13:21:43] !log fceratto@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db1254.eqiad.wmnet with reason: Maintenance [13:21:49] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1254 (T419635)', diff saved to https://phabricator.wikimedia.org/P90176 and previous config saved to /var/cache/conftool/dbconfig/20260401-132149-fceratto.json [13:21:52] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:22:50] !log purge prometheus-nginx-exporter from url downloaders, remnants of early hcapcha rollout [13:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:42] (03PS3) 10Fabfur: hiera: upgrade haproxy to version 3.2 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1262061 (https://phabricator.wikimedia.org/T421402) [13:23:47] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262061 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [13:23:58] (03PS1) 10Kamila Součková: shellbox-icu72: Add ClusterIP to TLS cert SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266250 (https://phabricator.wikimedia.org/T419274) [13:24:44] (03PS1) 10Arnaudb: gerrit: add Cache-Control for Gitiles with mod_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1266238 (https://phabricator.wikimedia.org/T409422) [13:24:57] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777910 (10E.Enabulele) Hello @BTullis... 
[13:25:48] (03CR) 10Klausman: [C:03+1] Use use_linux612_on_bookworm for ml-lab role [puppet] - 10https://gerrit.wikimedia.org/r/1266244 (owner: 10Muehlenhoff) [13:26:35] (03CR) 10Muehlenhoff: [C:03+2] Use use_linux612_on_bookworm for ml-lab role [puppet] - 10https://gerrit.wikimedia.org/r/1266244 (owner: 10Muehlenhoff) [13:27:25] RESOLVED: [2x] SystemdUnitFailed: prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:28:43] (03CR) 10Eevans: [C:03+2] admin: add FIDO key for eevans (spare) [puppet] - 10https://gerrit.wikimedia.org/r/1265580 (owner: 10Eevans) [13:28:46] (03CR) 10Cathal Mooney: [C:03+2] Nokia: BGP policy for unicast bgp sw_external outside peerings [homer/public] - 10https://gerrit.wikimedia.org/r/1262197 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [13:28:46] (03CR) 10Kamila Součková: "I am very, very, very sorry :')" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266250 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [13:29:40] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:30:03] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:30:12] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:30:27] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:30:30] (03Merged) 10jenkins-bot: Nokia: BGP policy for unicast bgp sw_external outside peerings [homer/public] - 10https://gerrit.wikimedia.org/r/1262197 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) 
[13:31:05] (03CR) 10Cathal Mooney: [C:03+1] Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/1266242 (owner: 10Muehlenhoff) [13:31:36] (03PS1) 10Elukey: sre.hosts.provision: add workaround for root user on X14 supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929) [13:32:09] (03PS1) 10Brouberol: anlytics/hadoop: remove an-worker1148 from the topology [puppet] - 10https://gerrit.wikimedia.org/r/1266259 (https://phabricator.wikimedia.org/T417213) [13:33:20] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:33:25] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:34:14] (03CR) 10Btullis: [C:03+1] anlytics/hadoop: remove an-worker1148 from the topology [puppet] - 10https://gerrit.wikimedia.org/r/1266259 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:35:25] (03CR) 10Brouberol: [C:03+2] anlytics/hadoop: remove an-worker1148 from the topology [puppet] - 10https://gerrit.wikimedia.org/r/1266259 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [13:36:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T419635)', diff saved to https://phabricator.wikimedia.org/P90177 and previous config saved to /var/cache/conftool/dbconfig/20260401-133629-fceratto.json [13:36:34] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [13:41:34] !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [13:41:42] !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [13:42:46] (03PS1) 10Kamila Součková: Temporarily add shellbox-icu ClusterIP endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266264 (https://phabricator.wikimedia.org/T419049) 
[13:43:16] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:44:09] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:45:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11778022 (10elukey) The workaround in the last patch needs a spicerack change for ipmi, since we assume the root user: ` Traceback (most recent call... [13:46:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P90178 and previous config saved to /var/cache/conftool/dbconfig/20260401-134638-fceratto.json [13:46:59] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [13:49:49] (03CR) 10JMeybohm: [C:03+2] "Cool, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli) [13:50:25] !log klausman@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ml-lab1002.eqiad.wmnet [13:51:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:51:45] !log klausman@cumin1003 START - Cookbook sre.hosts.remove-downtime for ml-lab1002.eqiad.wmnet [13:51:46] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-lab1002.eqiad.wmnet [13:56:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P90179 and previous config saved to /var/cache/conftool/dbconfig/20260401-135646-fceratto.json [13:57:29] (03CR) 10Jforrester: REST: Publish ReadingLists v0 module in REST Sandbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic) [13:59:03] (03Merged) 10jenkins-bot: Update fixtures and remove mw-parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli) [13:59:12] (03CR) 10Jforrester: [C:03+2] wikifunctions: Slim down staging resources, and fix main staging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester) [13:59:41] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wcqs1003.eqiad.wmnet with OS bullseye [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1400) [14:00:05] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host wcqs1003 [14:00:16] !log bking@cumin2002 START - Cookbook sre.dns.netbox 
[14:00:54] jouncebot: nowandnext [14:00:54] For the next 0 hour(s) and 59 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1400) [14:00:54] In 0 hour(s) and 29 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1430) [14:01:03] (03Merged) 10jenkins-bot: wikifunctions: Slim down staging resources, and fix main staging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester) [14:01:04] Eek, let's get moving. [14:01:15] (03PS3) 10JMeybohm: CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 [14:02:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190 (owner: 10Jforrester) [14:02:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266219 (https://phabricator.wikimedia.org/T421581) (owner: 10Jforrester) [14:02:33] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:02:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:02:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:02:56] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - 
https://phabricator.wikimedia.org/T421214#11778138 (10lanebecker) @BTullis approved for @HShaikh! Thanks. [14:03:01] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:03:02] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11778139 (10AWesterinen) I still have the error, "Service access denied due to missing privileges." I think that I need "wm... [14:03:20] !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker1148.eqiad.wmnet [14:04:09] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wcqs1003 - bking@cumin2002" [14:04:14] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wcqs1003 - bking@cumin2002" [14:04:15] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:04:15] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache wcqs1003.eqiad.wmnet 9.32.64.10.in-addr.arpa 9.0.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:04:19] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wcqs1003.eqiad.wmnet 9.32.64.10.in-addr.arpa 9.0.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:04:20] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wcqs1003 [14:04:21] (03PS6) 10Jforrester: wikifunctions: Bump up orchestrator resources + 2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey) [14:04:48] (03CR) 10Jforrester: [C:03+2] wikifunctions: Bump up orchestrator resources + 
2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey) [14:04:59] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:05:12] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wcqs1003 [14:05:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wcqs1003 [14:05:15] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:05:22] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:05:28] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:06:24] (03PS4) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) [14:06:49] (03Merged) 10jenkins-bot: wikifunctions: Bump up orchestrator resources + 2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey) [14:06:55] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T419635)', diff saved to https://phabricator.wikimedia.org/P90181 and previous config saved to /var/cache/conftool/dbconfig/20260401-140654-fceratto.json [14:06:58] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:07:00] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1259.eqiad.wmnet with reason: Maintenance [14:07:08] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1259 (T419635)', diff saved to https://phabricator.wikimedia.org/P90182 and previous config saved to /var/cache/conftool/dbconfig/20260401-140707-fceratto.json 
[14:07:14] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:07:30] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:07:47] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:07:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:07:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:07:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:07:57] (03Merged) 10jenkins-bot: MemcachedWrapper: Hash key when longer than 250 characters [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190 (owner: 10Jforrester) [14:07:58] (03Merged) 10jenkins-bot: Extend queue processing times for abstract fragments [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266219 (https://phabricator.wikimedia.org/T421581) (owner: 10Jforrester) [14:08:28] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1266190|MemcachedWrapper: Hash key when longer than 250 characters]], [[gerrit:1266219|Extend queue processing times for abstract fragments (T421581)]] [14:08:31] T421581: Abstract Wikipedia is not compatible with new API rate limits - https://phabricator.wikimedia.org/T421581 [14:08:46] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:09:17] !log brouberol@cumin1003 START - Cookbook sre.dns.netbox 
[14:09:49] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:10:26] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1266190|MemcachedWrapper: Hash key when longer than 250 characters]], [[gerrit:1266219|Extend queue processing times for abstract fragments (T421581)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:10:30] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:11:37] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru - 3.2 upgrade (T421402) [14:11:39] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [14:11:44] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru - 3.2 upgrade (T421402) [14:11:51] !log uploaded cumin_6.0.0 to apt.wikimedia.org bookworm-wikimedia,trixie-wikimedia [14:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:29] !log jforrester@deploy1003 jforrester: Continuing with sync [14:12:51] (03PS5) 10Jforrester: wikifunctions: Replace check-wf-services.sh with a Python version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243) [14:12:57] (03CR) 10Jforrester: [C:03+2] wikifunctions: Replace check-wf-services.sh with a Python version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243) (owner: 10Jforrester) [14:13:02] !log brouberol@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1148.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [14:13:19] !log brouberol@cumin1003 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1148.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [14:13:19] !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:13:20] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker1148.eqiad.wmnet [14:13:29] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11778207 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brouberol@cumin1003 for hosts: `an-worker1148.eqiad.wmnet` - an-worker1148... [14:13:40] (03CR) 10Jforrester: "Tested with https://www.wikifunctions.org/wiki/Special:RunFunction?call=%7B%22Z1K1%22%3A%22Z7%22%2C%22Z7K1%22%3A%22Z19661%22%2C%22Z19661K1" [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190 (owner: 10Jforrester) [14:14:39] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-03-25-132409 to 2026-04-01-092119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266272 (https://phabricator.wikimedia.org/T412768) [14:14:53] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-31-162258 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266273 (https://phabricator.wikimedia.org/T413839) [14:15:13] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-03-25-132409 to 2026-04-01-092119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266272 (https://phabricator.wikimedia.org/T412768) (owner: 10Jforrester) [14:15:15] (03Merged) 10jenkins-bot: wikifunctions: Replace check-wf-services.sh with a Python version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243) (owner: 10Jforrester) [14:16:43] !log jforrester@deploy1003 
Finished scap sync-world: Backport for [[gerrit:1266190|MemcachedWrapper: Hash key when longer than 250 characters]], [[gerrit:1266219|Extend queue processing times for abstract fragments (T421581)]] (duration: 08m 14s) [14:16:46] T421581: Abstract Wikipedia is not compatible with new API rate limits - https://phabricator.wikimedia.org/T421581 [14:16:50] (03CR) 10Jforrester: [C:03+2] wikifunctions: Make old Bash check script call the Python one [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261456 (https://phabricator.wikimedia.org/T421243) (owner: 10Jforrester) [14:16:59] (03PS3) 10Jforrester: wikifunctions: Make old Bash check script call the Python one [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261456 (https://phabricator.wikimedia.org/T421243) [14:17:04] (03CR) 10Jforrester: "…" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261456 (https://phabricator.wikimedia.org/T421243) (owner: 10Jforrester) [14:17:12] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-03-25-132409 to 2026-04-01-092119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266272 (https://phabricator.wikimedia.org/T412768) (owner: 10Jforrester) [14:17:18] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11778287 (10BTullis) >>! In T420053#11776109, @AWesterinen wrote: > I believe that the problem is my two different accounts... [14:18:06] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:19:03] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:19:05] (03PS5) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) [14:19:06] (03Merged) 10jenkins-bot: wikifunctions: Make old Bash check script call the Python one [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261456 (https://phabricator.wikimedia.org/T421243) (owner: 10Jforrester) [14:19:36] (03PS6) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) [14:19:59] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:20:56] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:21:06] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:21:47] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:21:58] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11778325 (10BTullis) I have created the kerberos principal. ` btullis@krb1002:~$ sudo manage_principals.py create andreawes... 
[14:22:21] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-31-162258 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266273 (https://phabricator.wikimedia.org/T413839) (owner: 10Jforrester) [14:22:29] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1003.eqiad.wmnet with reason: host reimage [14:22:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T419635)', diff saved to https://phabricator.wikimedia.org/P90184 and previous config saved to /var/cache/conftool/dbconfig/20260401-142231-fceratto.json [14:22:35] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:24:20] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-31-162258 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266273 (https://phabricator.wikimedia.org/T413839) (owner: 10Jforrester) [14:25:54] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:26:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 202271256 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:26:18] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:26:40] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:27:44] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:28:09] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3256960 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:28:10] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:28:18] (03PS1) 10Atsuko: admin/data: promoted atsuko 
to ops [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) [14:28:39] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:28:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1003.eqiad.wmnet with reason: host reimage [14:29:33] (03CR) 10Effie Mouzeli: "woohoo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 (owner: 10JMeybohm) [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1430) [14:32:40] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P90186 and previous config saved to /var/cache/conftool/dbconfig/20260401-143239-fceratto.json [14:36:11] 06SRE, 06Infrastructure-Foundations, 07ci-test-error, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Kubernetes: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values - https://phabricator.wikimedia.org/T421362#11778449 (10JMeybohm) 05Open→03Invalid This might as well have be... [14:37:32] (03CR) 10AikoChou: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [14:37:59] 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11778469 (10brouberol) @Jclark-ctr an-worker1148 is now in decommissioning status (https://netbox.wikimedia.org/dcim/devices/3661/). Over to you, with... 
[14:38:52] (03CR) 10Muehlenhoff: admin/data: promoted atsuko to ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [14:39:39] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [14:40:43] (03PS1) 10Brouberol: anlytics/hadoop: fix typo in the yarn queue mapping [puppet] - 10https://gerrit.wikimedia.org/r/1266282 (https://phabricator.wikimedia.org/T417213) [14:41:38] (03CR) 10Btullis: [C:03+1] anlytics/hadoop: fix typo in the yarn queue mapping [puppet] - 10https://gerrit.wikimedia.org/r/1266282 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [14:41:52] (03CR) 10Ssingh: [C:03+1] hiera: upgrade haproxy to version 3.2 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1262061 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [14:42:48] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P90187 and previous config saved to /var/cache/conftool/dbconfig/20260401-144247-fceratto.json [14:43:25] (03PS7) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) [14:43:48] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1262061 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [14:43:53] (03CR) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [14:44:17] !log upgrading ulsfo to haproxy 3.2 (T421402) [14:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:20] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [14:44:56] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo - 3.2 upgrade (T421402) [14:44:57] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo - 3.2 upgrade (T421402) [14:46:41] (03CR) 10Brouberol: [C:03+2] anlytics/hadoop: fix typo in the yarn queue mapping [puppet] - 10https://gerrit.wikimedia.org/r/1266282 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol) [14:47:53] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker2004.codfw.wmnet [14:48:27] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host db1208.eqiad.wmnet [14:50:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker2004.codfw.wmnet [14:50:43] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker2005.codfw.wmnet [14:52:14] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 34968 [14:52:44] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 34968 [14:52:57] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T419635)', diff saved to https://phabricator.wikimedia.org/P90188 and previous config saved to /var/cache/conftool/dbconfig/20260401-145256-fceratto.json [14:52:59] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [14:53:01] (03PS1) 10Majavah: Revert 
"dumps: web: Trust X-Client-IP from edge caches" [puppet] - 10https://gerrit.wikimedia.org/r/1266287 [14:53:13] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:54:23] (03PS1) 10Majavah: Revert "Add dumps-http.discovery.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/1266288 [14:54:24] (03CR) 10JMeybohm: [C:03+2] CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 (owner: 10JMeybohm) [14:55:47] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:55:50] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:56:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker2005.codfw.wmnet [14:57:04] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "dumps: web: Trust X-Client-IP from edge caches" [puppet] - 10https://gerrit.wikimedia.org/r/1266287 (owner: 10Majavah) [14:57:08] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "Add dumps-http.discovery.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/1266288 (owner: 10Majavah) [14:57:39] (03CR) 10Majavah: [C:03+2] Revert "Add dumps-http.discovery.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/1266288 (owner: 10Majavah) [14:57:42] !log taavi@dns1004 START - running authdns-update [14:58:05] (03CR) 10Majavah: [C:03+2] Revert "dumps: web: Trust X-Client-IP from edge caches" [puppet] - 10https://gerrit.wikimedia.org/r/1266287 (owner: 10Majavah) [14:58:21] jouncebot: nowandnext [14:58:21] For the next 0 hour(s) and 1 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1400) [14:58:21] For the next 0 hour(s) and 1 minute(s): Test Kitchen Experiment Deployment Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1430) [14:58:21] In 2 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700) [14:59:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wcqs1003.eqiad.wmnet with OS bullseye [14:59:04] (03PS1) 10Jforrester: Wikifunctions: Switch cache from mcrouter-wikifunctions to special access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266290 (https://phabricator.wikimedia.org/T411807) [14:59:23] !log taavi@dns1004 END - running authdns-update [14:59:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266290 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [15:00:25] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db1208.eqiad.wmnet [15:00:33] (03Merged) 10jenkins-bot: Wikifunctions: Switch cache from mcrouter-wikifunctions to special access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266290 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester) [15:00:53] PROBLEM - MariaDB Replica IO: matomo on db1208 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:00:53] PROBLEM - MariaDB Replica Lag: matomo on db1208 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:00:55] PROBLEM - MariaDB Replica SQL: matomo on db1208 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:00:55] PROBLEM - MariaDB read only matomo on db1208 is CRITICAL: Could not connect to localhost:3351 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:00:55] PROBLEM - mysqld processes on db1208 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:01:00] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1266290|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T411807)]] [15:01:03] T411807: WF memcached service is dc-local but used for dc-global content - https://phabricator.wikimedia.org/T411807 [15:02:55] RECOVERY - mysqld processes on db1208 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:02:59] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1266290|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T411807)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:03:49] Sorry for the blip on db1208. That was me restarting it. [15:03:55] RECOVERY - MariaDB Replica SQL: matomo on db1208 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:03:55] RECOVERY - MariaDB read only matomo on db1208 is OK: Version 10.6.18-MariaDB-log, Uptime 60s, read_only: True, event_scheduler: True, 11.22 QPS, connection latency: 0.032044s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:04:53] RECOVERY - MariaDB Replica IO: matomo on db1208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:04:53] RECOVERY - MariaDB Replica Lag: matomo on db1208 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:07:50] (03CR) 10AikoChou: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [15:08:12] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11778686 (10RThomas-WMF) Thanks @BTullis, here is my pub key... [15:09:19] (03PS1) 10Majavah: P:dumps::distribution::web: Rsync logs from all servers [puppet] - 10https://gerrit.wikimedia.org/r/1266291 (https://phabricator.wikimedia.org/T422042) [15:09:35] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11778690 (10LDlulisa-WMF) Thanks @BTullis! Here is my public... [15:09:42] !log jforrester@deploy1003 jforrester: Continuing with sync [15:10:19] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8365/co" [puppet] - 10https://gerrit.wikimedia.org/r/1266291 (https://phabricator.wikimedia.org/T422042) (owner: 10Majavah) [15:10:58] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [15:11:27] * Lucas_WMDE tries to test https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1218858 on mw-experimental [15:11:39] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [15:11:50] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [15:12:28] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [15:13:04] (03CR) 10Atsuko: "checking with brouberol and btullis about it (maybe we should clean it up)" 
[puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [15:13:54] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266290|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T411807)]] (duration: 12m 53s) [15:13:55] (03PS2) 10Arnaudb: gerrit: update upstream_response_timeout for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1266181 (https://phabricator.wikimedia.org/T421827) [15:13:57] T411807: WF memcached service is dc-local but used for dc-global content - https://phabricator.wikimedia.org/T411807 [15:18:14] (03CR) 10Arnaudb: [C:03+2] gerrit: update upstream_response_timeout for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1266181 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [15:19:10] claime: if you’re around – is there a way to run maintenance scripts on mw-experimental? [15:19:35] (the latest comments in https://phabricator.wikimedia.org/T341560 sound like that might not be possible yet :/) [15:20:55] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T421714, prepare newly-reimaged host) xfer wikidata from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards [15:20:59] T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421714 [15:21:17] ah, sounds like T405688 is the more specific task for it [15:21:18] T405688: Support shell to mw-experimental pod - https://phabricator.wikimedia.org/T405688 [15:21:34] (03Merged) 10jenkins-bot: CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 (owner: 10JMeybohm) [15:21:46] Lucas_WMDE: there's a script in a paste on that one :) [15:22:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T421714, prepare newly-reimaged host) xfer wikidata from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, 
repooling both afterwards [15:22:13] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards [15:22:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards [15:22:35] * Lucas_WMDE tries that script [15:23:38] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards [15:23:39] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards [15:24:07] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards [15:24:45] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:24:48] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:26:15] claime: no mwscript-k8s in that script’s shell either :/ [15:26:18] (also no foreachwiki) [15:26:40] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:26:43] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [15:26:47] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo - 3.2 upgrade (T421402) [15:26:50] T421402: Upgrade HAProxy to version 3.2 - 
https://phabricator.wikimedia.org/T421402 [15:32:17] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo - 3.2 upgrade (T421402) [15:32:27] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [15:32:35] (03CR) 10Btullis: admin/data: promoted atsuko to ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [15:34:13] FIRING: JobUnavailable: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [15:38:37] (left a comment on the task to that effect) [15:38:56] (03PS1) 10Fabfur: cache::haproxy: rename deprecated instructions in haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) [15:39:22] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:42:39] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur) [15:44:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:45:31] (03CR) 
10Btullis: [C:03+2] "Merging on Atsuko's behalf." [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [15:45:41] (03CR) 10Btullis: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko) [15:46:11] (03PS2) 10Fabfur: cache::haproxy: rename deprecated instructions in haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) [15:46:23] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur) [15:47:40] (03CR) 10Vgutierrez: cache::haproxy: rename deprecated instructions in haproxy 3.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur) [15:48:31] (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: rename deprecated instructions in haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur) [15:48:57] (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: rename deprecated instructions in haproxy 3.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur) [15:56:12] (03PS3) 10Fabfur: cache::haproxy: rename deprecated instructions in haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) [15:56:29] (03CR) 10Fabfur: cache::haproxy: rename deprecated instructions in haproxy 3.2 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur) [15:58:25] Lucas_WMDE: foreachwikiindblist is basically `for wiki in $(php /srv/mediawiki/multiversion/bin/expanddblist private); do echo -e "------\n$wiki\n-----"; php /srv/mediawiki/multiversion/MWScript.php Version.php --wiki="$wiki"; done` 
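(Editorial sketch, not part of the channel traffic.) The one-liner at 15:58:25 can be unrolled into a small reusable function. This is a minimal sketch under stated assumptions: `foreach_wiki_in_list` is a hypothetical name, and it reads wiki names from a plain text file (one per line, `#` comments allowed) rather than calling `expanddblist`. It prints the same separator lines around each wiki name and appends `--wiki=<name>` to whatever command is passed in.

```shell
# Hypothetical helper: run a command once per wiki listed in a file.
# Usage: foreach_wiki_in_list LIST_FILE COMMAND [ARGS...]
foreach_wiki_in_list() {
    fwl_list_file=$1
    shift
    # Read the list on fd 3 so the per-wiki command keeps its own stdin.
    while IFS= read -r fwl_wiki <&3 || [ -n "$fwl_wiki" ]; do
        case $fwl_wiki in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        # Same banner as the one-liner in the log.
        printf '%s\n%s\n%s\n' '------' "$fwl_wiki" '-----'
        "$@" --wiki="$fwl_wiki"
    done 3< "$fwl_list_file"
}
```

Assuming the list was first written out with the `expanddblist` call quoted in the log, an invocation equivalent to the one-liner would look like `foreach_wiki_in_list private.txt php /srv/mediawiki/multiversion/MWScript.php Version.php`. Whether this works in the mw-experimental shell is not confirmed by the log.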
[16:00:13] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur) [16:00:16] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur) [16:01:46] jouncebot: nowandnext [16:01:47] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [16:01:47] In 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700) [16:02:09] (03CR) 10Fabfur: [C:03+2] cache::haproxy: rename deprecated instructions in haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur) [16:02:43] (03PS1) 10Urbanecm: Set the default for UserEmailConfirmationUseHTML to true [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266309 (https://phabricator.wikimedia.org/T411147) [16:02:48] (03CR) 10Urbanecm: [C:03+2] Set the default for UserEmailConfirmationUseHTML to true [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266309 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm) [16:03:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: FY2526 Q3:rack/setup/install restbase2039 - https://phabricator.wikimedia.org/T416538#11778935 (10Jhancock.wm) [16:05:04] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: FY2526 Q3:rack/setup/install restbase2039 - https://phabricator.wikimedia.org/T416538#11778957 (10Jhancock.wm) this server is having the issue found in T418929 where we can't add the root user because of hardware changes [16:07:35] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#11778969 (10Jhancock.wm) [16:08:18] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - 
https://phabricator.wikimedia.org/T418931#11778972 (10Jhancock.wm) turns out these servers are also having the same issue as these servers https://phabricator.wikimedia.org/T418929 so got a little to figure out if... [16:09:13] FIRING: [3x] JobUnavailable: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266309 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm) [16:09:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260011 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm) [16:09:55] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:10:38] 06SRE, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Configure dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T421465#11778979 (10Jclark-ctr) [16:10:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11778980 (10BTullis) 05Resolved→03Open a:05VRiley-WMF→03BTullis Hi @VRiley-WMF - Apologies for the delay in getting back to you. We haven't had a c... 
[16:11:41] (03PS1) 10Mmartorana: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266314 (https://phabricator.wikimedia.org/T421366) [16:13:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266314 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [16:13:42] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding conf2007 to codfw - jhancock@cumin2002" [16:13:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding conf2007 to codfw - jhancock@cumin2002" [16:13:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:05] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host conf2007 [16:14:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host conf2007 [16:14:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host conf2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:15:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host conf2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:17:58] (03Merged) 10jenkins-bot: Set the default for UserEmailConfirmationUseHTML to true [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266309 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm) [16:18:14] (03CR) 10Urbanecm: [C:03+2] cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260011 (https://phabricator.wikimedia.org/T411147) (owner: 
10Urbanecm) [16:18:39] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11779025 (10Jhancock.wm) [16:19:11] (03Merged) 10jenkins-bot: cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260011 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm) [16:19:38] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1266309|Set the default for UserEmailConfirmationUseHTML to true (T411147)]], [[gerrit:1260011|cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) (T411147)]] [16:19:41] T411147: Remove emailability code from GrowthExperiments - https://phabricator.wikimedia.org/T411147 [16:19:49] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11779030 (10Jhancock.wm) we're having the issue that was documented in https://phabricator.wikimedia.org/T418929 with these servers. still working... [16:20:34] * Lucas_WMDE is done with mw-experimental btw [16:21:09] urbanecm: thanks, I went with something similar at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1218858/7#message-f8a5d7c9d14007456a8e87d96daa41266b158f3f (though I didn’t know expanddblist is available in the repo so I just got the list from https://noc.wikimedia.org/conf/dblists/all.dblist) [16:21:36] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1266309|Set the default for UserEmailConfirmationUseHTML to true (T411147)]], [[gerrit:1260011|cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) (T411147)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:22:22] (03CR) 10Dzahn: [C:03+1] Update Cumin alias for contint to also cover the spun-off Trixie role [puppet] - 10https://gerrit.wikimedia.org/r/1266215 (owner: 10Muehlenhoff) [16:22:23] jouncebot: nowandnext [16:22:23] No deployments scheduled for the next 0 hour(s) and 37 minute(s) [16:22:23] In 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700) [16:22:33] Anyone after you? [16:22:42] Dreamy_Jazz: probably you? :D [16:22:46] :D [16:23:17] (03PS1) 10Dreamy Jazz: hCaptcha: Add log and counter when all SiteVerify attempts fail [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266316 (https://phabricator.wikimedia.org/T421678) [16:23:30] (03PS1) 10Dreamy Jazz: hCaptcha: Add log and counter when all SiteVerify attempts fail [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266317 (https://phabricator.wikimedia.org/T421678) [16:24:54] !log urbanecm@deploy1003 urbanecm: Continuing with sync [16:25:13] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 139423488 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:26:33] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:26:37] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:27:13] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 58696 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:28:54] (03CR) 10CI reject: [V:04-1] hCaptcha: Add log and counter when all SiteVerify attempts fail [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266317 
(https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz) [16:29:08] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266309|Set the default for UserEmailConfirmationUseHTML to true (T411147)]], [[gerrit:1260011|cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) (T411147)]] (duration: 09m 31s) [16:29:12] T411147: Remove emailability code from GrowthExperiments - https://phabricator.wikimedia.org/T411147 [16:30:05] Dreamy_Jazz: over to you [16:30:08] Thanks [16:30:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266317 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz) [16:30:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266316 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz) [16:32:55] (03PS2) 10Daniel Kinzler: rest gateway: define authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) [16:33:39] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-restart-haproxy rolling restart of HAProxy on A:cp-drmrs - New configuration/test (T421402) [16:33:42] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [16:34:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266317 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz) [16:34:13] FIRING: [3x] JobUnavailable: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266316 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz) [16:34:58] (03PS2) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) [16:36:16] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards [16:36:19] T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421714 [16:37:30] (03PS3) 10Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) [16:39:07] jouncebot: nowandnext [16:39:07] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [16:39:07] In 0 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700) [16:39:13] RESOLVED: [3x] JobUnavailable: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:41:43] (03Merged) 10jenkins-bot: hCaptcha: Add log and counter when all SiteVerify attempts fail [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266317 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz) [16:41:46] (03Merged) 10jenkins-bot: hCaptcha: Add log and counter when all SiteVerify attempts fail 
[extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266316 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz) [16:41:53] (03PS3) 10Daniel Kinzler: rest gateway: define authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) [16:42:13] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1266317|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]], [[gerrit:1266316|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]] [16:42:16] T421678: hCaptcha: Retry SiteVerify API requests when http error occurs - https://phabricator.wikimedia.org/T421678 [16:43:35] 10ops-codfw, 06DC-Ops: Alert for device lsw1-c7-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T422058 (10phaultfinder) 03NEW [16:44:13] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1266317|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]], [[gerrit:1266316|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:49:33] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [16:52:38] 10ops-codfw, 06DC-Ops: Alert for device lsw1-b4-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T422061 (10phaultfinder) 03NEW [16:53:44] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266317|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]], [[gerrit:1266316|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]] (duration: 11m 30s) [16:53:47] T421678: hCaptcha: Retry SiteVerify API requests when http error occurs - https://phabricator.wikimedia.org/T421678 [16:54:35] (03CR) 10Bartosz Dziewoński: [C:03+1] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [16:57:35] jouncebot: nowandnext [16:57:35] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [16:57:35] In 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700) [16:58:45] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [16:59:42] Amir1: i might have a backport [16:59:46] but also infra... [17:00:00] I asked in -sre to see if people are using it [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700) [17:00:08] 06SRE, 10SRE-Access-Requests: Yubikey-SSH-FIDO for Tiziano Fogli (tappof / BACKUP) - https://phabricator.wikimedia.org/T422020#11779328 (10hnowlan) 05Open→03Resolved [17:00:26] urbanecm: I'd say go for it [17:00:38] o/ [17:00:50] Amir1: swfrench-wmf just said they're using it? 
[17:00:57] (I'm happy to wait) [17:01:09] so, I do have some work planned, but as long as it's not l10n update you have planned, please go ahead :) [17:01:09] ah okay, right. So let's wait [17:01:27] (I have a bit of prep to do first that can happen in parallel) [17:02:59] not an i18n update, but it appears to be conflicting [17:03:03] * urbanecm is disappearing [17:03:13] ah, got it [17:03:22] alright, I'll continue with my plans then [17:03:32] (03PS1) 10Snwachukwu: Media Aanlytics Production Image Version Change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266323 (https://phabricator.wikimedia.org/T415202) [17:03:46] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1178657 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [17:03:48] (03CR) 10Scott French: [C:03+2] hieradata: disable and remove unused image-suggestion listener [puppet] - 10https://gerrit.wikimedia.org/r/1178657 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [17:07:18] swfrench-wmf: when you're done, would you mind giving me a ping? Gods of thumbnails will be grateful [17:08:19] Amir1: yes, can do [17:08:20] (they have a god? i should be more afraid of them now...) [17:08:27] :) [17:09:44] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-haproxy (exit_code=0) rolling restart of HAProxy on A:cp-drmrs - New configuration/test (T421402) [17:09:47] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [17:10:20] Yes, it's Lorax the guardian of the trees. 
Looking disapprovingly to all of the CPU cycles being wasted due to cache fragmentation [17:12:35] !log swfrench@deploy1003 Started scap sync-world: helmfile-only deployment to remove unused image-suggestion listener - T368096 [17:12:39] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [17:18:06] !log swfrench@deploy1003 Finished scap sync-world: helmfile-only deployment to remove unused image-suggestion listener - T368096 (duration: 07m 25s) [17:18:09] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [17:21:12] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [17:21:23] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [17:21:29] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:21:38] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:21:59] 06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11779456 (10BCornwall) 05In progress→03Resolved a:03BCornwall [17:23:35] !Deploying Refinery at fa28ad8 for change 1250005 / T415202 Extend mediarequest Cassandra loads with poster/plays for video-requests API [17:23:36] T415202: Introduce a new AQS endpoint to expose video plays - https://phabricator.wikimedia.org/T415202 [17:23:58] Amir1: I think all of mediawiki-touching is complete. all yours! 
[17:24:19] Thank you <3 [17:24:22] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:26:57] (03PS1) 10Ladsgroup: Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266326 (https://phabricator.wikimedia.org/T414805) [17:27:09] (03PS1) 10Ladsgroup: Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266327 (https://phabricator.wikimedia.org/T414805) [17:27:15] (03CR) 10Ladsgroup: [C:03+2] Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266326 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [17:27:18] (03CR) 10Ladsgroup: [C:03+2] Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266327 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [17:30:34] !log ebysans@deploy1003 Started deploy [analytics/refinery@fa28ad8] (hadoop-test): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 TEST [analytics/refinery@fa28ad83] [17:30:39] T415202: Introduce a new AQS endpoint to expose video plays - https://phabricator.wikimedia.org/T415202 [17:31:30] (03CR) 10Scott French: [C:03+2] service: move image-suggestion to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1198575 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [17:31:50] (03CR) 10Mforns: [C:03+1] Media Aanlytics Production Image Version Change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266323 (https://phabricator.wikimedia.org/T415202) 
(owner: 10Snwachukwu) [17:32:25] 06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11779507 (10ssingh) Thanks for taking care of this @BCornwall! [17:32:26] !log ebysans@deploy1003 Finished deploy [analytics/refinery@fa28ad8] (hadoop-test): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 TEST [analytics/refinery@fa28ad83] (duration: 01m 52s) [17:32:33] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye [17:32:52] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cloudelastic1009 [17:33:02] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:33:23] !log ebysans@deploy1003 Started deploy [analytics/refinery@fa28ad8]: Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83] [17:37:38] !log ebysans@deploy1003 Finished deploy [analytics/refinery@fa28ad8]: Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83] (duration: 04m 15s) [17:37:43] T415202: Introduce a new AQS endpoint to expose video plays - https://phabricator.wikimedia.org/T415202 [17:38:01] !log ebysans@deploy1003 Started deploy [analytics/refinery@fa28ad8] (thin): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83] [17:38:41] bking@cumin2002 reimage (PID 3684606) is awaiting input [17:39:55] !log ebysans@deploy1003 Finished deploy [analytics/refinery@fa28ad8] (thin): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83] (duration: 01m 53s) [17:41:20] (03Merged) 10jenkins-bot: Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266326 
(https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [17:41:22] (03Merged) 10jenkins-bot: Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266327 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup) [17:42:29] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cloudelastic1009 - bking@cumin2002" [17:42:34] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cloudelastic1009 - bking@cumin2002" [17:42:34] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:42:35] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cloudelastic1009.eqiad.wmnet 30.32.64.10.in-addr.arpa 0.3.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:42:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudelastic1009.eqiad.wmnet 30.32.64.10.in-addr.arpa 0.3.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:42:39] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1009 [17:43:27] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1009 [17:43:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cloudelastic1009 [17:46:35] (03CR) 10ArielGlenn: rest gateway: define authed-user class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [17:47:59] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1266326|Refix thumb steps for the poster image of videos 
(T414805)]], [[gerrit:1266327|Refix thumb steps for the poster image of videos (T414805)]] [17:48:01] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [17:50:02] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1266326|Refix thumb steps for the poster image of videos (T414805)]], [[gerrit:1266327|Refix thumb steps for the poster image of videos (T414805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:51:16] (03CR) 10Ottomata: [C:03+2] Media Aanlytics Production Image Version Change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266323 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [17:51:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:52:05] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [17:53:18] (03Merged) 10jenkins-bot: Media Aanlytics Production Image Version Change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266323 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu) [17:55:19] 06SRE, 10Infrastructure Security: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#11779615 (10herron) [17:55:42] 06SRE, 10Infrastructure Security: Consider "inner" and "outer" ssh keys to reduce taps through the day - https://phabricator.wikimedia.org/T422068#11779619 (10herron) [17:56:17] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266326|Refix thumb steps for the poster image of videos (T414805)]], [[gerrit:1266327|Refix thumb steps for the poster image of videos (T414805)]] (duration: 08m 18s) [17:56:20] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [17:57:13] PROBLEM - Postgres Replication 
Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 264965128 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:58:13] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3628800 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:59:49] (03PS3) 10Scott French: wmnet: remove image-suggestion k8s ingress CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) [18:01:33] !log Deployed refinery using scap, then deployed onto hdfs [18:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:26] !log aokoth@cumin1003 START - Cookbook sre.vrts.upgrade on VRTS host vrts1003.eqiad.wmnet [18:03:00] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage [18:04:17] !log aokoth@cumin1003 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS host vrts1003.eqiad.wmnet [18:05:56] FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun [18:08:28] 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11779670 (10VRiley-WMF) As requested, I went into supermicro support to create a service ticket with supermicro. It seems that John has created a ticket for th... 
[18:10:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage [18:10:08] !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [18:10:21] !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [18:10:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11779683 (10Jgreen) a:05Jgreen→03Jclark-ctr @Jclark-ctr can you check the network cables for fransw1002? The first network interface doesn't appear to have link. [18:10:46] !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/media-analytics: apply [18:11:00] !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [18:14:01] (03CR) 10ArielGlenn: rest gateway: introduce policy for abstractwiki/wikifunctions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [18:16:01] 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11779697 (10VRiley-WMF) Dell has come back with the following on this ticket "If the issue is configuration or firmware related, the drive should format normally once correct... 
[18:19:27] FYI, I'll continue with some further image-suggestion cleanup in the background (no impact expected) [18:42:27] (03PS3) 10Scott French: image-suggestion: remove service configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) [18:42:36] (03PS3) 10Scott French: deployment_server: absent image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) [18:42:43] (03PS3) 10Scott French: deployment_server: remove absented image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) [18:48:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11779792 (10VRiley-WMF) a:03VRiley-WMF [18:54:22] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:59:07] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1009.eqiad.wmnet with OS bullseye [19:02:05] (03CR) 10Blake: [C:03+1] deployment_server: absent image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [19:02:30] (03CR) 10Blake: [C:03+1] deployment_server: remove absented image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [19:03:55] (03CR) 10Blake: [C:03+1] image-suggestion: remove service configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [19:05:16] 
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 543552448 and 47 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:13:24] (03CR) 10Bartosz Dziewoński: "This resolved T412520." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1263878 (owner: 10Daniel Kinzler) [19:14:16] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 28128 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:31:24] (03PS1) 10Bking: opensearch: handle IP changes for software firewall [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) [19:31:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking) [19:43:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:48:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:49:45] Amir1: are you around? I think something's not quite right with your TMH patches [19:49:58] I filed https://phabricator.wikimedia.org/T422074 [19:50:11] thanks, dancy! 
[19:52:39] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.eqiad.wmnet with OS bullseye [19:53:00] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cloudelastic1010 [19:53:50] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:58:00] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cloudelastic1010 - bking@cumin2002" [19:58:05] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cloudelastic1010 - bking@cumin2002" [19:58:05] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:58:06] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cloudelastic1010.eqiad.wmnet 24.48.64.10.in-addr.arpa 4.2.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:58:09] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudelastic1010.eqiad.wmnet 24.48.64.10.in-addr.arpa 4.2.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:58:10] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1010 [20:00:01] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1010 [20:00:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cloudelastic1010 [20:00:02] (03CR) 10Muehlenhoff: opensearch: handle IP changes for software firewall (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2000). [20:00:04] manfredi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:21] I am here [20:01:01] hi manfredi! do you need a deployer? [20:01:10] yes please [20:01:18] i can deploy for you - 1 sec [20:01:24] thanks! [20:01:45] (03PS2) 10Mmartorana: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266314 (https://phabricator.wikimedia.org/T421366) [20:02:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266314 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [20:03:04] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 281 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1258, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 268, delayed_unassigned_shards [20:03:04] ber_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.74139051332034 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:03:04] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 281 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1258, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 268, delayed_unassigned_shards [20:03:04] 
ber_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.74139051332034 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:03:06] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 280 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1259, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 267, delayed_unassigned_shards [20:03:06] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.80636777128005 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:03:08] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 280 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1259, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 267, delayed_unassigned_shards [20:03:08] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.80636777128005 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:04:27] (03Merged) 10jenkins-bot: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266314 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [20:04:52] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1266314|config: Enable EmailConfirmationBanner on mediawikiwiki (T421366)]] [20:04:55] T421366: Test Kitchen Experiment setup to measure the impact of the 
banner - https://phabricator.wikimedia.org/T421366 [20:06:53] !log cjming@deploy1003 mmartorana, cjming: Backport for [[gerrit:1266314|config: Enable EmailConfirmationBanner on mediawikiwiki (T421366)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:05] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1313, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_t [20:07:05] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.31513970110461 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:07:05] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1313, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_t [20:07:05] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.31513970110461 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:07:07] manfredi: on mwdebug if you want to test [20:07:07] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1313, 
relocating_shards: 1, initializing_shards: 13, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_t [20:07:07] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.31513970110461 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:07:09] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1314, relocating_shards: 1, initializing_shards: 12, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_t [20:07:09] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38011695906432 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:07:21] ok [20:07:28] manfredi: lmk when i can sync - standing by [20:09:18] All good. Go ahead please [20:09:22] great! [20:09:26] !log cjming@deploy1003 mmartorana, cjming: Continuing with sync [20:13:40] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266314|config: Enable EmailConfirmationBanner on mediawikiwiki (T421366)]] (duration: 08m 47s) [20:13:43] T421366: Test Kitchen Experiment setup to measure the impact of the banner - https://phabricator.wikimedia.org/T421366 [20:13:51] manfredi: should be live! 
[20:14:01] (03PS4) 10Daniel Kinzler: rest gateway: define authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) [20:14:13] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:14:15] (03CR) 10Daniel Kinzler: rest gateway: define authed-user class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [20:14:25] (03PS3) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) [20:14:55] cjming: thank you! I appreciate it [20:15:08] np :) [20:16:17] that was it for the queue - i'll hang around for a few minutes in case anyone else shows up for the window [20:19:13] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:19:50] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage [20:21:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:23:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudelastic1010.eqiad.wmnet with reason: host reimage [20:26:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:26:25] (03PS1) 10Eevans: restbase: upgrade to Cassandra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1266387 (https://phabricator.wikimedia.org/T418417) [20:26:28] (03PS1) 10Eevans: aqs: upgrade to Cassandra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1266388 (https://phabricator.wikimedia.org/T418417) [20:26:31] (03PS1) 10Eevans: sessionstore: upgrade to Cassandra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1266389 (https://phabricator.wikimedia.org/T418417) [20:27:22] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266387 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [20:27:28] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266388 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [20:27:33] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266389 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans) [20:29:55] (03PS4) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) [20:31:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:31:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:36:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of 
MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:38:41] (03PS1) 10Ottomata: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216) [20:39:46] (03PS2) 10Ottomata: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216) [20:40:00] (03PS3) 10Ottomata: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216) [20:40:25] (03PS4) 10Ottomata: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216) [20:42:18] (03PS5) 10Ottomata: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216) [20:43:56] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3010.esams.wmnet} and A:liberica [20:44:43] (03CR) 10Ottomata: [C:03+2] mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [20:46:52] (03Merged) 10jenkins-bot: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata) [20:47:47] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3010.esams.wmnet} and A:liberica [20:49:05] (03CR) 10Cwhite: opensearch: handle IP changes for software firewall (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking) [20:49:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:50:24] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [20:50:29] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [20:52:24] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [20:52:28] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [20:53:28] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [20:53:33] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [20:54:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421970#11780185 (10phaultfinder) [20:57:15] (03PS1) 10Ahmon Dancy: buildkitd: Bump buildkit image to wmf-v0.29.0 [puppet] - 10https://gerrit.wikimedia.org/r/1266395 (https://phabricator.wikimedia.org/T415284) [20:58:46] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - 
https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:59:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2100) [21:04:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:07:29] (03PS2) 10Bking: opensearch: handle IP changes for software firewall [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) [21:07:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1010.eqiad.wmnet with OS bullseye [21:08:49] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking) [21:11:50] (03CR) 10Bking: opensearch: handle IP changes for software firewall (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking) [21:14:00] !log Reboot lvs1013, lvs1014, lvs1015, and lvs1017 for kernel updates [21:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:21:25] (03PS1) 10Scott French: 
Only set the thumb step when width is given [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266406 (https://phabricator.wikimedia.org/T422074) [21:21:54] (03PS1) 10Scott French: Only set the thumb step when width is given [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266407 (https://phabricator.wikimedia.org/T422074) [21:23:43] jouncebot: nowandnext [21:23:43] For the next 0 hour(s) and 36 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2100) [21:23:43] In 0 hour(s) and 36 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2200) [21:24:22] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:28:32] FYI, CI permitting, I'll be deploying backports of https://gerrit.wikimedia.org/r/1266382 shortly for T422074 [21:28:33] T422074: PHP Warning: Undefined array key "physicalWidth" - https://phabricator.wikimedia.org/T422074 [21:30:31] (03CR) 10Cwhite: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking) [21:30:38] (03CR) 10Papaul: [C:03+2] Add BGP sessions from mr1-eqiad to cr1/2.eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1265533 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul) [21:32:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266406 (https://phabricator.wikimedia.org/T422074) (owner: 10Scott French) [21:32:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266407 (https://phabricator.wikimedia.org/T422074) (owner: 10Scott French) [21:33:54] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye [21:34:18] (03Merged) 10jenkins-bot: Only set the thumb step when width is given [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266406 (https://phabricator.wikimedia.org/T422074) (owner: 10Scott French) [21:34:27] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs1027 [21:34:34] (03Merged) 10jenkins-bot: Only set the thumb step when width is given [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266407 (https://phabricator.wikimedia.org/T422074) (owner: 10Scott French) [21:34:58] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:35:00] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1266406|Only set the thumb step when width is given (T422074)]], [[gerrit:1266407|Only set the thumb step when width is given (T422074)]] [21:35:03] T422074: PHP Warning: Undefined array key "physicalWidth" - https://phabricator.wikimedia.org/T422074 [21:36:56] !log swfrench@deploy1003 swfrench: Backport for 
[[gerrit:1266406|Only set the thumb step when width is given (T422074)]], [[gerrit:1266407|Only set the thumb step when width is given (T422074)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:38:04] !log swfrench@deploy1003 swfrench: Continuing with sync
[21:38:17] that'll do it
[21:40:18] SRE: my phab-cli test task - https://phabricator.wikimedia.org/T422088#11780341 (jijiki)
[21:40:40] bking@cumin2002 reimage (PID 3772654) is awaiting input
[21:42:15] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266406|Only set the thumb step when width is given (T422074)]], [[gerrit:1266407|Only set the thumb step when width is given (T422074)]] (duration: 07m 15s)
[21:42:18] T422074: PHP Warning: Undefined array key "physicalWidth" - https://phabricator.wikimedia.org/T422074
[21:51:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:53:56] (PS1) Papaul: Remove "emporary replace ospf" to test bgp [homer/public] - https://gerrit.wikimedia.org/r/1266429 (https://phabricator.wikimedia.org/T421238)
[21:57:45] (PS2) Papaul: Remove temporary "replace ospf" to test bgp [homer/public] - https://gerrit.wikimedia.org/r/1266429 (https://phabricator.wikimedia.org/T421238)
[21:59:33] (CR) Clare Ming: [C:-1] "punting on this for now until we think through implications of this some more" [puppet] - https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T408186) (owner: Clare Ming)
[21:59:49] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421970#11780384 (phaultfinder)
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2200)
[22:00:54] (CR) Papaul: [C:+2] Remove temporary "replace ospf" to test bgp [homer/public] - https://gerrit.wikimedia.org/r/1266429 (https://phabricator.wikimedia.org/T421238) (owner: Papaul)
[22:01:03] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs1027 - bking@cumin2002"
[22:01:08] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs1027 - bking@cumin2002"
[22:01:08] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:01:09] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache wdqs1027.eqiad.wmnet 98.32.64.10.in-addr.arpa 8.9.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[22:01:13] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs1027.eqiad.wmnet 98.32.64.10.in-addr.arpa 8.9.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[22:01:13] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1027
[22:03:54] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:04:06] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:05:15] ^^ expected
[22:05:25] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1027
[22:05:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs1027
[22:06:11] FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
[22:07:12] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal-scholarly,name=codfw
[22:09:10] (PS1) Ladsgroup: Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266442 (https://phabricator.wikimedia.org/T421914)
[22:09:24] (PS1) Ladsgroup: Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266443 (https://phabricator.wikimedia.org/T421914)
[22:09:37] jouncebot: nowandnext
[22:09:37] For the next 0 hour(s) and 50 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2200)
[22:09:37] In 7 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600)
[22:09:37] In 7 hour(s) and 50 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600)
[22:09:39] FIRING: CoreBGPDown: Core BGP session down between cr1-eqiad and (2620:0:861:fe04::1) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr1-eqiad:9804&var-bgp_group=Management&var-bgp_neighbor= - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:09:53] (CR) Ladsgroup: [C:+2] Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266442 (https://phabricator.wikimedia.org/T421914) (owner: Ladsgroup)
[22:09:57] (CR) Ladsgroup: [C:+2] Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266443 (https://phabricator.wikimedia.org/T421914) (owner: Ladsgroup)
[22:14:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and (2620:0:861:fe04::1) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:15:01] ^that is me
[22:24:13] (Merged) jenkins-bot: Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266442 (https://phabricator.wikimedia.org/T421914) (owner: Ladsgroup)
[22:24:22] (Merged) jenkins-bot: Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266443 (https://phabricator.wikimedia.org/T421914) (owner: Ladsgroup)
[22:26:24] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[22:27:22] PROBLEM - Host ps1-603-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[22:27:24] PROBLEM - Host ps1-604-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[22:27:32] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1266443|Deferred: Fix function to get virtual domain (T421914 T398709)]], [[gerrit:1266442|Deferred: Fix function to get virtual domain (T421914 T398709)]]
[22:27:36] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[22:27:37] T398709: FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster - https://phabricator.wikimedia.org/T398709
[22:28:52] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[22:29:05] me ^
[22:29:29] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1266443|Deferred: Fix function to get virtual domain (T421914 T398709)]], [[gerrit:1266442|Deferred: Fix function to get virtual domain (T421914 T398709)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:29:39] FIRING: [10x] CoreBGPDown: Core BGP session down between cr1-eqiad and (2620:0:861:fe04::1) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:29:56] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[22:31:00] Amir1: let me know when you are done deploying! thanks in advance!
[22:31:19] sure! almost done
[22:32:08] RECOVERY - Host ps1-603-eqsin is UP: PING OK - Packet loss = 0%, RTA = 254.30 ms
[22:32:08] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 250.63 ms
[22:32:08] RECOVERY - Host ps1-604-eqsin is UP: PING OK - Packet loss = 0%, RTA = 246.82 ms
[22:33:16] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1027.eqiad.wmnet with reason: host reimage
[22:33:54] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 249.39 ms
[22:34:08] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266443|Deferred: Fix function to get virtual domain (T421914 T398709)]], [[gerrit:1266442|Deferred: Fix function to get virtual domain (T421914 T398709)]] (duration: 06m 37s)
[22:34:13] T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[22:34:13] T398709: FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster - https://phabricator.wikimedia.org/T398709
[22:34:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-eqiad and mr1-eqiad (208.80.154.204) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:35:16] Jdlrobson: I'm done. The floor is yours!
[22:38:12] thanks Amir1
[22:38:20] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1265482 (https://phabricator.wikimedia.org/T420348) (owner: LorenMora)
[22:38:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1027.eqiad.wmnet with reason: host reimage
[22:39:12] (Merged) jenkins-bot: Legal Footer Link Deploys [mediawiki-config] - https://gerrit.wikimedia.org/r/1265482 (https://phabricator.wikimedia.org/T420348) (owner: LorenMora)
[22:39:37] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1265482|Legal Footer Link Deploys (T420348)]]
[22:39:40] T420348: Footer link deployments - arwiki, cawiki, fawiki, rowiki, ruwiki, trwiki - https://phabricator.wikimedia.org/T420348
[22:41:37] !log jdlrobson@deploy1003 lmora, jdlrobson: Backport for [[gerrit:1265482|Legal Footer Link Deploys (T420348)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:43:53] !log jdlrobson@deploy1003 lmora, jdlrobson: Continuing with sync
[22:48:02] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265482|Legal Footer Link Deploys (T420348)]] (duration: 08m 25s)
[22:48:06] T420348: Footer link deployments - arwiki, cawiki, fawiki, rowiki, ruwiki, trwiki - https://phabricator.wikimedia.org/T420348
[22:48:11] !log removed unused image-suggestion service in codfw - T368096
[22:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:14] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096
[22:48:15] all done!
[22:58:20] !log removed unused image-suggestion service in eqiad - T368096
[22:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:24] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096
[22:59:03] (PS3) KineticPelagic: REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619)
[23:03:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1027.eqiad.wmnet with OS bullseye
[23:03:01] (PS4) KineticPelagic: REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619)
[23:19:39] RESOLVED: [5x] CoreBGPDown: Core BGP session down between cr1-eqiad and mr1-eqiad (208.80.154.204) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:20:11] (PS1) Papaul: The peering IP's were wrong update all IP's [homer/public] - https://gerrit.wikimedia.org/r/1266476 (https://phabricator.wikimedia.org/T421238)
[23:23:07] (CR) Papaul: [C:+2] The peering IP's were wrong update all IP's [homer/public] - https://gerrit.wikimedia.org/r/1266476 (https://phabricator.wikimedia.org/T421238) (owner: Papaul)
[23:28:31] RESOLVED: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[23:41:57] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1266482
[23:42:00] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1266482 (owner: TrainBranchBot)
[23:54:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1266482 (owner: TrainBranchBot)