[00:00:33] <wikibugs>	 (PS1) Ladsgroup: Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805)
[00:00:50] <wikibugs>	 (PS1) Ladsgroup: util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1265630 (https://phabricator.wikimedia.org/T414805)
[00:00:57] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1265623|LinksUpdate: Consolidate links virtual domains (T421914)]], [[gerrit:1265624|LinksUpdate: Consolidate links virtual domains (T421914)]]
[00:00:57] <wikibugs>	 (CR) CI reject: [V:-1] Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:01:03] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[00:01:14] <wikibugs>	 (PS1) Ladsgroup: util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1265631 (https://phabricator.wikimedia.org/T414805)
[00:01:25] <wikibugs>	 (PS1) Ladsgroup: Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1265632 (https://phabricator.wikimedia.org/T414805)
[00:03:02] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1265623|LinksUpdate: Consolidate links virtual domains (T421914)]], [[gerrit:1265624|LinksUpdate: Consolidate links virtual domains (T421914)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:03:36] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[00:07:47] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265623|LinksUpdate: Consolidate links virtual domains (T421914)]], [[gerrit:1265624|LinksUpdate: Consolidate links virtual domains (T421914)]] (duration: 06m 50s)
[00:07:50] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[00:09:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:09:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:10:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:10:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[00:12:01] <wikibugs>	 (CR) Ladsgroup: [C:+2] util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1265631 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:12:06] <wikibugs>	 (CR) Ladsgroup: [C:+2] util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1265630 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:15:03] <wikibugs>	 (CR) Ladsgroup: [C:+2] Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:15:07] <wikibugs>	 (CR) Ladsgroup: [C:+2] Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1265632 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:15:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:16:16] <wikibugs>	 (CR) Ladsgroup: [C:-1] "requires rebase since it's checked out minified 😞 For later then" [extensions/Popups] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:17:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:22:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:23:18] <wikibugs>	 (Merged) jenkins-bot: util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1265631 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:23:27] <wikibugs>	 (Merged) jenkins-bot: util.js: Allow passing isVectorized to adjustThumbWidthForSteps [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1265630 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:24:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:24:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:25:31] <wikibugs>	 (Merged) jenkins-bot: Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1265632 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:25:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[00:26:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.05% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:27:18] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1265631|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265630|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265632|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]]
[00:27:25] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[00:27:26] <stashbot>	 T411013: Popups should use standard thumbnail sizes - https://phabricator.wikimedia.org/T411013
[00:27:26] <stashbot>	 T421589: Page Previews uses low quality thumbnails - https://phabricator.wikimedia.org/T421589
[00:29:11] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1265631|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265630|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265632|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:32:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:35:49] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[00:37:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:39:59] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265631|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265630|util.js: Allow passing isVectorized to adjustThumbWidthForSteps (T414805 T411013 T421589)]], [[gerrit:1265632|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]] (duration: 12m 40s)
[00:40:05] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[00:40:06] <stashbot>	 T411013: Popups should use standard thumbnail sizes - https://phabricator.wikimedia.org/T411013
[00:40:06] <stashbot>	 T421589: Page Previews uses low quality thumbnails - https://phabricator.wikimedia.org/T421589
[00:42:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 927.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:48:36] <wikibugs>	 (CR) Ladsgroup: [C:+2] Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:49:58] <wikibugs>	 (Merged) jenkins-bot: Pass whether the image is svg to adjustThumbWidthForSteps [extensions/Popups] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1265629 (https://phabricator.wikimedia.org/T414805) (owner: Ladsgroup)
[00:52:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 811.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:53:30] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1265629|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]]
[00:53:37] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[00:53:38] <stashbot>	 T411013: Popups should use standard thumbnail sizes - https://phabricator.wikimedia.org/T411013
[00:53:38] <stashbot>	 T421589: Page Previews uses low quality thumbnails - https://phabricator.wikimedia.org/T421589
[00:54:17] <jinxer-wm>	 RESOLVED: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[00:55:27] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1265629|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:56:26] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[00:56:30] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[00:57:15] <jinxer-wm>	 FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:57:52] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[01:02:05] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265629|Pass whether the image is svg to adjustThumbWidthForSteps (T414805 T411013 T421589)]] (duration: 08m 35s)
[01:02:12] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[01:02:12] <stashbot>	 T411013: Popups should use standard thumbnail sizes - https://phabricator.wikimedia.org/T411013
[01:02:12] <stashbot>	 T421589: Page Previews uses low quality thumbnails - https://phabricator.wikimedia.org/T421589
[01:02:15] <jinxer-wm>	 RESOLVED: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:04:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 971.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:08:15] <jinxer-wm>	 FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:09:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 802.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:11:54] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1265654
[01:11:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1265654 (owner: TrainBranchBot)
[01:12:38] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[01:12:43] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[01:12:57] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s3 on clouddb1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:12:57] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:13:15] <jinxer-wm>	 RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:21:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.1% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:23:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext releases routed via main (k8s) 1.6s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:25:47] <wikibugs>	 (PS1) Krinkle: robots.php: Change Beta Cluster override from prepend to replace [mediawiki-config] - https://gerrit.wikimedia.org/r/1265672
[01:28:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext releases routed via main (k8s) 1.6s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:28:47] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1265654 (owner: TrainBranchBot)
[01:48:33] <wikibugs>	 (PS1) Dr0ptp4kt: Edit modules/admin/data/data.yaml [puppet] - https://gerrit.wikimedia.org/r/1265675
[01:51:33] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[01:51:36] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[01:54:01] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11776109 (AWesterinen) Resolved→Open I believe that the problem is my two different accounts (I am unsure how I e...
[01:58:35] <wikibugs>	 (PS2) Dr0ptp4kt: Edit modules/admin/data/data.yaml [puppet] - https://gerrit.wikimedia.org/r/1265675
[02:00:22] <wikibugs>	 (PS3) Dr0ptp4kt: Update deployment key for dr0ptp4kt [puppet] - https://gerrit.wikimedia.org/r/1265675
[02:00:48] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:05:14] <wikibugs>	 (CR) Ottomata: [C:+2] Update Media-analytics helmfile.d global-staging to use cassandra Staging Hosts. [deployment-charts] - https://gerrit.wikimedia.org/r/1265555 (https://phabricator.wikimedia.org/T415202) (owner: Snwachukwu)
[02:07:11] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 23s)
[02:07:46] <wikibugs>	 (Merged) jenkins-bot: Update Media-analytics helmfile.d global-staging to use cassandra Staging Hosts. [deployment-charts] - https://gerrit.wikimedia.org/r/1265555 (https://phabricator.wikimedia.org/T415202) (owner: Snwachukwu)
[02:09:13] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:38] <logmsgbot>	 !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply
[02:24:43] <logmsgbot>	 !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply
[02:33:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:34:13] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:50:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:55:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:09:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:14:13] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:16] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11776183 (Papaul) @SLyngshede-WMF  thank you very much.
[03:25:45] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11776198 (Papaul)
[04:24:22] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:29:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:09:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:14:13] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:17:19] <wikibugs>	 (CR) ArielGlenn: [C:+1] "Nice cleanup, one typo noted." [deployment-charts] - https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: Bartosz Dziewoński)
[05:17:57] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s3 on clouddb1022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:17:57] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s3 on clouddb1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:26:17] <marostegui>	 !log Drop global_block_whitelist on closed wikis T420525
[05:26:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:20] <stashbot>	 T420525: Drop global_block_whitelist from closed wikis - https://phabricator.wikimedia.org/T420525
[05:30:57] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:33:07] <marostegui>	 !log Drop empty ores_classification and ores_model on closed wikis T420093
[05:33:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:10] <stashbot>	 T420093: Drop ORES tables from wikis without ORES - https://phabricator.wikimedia.org/T420093
[05:52:26] <wikibugs>	 (PS1) 1F616EMO: arbcom_zhwiki: Enable SecurePoll without PII rights [mediawiki-config] - https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309)
[05:56:03] <marostegui>	 !log Drop empty tables cusi_case, cusi_user, and cusi_signal on wikis not listed at checkuser-suggested-investigations.dblist  T421353
[05:56:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:06] <stashbot>	 T421353: Drop cusi_case, cusi_signal, and cusi_user tables from wikis where they are unused - https://phabricator.wikimedia.org/T421353
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T0600)
[06:14:59] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1265580 (owner: Eevans)
[06:17:26] <wikibugs>	 (PS1) Marostegui: clouddb1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1265990
[06:17:42] <wikibugs>	 (PS3) Muehlenhoff: ncredir: Switch to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1250517
[06:22:02] <wikibugs>	 (CR) Marostegui: [C:+2] clouddb1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1265990 (owner: Marostegui)
[06:23:24] <logmsgbot>	 ayounsi@cumin1003 reimage (PID 699330) is awaiting input
[06:29:56] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1366.eqiad.wmnet with OS trixie
[06:30:13] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1366
[06:30:55] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[06:34:22] <wikibugs>	 (CR) ArielGlenn: [C:+1] "Looks fine, though I am a bit out of the loop on the precedence of the various classes any more." [deployment-charts] - https://gerrit.wikimedia.org/r/1263878 (owner: Daniel Kinzler)
[06:34:47] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1366 - ayounsi@cumin1003"
[06:34:57] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1366 - ayounsi@cumin1003"
[06:34:57] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:34:57] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1366.eqiad.wmnet 200.48.64.10.in-addr.arpa 0.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[06:35:01] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1366.eqiad.wmnet 200.48.64.10.in-addr.arpa 0.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[06:35:02] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1366
[06:37:12] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[06:37:18] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[06:38:27] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1366
[06:38:27] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1366
[06:45:00] <wikibugs>	 (CR) Ayounsi: [C:+1] Add BGP sessions from mr1-eqiad to cr1/2.eqiad [homer/public] - https://gerrit.wikimedia.org/r/1265533 (https://phabricator.wikimedia.org/T421238) (owner: Papaul)
[06:50:16] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1366.eqiad.wmnet with reason: host reimage
[06:52:23] <jinxer-wm>	 FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[06:54:10] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1366.eqiad.wmnet with reason: host reimage
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T0700). Please do the needful.
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:07:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:08:04] <wikibugs>	 (PS1) Kevin Bazira: ml-services: increase parallel prefilling and concurrent decoding to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266023 (https://phabricator.wikimedia.org/T418350)
[07:09:01] <wikibugs>	 (CR) Elukey: [C:+2] elasticsearch: fix test for non-utc timezones [software/spicerack] - https://gerrit.wikimedia.org/r/1265466 (owner: Elukey)
[07:09:08] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:09:11] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:10:35] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1366.eqiad.wmnet with OS trixie
[07:14:47] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:14:51] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:15:18] <wikibugs>	 (Abandoned) Elukey: First pass of ruff check --fix [software/spicerack] - https://gerrit.wikimedia.org/r/1265476 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[07:15:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1011.eqiad.wmnet
[07:17:08] <wikibugs>	 (PS1) Brouberol: analytics/hadoop: allow fr-tech-users/admins to submi/manage jobs from the production queue [puppet] - https://gerrit.wikimedia.org/r/1266033 (https://phabricator.wikimedia.org/T417213)
[07:19:18] <wikibugs>	 (CR) CI reject: [V:-1] analytics/hadoop: allow fr-tech-users/admins to submi/manage jobs from the production queue [puppet] - https://gerrit.wikimedia.org/r/1266033 (https://phabricator.wikimedia.org/T417213) (owner: Brouberol)
[07:20:46] <wikibugs>	 (PS2) Brouberol: analytics/hadoop: allow fr-tech-users/admins to submi/manage YARN jobs [puppet] - https://gerrit.wikimedia.org/r/1266033 (https://phabricator.wikimedia.org/T417213)
[07:22:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1011.eqiad.wmnet
[07:22:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:24:11] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:24:14] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:26:12] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:26:14] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:26:51] <moritzm>	 !log installing postgresql security updates
[07:26:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:56] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:27:00] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:27:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1012.eqiad.wmnet
[07:31:09] <wikibugs>	 (PS2) Arnaudb: gerrit: update timeouts for gitiles [puppet] - https://gerrit.wikimedia.org/r/1265448 (https://phabricator.wikimedia.org/T421904)
[07:32:46] <wikibugs>	 (PS1) Arnaudb: gerrit: increase packetGitWindowSize [puppet] - https://gerrit.wikimedia.org/r/1266044 (https://phabricator.wikimedia.org/T421904)
[07:34:01] <wikibugs>	 (CR) Slyngshede: [C:+1] bitu: Remove inactive approver [puppet] - https://gerrit.wikimedia.org/r/1265490 (owner: Muehlenhoff)
[07:34:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1012.eqiad.wmnet
[07:34:58] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1367.eqiad.wmnet with OS trixie
[07:35:26] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1367
[07:35:37] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[07:36:01] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:36:05] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[07:39:18] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1367 - ayounsi@cumin1003"
[07:39:23] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1367 - ayounsi@cumin1003"
[07:39:23] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:39:23] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1367.eqiad.wmnet 201.48.64.10.in-addr.arpa 1.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:39:27] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1367.eqiad.wmnet 201.48.64.10.in-addr.arpa 1.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:39:28] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1367
[07:40:03] <wikibugs>	 (CR) Muehlenhoff: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: Dzahn)
[07:40:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 140623440 and 36 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:40:51] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1367
[07:40:51] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1367
[07:41:05] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3656 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:44:46] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11776454 (MPostoronca-WMF) Hi @OKryva-WMF, could you please approve this request? Thank you
[07:46:36] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[07:46:56] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:47:04] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90130 and previous config saved to /var/cache/conftool/dbconfig/20260401-074704-fceratto.json
[07:47:08] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[07:48:27] <wikibugs>	 (CR) JMeybohm: "`envoy_cluster_update*` is 5 series, `envoy_dns*` is 6." [puppet] - https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: JMeybohm)
[07:49:13] <wikibugs>	 (CR) JMeybohm: "If that happens to be too much, we can probably get away with just the `envoy_cluster_update` ones." [puppet] - https://gerrit.wikimedia.org/r/1261485 (https://phabricator.wikimedia.org/T421343) (owner: JMeybohm)
[07:49:29] <wikibugs>	 (PS5) Daniel Kinzler: rest gateway: add support for centralauthtoken [deployment-charts] - https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280)
[07:51:15] <wikibugs>	 (PS2) Arnaudb: gerrit: increase packedGitWindowSize [puppet] - https://gerrit.wikimedia.org/r/1266044 (https://phabricator.wikimedia.org/T421904)
[07:52:36] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1367.eqiad.wmnet with reason: host reimage
[07:54:10] <wikibugs>	 (PS2) ArielGlenn: rest-gateway: add values for auth-newuser rate limiting class for feature patch [deployment-charts] - https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796)
[07:56:42] <wikibugs>	 (PS4) Daniel Kinzler: rest-gateway: Refactor request classification for readability [deployment-charts] - https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: Bartosz Dziewoński)
[07:56:43] <wikibugs>	 (PS3) Daniel Kinzler: rest gateway: rate limiting for InstantCommons [deployment-charts] - https://gerrit.wikimedia.org/r/1263878
[07:57:02] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: Refactor request classification for readability (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: Bartosz Dziewoński)
[07:57:41] <wikibugs>	 (CR) Daniel Kinzler: rest gateway: rate limiting for InstantCommons (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1263878 (owner: Daniel Kinzler)
[07:57:43] <wikibugs>	 (CR) Ozge: [C:+1] ml-services: increase parallel prefilling and concurrent decoding to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266023 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[07:58:41] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: increase parallel prefilling and concurrent decoding to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266023 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[07:59:11] <wikibugs>	 (PS1) Tiziano Fogli: thanos/store: add a scrape target for the ruler instance [puppet] - https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924)
[07:59:34] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1367.eqiad.wmnet with reason: host reimage
[07:59:36] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1262055 (owner: Elukey)
[08:00:04] <jouncebot>	 jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T0800)
[08:00:27] <jnuche>	 hi, the train will be rolling out soon
[08:00:41] <wikibugs>	 (Merged) jenkins-bot: ml-services: increase parallel prefilling and concurrent decoding to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266023 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[08:01:00] <wikibugs>	 (PS2) Tiziano Fogli: thanos/store: add a scrape target for the ruler instance [puppet] - https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924)
[08:01:04] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1088981560 and 90 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:01:36] <wikibugs>	 (CR) Tiziano Fogli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) (owner: Tiziano Fogli)
[08:01:48] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11776495 (MatthewVernon)
[08:03:04] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 137336 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:03:04] <wikibugs>	 (CR) CI reject: [V:-1] thanos/store: add a scrape target for the ruler instance [puppet] - https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) (owner: Tiziano Fogli)
[08:03:13] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:04:18] <wikibugs>	 (PS3) Tiziano Fogli: thanos/store: add a scrape target for the ruler instance [puppet] - https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924)
[08:05:34] <wikibugs>	 (CR) Elukey: [C:+2] profile::base::certificates: rename Puppet Internal CA's path [puppet] - https://gerrit.wikimedia.org/r/1262055 (owner: Elukey)
[08:05:39] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11776499 (MatthewVernon)
[08:06:21] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266075 (https://phabricator.wikimedia.org/T420480)
[08:06:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266075 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
[08:06:45] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90132 and previous config saved to /var/cache/conftool/dbconfig/20260401-080644-fceratto.json
[08:06:48] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:07:00] <moritzm>	 !log upgrading Envoy on the Puppet servers to 1.35.9 T419637 T410975
[08:07:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:04] <stashbot>	 T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637
[08:07:04] <stashbot>	 T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
[08:07:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:07:18] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266075 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
[08:10:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1013.eqiad.wmnet
[08:11:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421704#11776508 (MLechvien-WMF)
[08:12:58] <wikibugs>	 (PS6) Daniel Kinzler: rest gateway: add second Lua filter for header handling [deployment-charts] - https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969)
[08:13:13] <wikibugs>	 (CR) Daniel Kinzler: rest gateway: add second Lua filter for header handling (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) (owner: Daniel Kinzler)
[08:13:19] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11776514 (jcrespo) > we're now asking service owners to re-image their existing baremetal servers  We don't reimage backups hosts....
[08:14:18] <wikibugs>	 (PS1) TrainBranchBot: group0 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266136 (https://phabricator.wikimedia.org/T420480)
[08:14:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266136 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
[08:14:42] <wikibugs>	 (PS1) MVernon: swift: drain 3 eqiad backends for reimage to per-rack VLAN [puppet] - https://gerrit.wikimedia.org/r/1266138 (https://phabricator.wikimedia.org/T421719)
[08:15:13] <wikibugs>	 (Merged) jenkins-bot: group0 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266136 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
[08:16:11] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1367.eqiad.wmnet with OS trixie
[08:16:21] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11776521 (OKryva-WMF) Approve.
[08:16:53] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P90134 and previous config saved to /var/cache/conftool/dbconfig/20260401-081652-fceratto.json
[08:17:13] <wikibugs>	 (PS3) Arnaudb: gerrit: update timeouts for gitiles [puppet] - https://gerrit.wikimedia.org/r/1265448 (https://phabricator.wikimedia.org/T421904)
[08:17:26] <wikibugs>	 (PS3) Arnaudb: gerrit: increase packedGitWindowSize [puppet] - https://gerrit.wikimedia.org/r/1266044 (https://phabricator.wikimedia.org/T421904)
[08:18:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1013.eqiad.wmnet
[08:20:53] <wikibugs>	 (CR) Jcrespo: [C:+1] swift: drain 3 eqiad backends for reimage to per-rack VLAN [puppet] - https://gerrit.wikimedia.org/r/1266138 (https://phabricator.wikimedia.org/T421719) (owner: MVernon)
[08:21:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1014.eqiad.wmnet
[08:21:35] <wikibugs>	 (CR) MVernon: [C:+2] swift: drain 3 eqiad backends for reimage to per-rack VLAN [puppet] - https://gerrit.wikimedia.org/r/1266138 (https://phabricator.wikimedia.org/T421719) (owner: MVernon)
[08:21:38] <wikibugs>	 (PS3) Daniel Kinzler: rest-gateway: add values for auth-newuser rate limiting class for feature patch [deployment-charts] - https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: ArielGlenn)
[08:21:41] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.22  refs T420480
[08:21:43] <stashbot>	 T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480
[08:22:32] <wikibugs>	 (PS4) Daniel Kinzler: rest-gateway: add values for auth-newuser rate limiting class for feature patch [deployment-charts] - https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: ArielGlenn)
[08:23:02] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: add values for auth-newuser rate limiting class for feature patch (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: ArielGlenn)
[08:23:56] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1368.eqiad.wmnet with OS trixie
[08:24:23] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1368
[08:24:37] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:25:18] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:27:01] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P90135 and previous config saved to /var/cache/conftool/dbconfig/20260401-082701-fceratto.json
[08:27:26] <logmsgbot>	 ayounsi@cumin1003 reimage (PID 744305) is awaiting input
[08:28:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1014.eqiad.wmnet
[08:28:42] <wikibugs>	 (PS6) Daniel Kinzler: rest gateway: add support for centralauthtoken [deployment-charts] - https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280)
[08:29:11] <wikibugs>	 (CR) Btullis: [C:+1] "Nice, thanks." [puppet] - https://gerrit.wikimedia.org/r/1266033 (https://phabricator.wikimedia.org/T417213) (owner: Brouberol)
[08:29:35] <wikibugs>	 (CR) Brouberol: [C:+2] analytics/hadoop: allow fr-tech-users/admins to submi/manage YARN jobs [puppet] - https://gerrit.wikimedia.org/r/1266033 (https://phabricator.wikimedia.org/T417213) (owner: Brouberol)
[08:30:24] <wikibugs>	 (PS1) Jcrespo: installserver: Treat any attempt to reimage backup hosts as an error [puppet] - https://gerrit.wikimedia.org/r/1266148 (https://phabricator.wikimedia.org/T420506)
[08:30:41] <wikibugs>	 (PS2) Jcrespo: installserver: Treat any attempt to reimage backup hosts as an error [puppet] - https://gerrit.wikimedia.org/r/1266148 (https://phabricator.wikimedia.org/T420506)
[08:31:53] <wikibugs>	 (PS3) Jcrespo: installserver: Treat any attempt to reimage backup hosts as an error [puppet] - https://gerrit.wikimedia.org/r/1266148 (https://phabricator.wikimedia.org/T420506)
[08:36:24] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986 (MatthewVernon) NEW
[08:36:35] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11776589 (MatthewVernon) p:Triage→High
[08:37:09] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90136 and previous config saved to /var/cache/conftool/dbconfig/20260401-083709-fceratto.json
[08:37:13] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[08:37:26] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[08:37:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T419635)', diff saved to https://phabricator.wikimedia.org/P90137 and previous config saved to /var/cache/conftool/dbconfig/20260401-083733-fceratto.json
[08:38:09] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:38:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:40:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T419635)', diff saved to https://phabricator.wikimedia.org/P90138 and previous config saved to /var/cache/conftool/dbconfig/20260401-084047-fceratto.json
[08:41:17] <wikibugs>	 SRE, SRE-Unowned, Sustainability (Incident Followup): Noise in #wikimedia-operations is making incident response more difficult - https://phabricator.wikimedia.org/T417163#11776614 (MLechvien-WMF)
[08:42:27] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[08:42:30] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[08:43:45] <logmsgbot>	 ayounsi@cumin1003 reimage (PID 744305) is awaiting input
[08:43:46] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-druid1003.eqiad.wmnet with OS bookworm
[08:44:07] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host an-druid1003
[08:44:07] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host an-druid1003
[08:44:17] <moritzm>	 !log installing Apache security updates
[08:44:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:48] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[08:45:51] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[08:49:14] <jnuche>	 PSA: Train currently blocked at: T421988
[08:49:14] <stashbot>	 T421988: Failing deployment checks: URLs in Location header exepcted to be absolute, but relative found - https://phabricator.wikimedia.org/T421988
[08:49:23] <wikibugs>	 SRE, Infrastructure-Foundations, ci-test-error, Data-Platform-SRE (2026-03-27 - 2026-04-17), Kubernetes: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values - https://phabricator.wikimedia.org/T421362#11776667 (MLechvien-WMF) Routing this to #infrastructure-foundations as...
[08:50:09] <wikibugs>	 (CR) MVernon: [C:+1] installserver: Treat any attempt to reimage backup hosts as an error [puppet] - https://gerrit.wikimedia.org/r/1266148 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo)
[08:50:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P90139 and previous config saved to /var/cache/conftool/dbconfig/20260401-085053-fceratto.json
[08:52:25] <wikibugs>	 SRE-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11776677 (BTullis)
[08:52:44] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1368 - ayounsi@cumin1003"
[08:52:49] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1368 - ayounsi@cumin1003"
[08:52:49] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:52:50] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1368.eqiad.wmnet 202.48.64.10.in-addr.arpa 2.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:52:55] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1368.eqiad.wmnet 202.48.64.10.in-addr.arpa 2.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:52:55] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1368
[08:53:24] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1368
[08:53:25] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1368
[08:54:25] <wikibugs>	 (PS4) Tiziano Fogli: thanos/store: add a scrape target for the ruler instance [puppet] - https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924)
[08:54:31] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[08:54:33] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[08:57:11] <wikibugs>	 (CR) Tiziano Fogli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1266067 (https://phabricator.wikimedia.org/T412924) (owner: Tiziano Fogli)
[08:57:54] <Amir1>	 !log  mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 5 skin (T406724)
[08:57:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:58] <stashbot>	 T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724
[09:00:11] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1003.eqiad.wmnet with reason: host reimage
[09:01:02] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P90141 and previous config saved to /var/cache/conftool/dbconfig/20260401-090101-fceratto.json
[09:03:39] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST events) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-dse&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:03:44] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1003.eqiad.wmnet with reason: host reimage
[09:05:33] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1368.eqiad.wmnet with reason: host reimage
[09:05:50] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:05:54] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:08:04] <wikibugs>	 (CR) Muehlenhoff: [C:+2] bitu: Remove inactive approver [puppet] - https://gerrit.wikimedia.org/r/1265490 (owner: Muehlenhoff)
[09:09:28] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1368.eqiad.wmnet with reason: host reimage
[09:11:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T419635)', diff saved to https://phabricator.wikimedia.org/P90142 and previous config saved to /var/cache/conftool/dbconfig/20260401-091109-fceratto.json
[09:11:13] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:11:26] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[09:11:34] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90143 and previous config saved to /var/cache/conftool/dbconfig/20260401-091134-fceratto.json
[09:14:13] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:14:16] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:20:54] <wikibugs>	 10ops-eqiad, 06DC-Ops: Inbound errors on interface cr1-eqiad:ae2 (asw2-b-eqiad:ae1) - https://phabricator.wikimedia.org/T421989 (10phaultfinder) 03NEW
[09:21:49] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:21:52] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:22:05] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11776765 (10Aklapper) Phabricator itself has no influence on other systems. Per https://phabricator.wikimedia.org/p/AWester...
[09:23:27] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[09:26:35] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1368.eqiad.wmnet with OS trixie
[09:27:19] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1369.eqiad.wmnet with OS trixie
[09:27:47] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1369
[09:27:56] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[09:28:56] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90146 and previous config saved to /var/cache/conftool/dbconfig/20260401-092855-fceratto.json
[09:28:59] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:29:25] <wikibugs>	 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11776818 (10hnowlan) Is the UA you've provided the one used by your InstantCommons? The internal recommendation is to use the latest maintenance release of InstantCommons as older versions...
[09:31:02] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1003.eqiad.wmnet with OS bookworm
[09:32:03] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1369 - ayounsi@cumin1003"
[09:32:05] <wikibugs>	 (03Abandoned) 10Arnaudb: gerrit: increase packedGitWindowSize [puppet] - 10https://gerrit.wikimedia.org/r/1266044 (https://phabricator.wikimedia.org/T421904) (owner: 10Arnaudb)
[09:32:09] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1369 - ayounsi@cumin1003"
[09:32:09] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:32:09] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1369.eqiad.wmnet 203.48.64.10.in-addr.arpa 3.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:32:13] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1369.eqiad.wmnet 203.48.64.10.in-addr.arpa 3.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:32:14] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1369
[09:32:36] <wikibugs>	 06SRE: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11776844 (10hnowlan) Upon reviewing our logs, every 429 for Urbipedia that I see is for the user agent `QuickInstantCommons/1.5 MediaWiki/1.39.5; Urbipedia` - addressing this UA will most l...
[09:32:41] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1369
[09:32:41] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1369
[09:32:54] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] installserver: Treat any attempt to reimage backup hosts as an error [puppet] - 10https://gerrit.wikimedia.org/r/1266148 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo)
[09:33:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:39:04] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P90147 and previous config saved to /var/cache/conftool/dbconfig/20260401-093903-fceratto.json
[09:41:01] <wikibugs>	 (03PS1) 10Hnowlan: admin: add mpostoronca to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1266170 (https://phabricator.wikimedia.org/T421471)
[09:42:23] <jinxer-wm>	 RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target   - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[09:42:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265388 (owner: 10Muehlenhoff)
[09:43:39] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST events) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-dse&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:44:25] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1369.eqiad.wmnet with reason: host reimage
[09:45:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1266170 (https://phabricator.wikimedia.org/T421471) (owner: 10Hnowlan)
[09:45:22] <wikibugs>	 (03PS1) 10Marostegui: Revert "clouddb1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1266177
[09:45:56] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] admin: add mpostoronca to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1266170 (https://phabricator.wikimedia.org/T421471) (owner: 10Hnowlan)
[09:47:26] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "clouddb1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1266177 (owner: 10Marostegui)
[09:47:35] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11776929 (10hnowlan) 05Open→03In progress Your access has been added - the change should be live within the next 30 or so minutes.
[09:49:12] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P90148 and previous config saved to /var/cache/conftool/dbconfig/20260401-094912-fceratto.json
[09:50:28] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1369.eqiad.wmnet with reason: host reimage
[09:50:54] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: update upstream_response_timeout for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1266181 (https://phabricator.wikimedia.org/T421827)
[09:51:17] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:53:04] <wikibugs>	 (03PS1) 10JMeybohm: CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185
[09:53:20] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply
[09:54:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 (owner: 10JMeybohm)
[09:54:14] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply
[09:54:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11776945 (10MoritzMuehlenhoff)
[09:54:56] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:55:00] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:55:39] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:55:42] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:57:39] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-druid1004.eqiad.wmnet with OS bookworm
[09:58:03] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.move-vlan for host an-druid1004
[09:58:03] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host an-druid1004
[09:58:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover irc.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1266187
[09:59:20] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T419635)', diff saved to https://phabricator.wikimedia.org/P90149 and previous config saved to /var/cache/conftool/dbconfig/20260401-095920-fceratto.json
[09:59:23] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[09:59:37] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[09:59:44] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90150 and previous config saved to /var/cache/conftool/dbconfig/20260401-095943-fceratto.json
[10:00:05] <jouncebot>	 dusen and effie: May I have your attention please! MediaWiki infrastructure (UTC mid-day). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1000)
[10:03:21] <wikibugs>	 (03PS1) 10Jforrester: MemcachedWrapper: Hash key when longer than 250 characters [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190
[10:03:59] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:04:02] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:06:42] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:06:45] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:06:49] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable tensor_parallel_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266195 (https://phabricator.wikimedia.org/T418350)
[10:06:56] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1369.eqiad.wmnet with OS trixie
[10:06:58] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply
[10:08:07] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:08:10] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply
[10:08:10] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:09:07] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: Refactor request classification for readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: 10Bartosz Dziewoński)
[10:09:12] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: rate limiting for InstantCommons [deployment-charts] - 10https://gerrit.wikimedia.org/r/1263878 (owner: 10Daniel Kinzler)
[10:09:16] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "Unpopular opinion: We should shut our IRC service down 😄" [dns] - 10https://gerrit.wikimedia.org/r/1266187 (owner: 10Muehlenhoff)
[10:09:16] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: add second Lua filter for header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler)
[10:09:20] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply
[10:10:06] <wikibugs>	 (03PS5) 10Daniel Kinzler: rest-gateway: add values for new rate limiting class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn)
[10:10:18] <wikibugs>	 (03PS6) 10Daniel Kinzler: rest-gateway: add values for new rate limiting classes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn)
[10:10:22] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: add values for new rate limiting classes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn)
[10:10:30] <wikibugs>	 (03CR) 10Trueg: [C:03+2] wdqs-queryhammer: Deployment fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[10:10:51] <wikibugs>	 (03CR) 10Muehlenhoff: "Noted, but not in scope for the current reboot :-)" [dns] - 10https://gerrit.wikimedia.org/r/1266187 (owner: 10Muehlenhoff)
[10:10:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Failover irc.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1266187 (owner: 10Muehlenhoff)
[10:11:16] <wikibugs>	 (03CR) 10Ozge: [C:03+1] ml-services: update gpt isvc image to one that supports configurable tensor_parallel_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266195 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[10:11:26] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: Refactor request classification for readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (https://phabricator.wikimedia.org/T419796) (owner: 10Bartosz Dziewoński)
[10:11:30] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[10:11:38] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: rate limiting for InstantCommons [deployment-charts] - 10https://gerrit.wikimedia.org/r/1263878 (owner: 10Daniel Kinzler)
[10:11:53] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: add second Lua filter for header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler)
[10:12:23] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[10:12:35] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: add values for new rate limiting classes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn)
[10:12:45] <wikibugs>	 (03Merged) 10jenkins-bot: wdqs-queryhammer: Deployment fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[10:13:08] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[10:13:53] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1004.eqiad.wmnet with reason: host reimage
[10:16:43] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264257 (owner: 10PipelineBot)
[10:17:59] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90151 and previous config saved to /var/cache/conftool/dbconfig/20260401-101758-fceratto.json
[10:18:02] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[10:18:02] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:18:06] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:18:40] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1264257 (owner: 10PipelineBot)
[10:19:01] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1370.eqiad.wmnet with OS trixie
[10:19:31] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1004.eqiad.wmnet with reason: host reimage
[10:19:32] <wikibugs>	 (03PS7) 10Daniel Kinzler: rest gateway: add support for centralauthtoken [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280)
[10:19:40] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1370
[10:19:47] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[10:20:42] <wikibugs>	 06SRE, 06ServiceOps new, 07Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? - https://phabricator.wikimedia.org/T421168#11777087 (10MLechvien-WMF) 05Open→03Declined We discussed with the team and don't see a link between the DC switchover a...
[10:21:54] <wikibugs>	 (03PS1) 10Daniel Kinzler: rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266204
[10:22:06] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266204 (owner: 10Daniel Kinzler)
[10:24:22] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1370 - ayounsi@cumin1003"
[10:24:27] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1370 - ayounsi@cumin1003"
[10:24:27] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:24:28] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1370.eqiad.wmnet 204.48.64.10.in-addr.arpa 4.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:24:31] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1370.eqiad.wmnet 204.48.64.10.in-addr.arpa 4.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:24:32] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1370
[10:24:37] <wikibugs>	 (03PS2) 10Majavah: P:opensearch::cirrus::test: Convert port to an integer [puppet] - 10https://gerrit.wikimedia.org/r/1260719
[10:24:48] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1370
[10:24:48] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1370
[10:26:32] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:26:36] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:27:34] <wikibugs>	 (03PS8) 10Daniel Kinzler: rest gateway: add support for centralauthtoken [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280)
[10:28:06] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:opensearch::cirrus::test: Convert port to an integer [puppet] - 10https://gerrit.wikimedia.org/r/1260719 (owner: 10Majavah)
[10:28:08] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P90152 and previous config saved to /var/cache/conftool/dbconfig/20260401-102807-fceratto.json
[10:29:26] <duesen>	 ...this is taking a long time to merge...
[10:29:30] <wikibugs>	 (03PS2) 10Majavah: nftables: Fix issues around virtual resource dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1260721
[10:29:30] <wikibugs>	 (03PS14) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089)
[10:29:30] <wikibugs>	 (03PS14) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089)
[10:29:31] <wikibugs>	 (03PS1) 10Majavah: P:base: Make nftables::set resources always defined [puppet] - 10https://gerrit.wikimedia.org/r/1266205
[10:29:41] <duesen>	 i'm still not seeing the new chart on the deployment host
[10:31:05] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that supports configurable tensor_parallel_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266195 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[10:31:22] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah)
[10:32:59] <duesen>	 Raine: looks like Zuul is stuck, it has been 10 minutes and https://integration.wikimedia.org/zuul/?#q=1266204 says "queued"...
[10:33:35] <Raine>	 fascinating
[10:33:41] <Raine>	 (that's code for "wtf")
[10:33:55] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11777176 (10Marostegui) For db* related hosts (including pc*, es* and dbproxy*) will be tricky as this also requires changi...
[10:34:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1266210
[10:35:39] <duesen>	 Raine: it also says: "Queue lengths: 0 events, 0 results."
[10:35:46] <duesen>	 I'll try and re-trigger.
[10:36:21] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266204 (owner: 10Daniel Kinzler)
[10:36:22] <p858snake|cloud>	 zuul can get jammed up sometimes iirc, you might just need to ping releng to get someone to look
[10:36:38] <Raine>	 yeah, exactly, ping #wikimedia-releng
[10:36:48] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1370.eqiad.wmnet with reason: host reimage
[10:38:12] <duesen>	 *sigh*
[10:38:17] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P90154 and previous config saved to /var/cache/conftool/dbconfig/20260401-103816-fceratto.json
[10:38:25] <duesen>	 I actually need to get this done including testing in the next 60 minutes...
[10:38:42] <duesen>	 or we have to revert the four patches
[10:38:46] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266204 (owner: 10Daniel Kinzler)
[10:39:30] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that supports configurable tensor_parallel_size flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266195 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[10:39:51] <duesen>	 oh! oh! it went through!
[10:40:34] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1370.eqiad.wmnet with reason: host reimage
[10:40:45] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:40:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1003.wikimedia.org
[10:41:00] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1266210 (owner: 10Muehlenhoff)
[10:41:59] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:44:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1003.wikimedia.org
[10:46:35] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:47:06] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:47:09] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1004.eqiad.wmnet with OS bookworm
[10:47:34] <moritzm>	 !log installing libpng1.6 security updates on Trixie/Bookworm
[10:47:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:35] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:47:39] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:48:19] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:48:24] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T419635)', diff saved to https://phabricator.wikimedia.org/P90156 and previous config saved to /var/cache/conftool/dbconfig/20260401-104823-fceratto.json
[10:48:27] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[10:48:39] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[10:48:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1197 (T419635)', diff saved to https://phabricator.wikimedia.org/P90157 and previous config saved to /var/cache/conftool/dbconfig/20260401-104847-fceratto.json
[10:49:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] MemcachedWrapper: Hash key when longer than 250 characters [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190 (owner: 10Jforrester)
[10:49:51] <wikibugs>	 (03PS1) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892)
[10:51:01] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T419635)', diff saved to https://phabricator.wikimedia.org/P90158 and previous config saved to /var/cache/conftool/dbconfig/20260401-105059-fceratto.json
[10:51:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 70775336 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:52:05] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3782224 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:52:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1266210 (owner: 10Muehlenhoff)
[10:55:39] <wikibugs>	 (03PS1) 10Blake: mw-web: downsize for multi-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974)
[10:56:02] <wikibugs>	 (03PS3) 10Majavah: nftables: Fix issues around virtual resource dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1260721
[10:56:02] <wikibugs>	 (03PS2) 10Majavah: P:base: Make nftables::set resources always defined [puppet] - 10https://gerrit.wikimedia.org/r/1266205
[10:56:02] <wikibugs>	 (03PS15) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089)
[10:56:03] <wikibugs>	 (03PS15) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089)
[10:56:36] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260721 (owner: 10Majavah)
[10:57:00] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1370.eqiad.wmnet with OS trixie
[10:57:00] <wikibugs>	 (PS2) Blake: mw-web: downsize for multi-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974)
[10:57:11] <wikibugs>	 (PS3) Blake: mw-web: downsize for multi-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1266213 (https://phabricator.wikimedia.org/T413974)
[10:58:02] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:58:45] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:58:54] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] rest gateway: add support for centralauthtoken (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280) (owner: Daniel Kinzler)
[11:00:05] <jouncebot>	 mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1100).
[11:01:10] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P90159 and previous config saved to /var/cache/conftool/dbconfig/20260401-110109-fceratto.json
[11:01:12] <wikibugs>	 (Merged) jenkins-bot: rest gateway: add support for centralauthtoken [deployment-charts] - https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280) (owner: Daniel Kinzler)
[11:05:51] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:06:30] <logmsgbot>	 !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:07:24] <wikibugs>	 (PS1) Muehlenhoff: Update Cumin alias for contint to also cover the spun-off Trixie role [puppet] - https://gerrit.wikimedia.org/r/1266215
[11:09:46] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply
[11:10:08] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:11:18] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P90160 and previous config saved to /var/cache/conftool/dbconfig/20260401-111117-fceratto.json
[11:11:44] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply
[11:12:12] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[11:14:34] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777268 (BTullis) OK, thanks for all of the input so far....
[11:15:46] <wikibugs>	 (PS1) Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1266216
[11:16:35] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:17:14] <logmsgbot>	 !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:17:58] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:18:31] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:21:26] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T419635)', diff saved to https://phabricator.wikimedia.org/P90161 and previous config saved to /var/cache/conftool/dbconfig/20260401-112125-fceratto.json
[11:21:29] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:21:42] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[11:22:51] <wikibugs>	 (PS2) JMeybohm: CI: Send User-Agent when fetching data from gitiles [deployment-charts] - https://gerrit.wikimedia.org/r/1266185
[11:23:07] <wikibugs>	 (CR) CI reject: [V:-1] (WIP) fix fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1266216 (owner: Effie Mouzeli)
[11:23:45] <wikibugs>	 (CR) CI reject: [V:-1] CI: Send User-Agent when fetching data from gitiles [deployment-charts] - https://gerrit.wikimedia.org/r/1266185 (owner: JMeybohm)
[11:27:30] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:27:52] <logmsgbot>	 !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:27:56] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11777326 (MoritzMuehlenhoff)
[11:28:09] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777327 (BTullis)
[11:29:39] <wikibugs>	 (CR) Dpogorzelski: [C:+1] ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[11:30:31] <wikibugs>	 (PS2) Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1266216
[11:30:58] <wikibugs>	 (CR) Ilias Sarantopoulos: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[11:32:40] <wikibugs>	 (PS3) Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1266216
[11:33:05] <moritzm>	 !log installing tomcat10 security updates
[11:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:49] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[11:35:57] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1229 (T419635)', diff saved to https://phabricator.wikimedia.org/P90162 and previous config saved to /var/cache/conftool/dbconfig/20260401-113556-fceratto.json
[11:36:00] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:41:24] <wikibugs>	 (CR) CI reject: [V:-1] (WIP) fix fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1266216 (owner: Effie Mouzeli)
[11:42:58] <wikibugs>	 (PS1) Jforrester: Extend queue processing times for abstract fragments [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266219 (https://phabricator.wikimedia.org/T421581)
[11:44:41] <wikibugs>	 (CR) Jforrester: "recheck" [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266190 (owner: Jforrester)
[11:47:04] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2026-03-25-072715-production [deployment-charts] - https://gerrit.wikimedia.org/r/1264221 (owner: KartikMistry)
[11:48:24] <moritzm>	 !log upgrading Envoy on the idp-test servers to 1.35.9 T419637 T410975
[11:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:31] <stashbot>	 T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637
[11:48:31] <stashbot>	 T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
[11:49:09] <wikibugs>	 (PS1) Kosta Harlan: Revert "SuggestedInvestigations: Import session into signal matching job" [extensions/CheckUser] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266222 (https://phabricator.wikimedia.org/T421062)
[11:49:21] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2026-03-25-072715-production [deployment-charts] - https://gerrit.wikimedia.org/r/1264221 (owner: KartikMistry)
[11:49:25] <wikibugs>	 (PS4) Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1266216
[11:51:15] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T419635)', diff saved to https://phabricator.wikimedia.org/P90164 and previous config saved to /var/cache/conftool/dbconfig/20260401-115114-fceratto.json
[11:51:17] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[11:55:52] <wikibugs>	 (CR) Ilias Sarantopoulos: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[11:56:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:57:58] <wikibugs>	 (PS2) Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892)
[11:58:18] <wikibugs>	 (PS5) Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1266216
[11:58:30] <jinxer-wm>	 FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[11:58:31] <kart_>	 Deploying cxserver..
[11:59:23] <wikibugs>	 (PS3) Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892)
[12:00:07] <wikibugs>	 (PS6) Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1266216
[12:01:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:01:23] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P90166 and previous config saved to /var/cache/conftool/dbconfig/20260401-120122-fceratto.json
[12:02:10] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266226 (https://phabricator.wikimedia.org/T421678)
[12:02:19] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:02:20] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266227 (https://phabricator.wikimedia.org/T421678)
[12:02:25] <wikibugs>	 (PS1) Gkyziridis: EventStreamConfig: Add rr-multilingual prediction_change stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892)
[12:02:53] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[12:03:10] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777443 (BTullis)
[12:06:26] <wikibugs>	 (CR) CI reject: [V:-1] hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266226 (https://phabricator.wikimedia.org/T421678) (owner: Kosta Harlan)
[12:07:03] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011 (MatthewVernon) NEW
[12:07:10] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11777457 (MatthewVernon) p:Triage→High
[12:07:12] <wikibugs>	 (CR) Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[12:09:10] <wikibugs>	 (PS2) Kosta Harlan: Revert "SuggestedInvestigations: Import session into signal matching job" [extensions/CheckUser] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266223 (https://phabricator.wikimedia.org/T421062)
[12:09:28] <wikibugs>	 (CR) AikoChou: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[12:11:31] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P90167 and previous config saved to /var/cache/conftool/dbconfig/20260401-121130-fceratto.json
[12:11:32] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet
[12:11:33] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[12:12:05] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[12:12:54] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[12:13:26] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[12:13:29] <wikibugs>	 (CR) AikoChou: [C:+1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[12:15:00] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777492 (BTullis) In terms of manager approvals, @HShaikh...
[12:15:18] <kostajh>	 jouncebot: nowandnext
[12:15:18] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 44 minute(s)
[12:15:18] <jouncebot>	 In 0 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1300)
[12:15:38] <kostajh>	 I'd like to start on a few MW backports now, unless there's an objection
[12:15:56] <wikibugs>	 (PS1) Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - https://gerrit.wikimedia.org/r/1266229
[12:17:00] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet
[12:17:06] <kart_>	 !log Updated cxserver to 2026-03-25-072715-production
[12:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:06] <wikibugs>	 (PS7) Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1266216
[12:21:39] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T419635)', diff saved to https://phabricator.wikimedia.org/P90168 and previous config saved to /var/cache/conftool/dbconfig/20260401-122138-fceratto.json
[12:21:41] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:21:55] <kostajh>	 ok, I will get started
[12:21:56] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[12:22:04] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1233 (T419635)', diff saved to https://phabricator.wikimedia.org/P90169 and previous config saved to /var/cache/conftool/dbconfig/20260401-122203-fceratto.json
[12:22:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266223 (https://phabricator.wikimedia.org/T421062) (owner: Kosta Harlan)
[12:22:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266222 (https://phabricator.wikimedia.org/T421062) (owner: Kosta Harlan)
[12:24:37] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:25:02] <wikibugs>	 (Merged) jenkins-bot: Revert "SuggestedInvestigations: Import session into signal matching job" [extensions/CheckUser] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266223 (https://phabricator.wikimedia.org/T421062) (owner: Kosta Harlan)
[12:25:16] <wikibugs>	 (Merged) jenkins-bot: Revert "SuggestedInvestigations: Import session into signal matching job" [extensions/CheckUser] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266222 (https://phabricator.wikimedia.org/T421062) (owner: Kosta Harlan)
[12:25:59] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1266223|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]], [[gerrit:1266222|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]]
[12:28:03] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1266223|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]], [[gerrit:1266222|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:28:28] <wikibugs>	 (PS1) Hashar: gerrit: replace ProxyTimeout by ProxyPass ttl [puppet] - https://gerrit.wikimedia.org/r/1266231 (https://phabricator.wikimedia.org/T421904)
[12:29:20] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[12:29:49] <wikibugs>	 (CR) Kosta Harlan: "recheck" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266226 (https://phabricator.wikimedia.org/T421678) (owner: Kosta Harlan)
[12:30:47] <wikibugs>	 (PS8) Effie Mouzeli: (WIP) fix fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1266216
[12:31:22] <wikibugs>	 (PS8) Slyngshede: P:idp Allow enabling of gauth mfa / TOTP [puppet] - https://gerrit.wikimedia.org/r/1254176 (https://phabricator.wikimedia.org/T372892)
[12:33:34] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266223|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]], [[gerrit:1266222|Revert "SuggestedInvestigations: Import session into signal matching job" (T421062)]] (duration: 07m 34s)
[12:34:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266227 (https://phabricator.wikimedia.org/T421678) (owner: Kosta Harlan)
[12:34:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266226 (https://phabricator.wikimedia.org/T421678) (owner: Kosta Harlan)
[12:34:24] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777592 (BTullis) In the meantime, we will need an SSH key...
[12:36:16] <wikibugs>	 (PS1) Bartosz Dziewoński: Update 'Location:' header tests for MediaWiki changes [puppet] - https://gerrit.wikimedia.org/r/1266232 (https://phabricator.wikimedia.org/T421988)
[12:36:40] <wikibugs>	 (CR) Jforrester: REST: Publish ReadingLists v0 module in REST Sandbox (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: KineticPelagic)
[12:37:29] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T419635)', diff saved to https://phabricator.wikimedia.org/P90170 and previous config saved to /var/cache/conftool/dbconfig/20260401-123728-fceratto.json
[12:37:32] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[12:37:56] <MatmaRex>	 jnuche: hi. sorry for breaking the tests, i had no idea we're testing this. do we need to do anything special to deploy this? given that there's now a mutual dependency between the puppet patch i just wrote and the train D:
[12:39:04] <MatmaRex>	 jnuche: i could submit a separate patch to remove these test cases first, then we roll out the train, then we add them back with corrections. let me know if that would be useful
[12:40:06] <jnuche>	 MatmaRex: no worries, thanks for looking into it. Presumably once your puppet patch gets merged it will be eventually applied to the deploy server? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1266232
[12:40:18] <jnuche>	 at which point we should be able to deploy the train
[12:40:46] <kostajh>	 I'll be done backporting my patches in ~10-15 minutes btw
[12:40:58] <MatmaRex>	 sure, that's fine by me if that works
[12:41:16] <jnuche>	 kostajh: ack, thx
[12:41:26] <MatmaRex>	 jnuche: my concern is that deploying the puppet patch will result in test failures too, until we also deploy the train. you're saying that's okay?
[12:43:04] <jnuche>	 MatmaRex: maybe I'm missing something, but my understanding is that the workflow is: 1) We merge your patch 2) `puppet run` runs on the box every 30m and applies the test changes 3) Train can now continue
[12:43:05] <wikibugs>	 (CR) Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[12:43:19] <jnuche>	 the deployment tooling is not involved in that workflow
[12:43:32] <jnuche>	 we can always modify the tests on disk by hand if we don't want to wait for puppet to run
[12:44:11] <MatmaRex>	 jnuche: okay, cool. that makes sense to me, i'm just not very familiar with the workflow. please ship it at your leisure :)
[12:44:56] <wikibugs>	 (PS9) Effie Mouzeli: Update fixtures and remove mw-parsoid [deployment-charts] - https://gerrit.wikimedia.org/r/1266216 (https://phabricator.wikimedia.org/T420468)
[12:45:34] <jnuche>	 MatmaRex: well, now we need someone with +2 for the puppet repo :D
[12:46:19] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777636 (BTullis) I will manually add @RThomas-WMF to the...
[12:46:53] <wikibugs>	 (Merged) jenkins-bot: hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266227 (https://phabricator.wikimedia.org/T421678) (owner: Kosta Harlan)
[12:46:55] <wikibugs>	 (Merged) jenkins-bot: hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1266226 (https://phabricator.wikimedia.org/T421678) (owner: Kosta Harlan)
[12:47:10] <wikibugs>	 SRE-Access-Requests: Yubikey-SSH-FIDO for Tiziano Fogli (tappof / BACKUP) - https://phabricator.wikimedia.org/T422020 (tappof) NEW
[12:47:26] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1266227|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]], [[gerrit:1266226|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]]
[12:47:28] <wikibugs>	 (CR) Ottomata: [C:+1] EventStreamConfig: Add rr-multilingual prediction_change stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[12:47:29] <stashbot>	 T421678: hCaptcha: Retry SiteVerify API requests when http error occurs - https://phabricator.wikimedia.org/T421678
[12:47:37] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P90171 and previous config saved to /var/cache/conftool/dbconfig/20260401-124736-fceratto.json
[12:48:10] <wikibugs>	 (CR) Majavah: [C:+2] Update 'Location:' header tests for MediaWiki changes [puppet] - https://gerrit.wikimedia.org/r/1266232 (https://phabricator.wikimedia.org/T421988) (owner: Bartosz Dziewoński)
[12:48:34] <wikibugs>	 (PS1) Tiziano Fogli: ssh: FIDO Backup key for Tiziano Fogli [puppet] - https://gerrit.wikimedia.org/r/1266234 (https://phabricator.wikimedia.org/T422020)
[12:48:49] <wikibugs>	 (CR) Gkyziridis: "Lets wait for Ottomata to review it as well, and then I will schedule a deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
[12:49:00] <wikibugs>	 (PS2) Hashar: gerrit: replace ProxyTimeout by ProxyPass ttl [puppet] - https://gerrit.wikimedia.org/r/1266231 (https://phabricator.wikimedia.org/T246763)
[12:49:24] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1266227|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]], [[gerrit:1266226|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:49:41] <jnuche>	 taavi just mered the patch, thanks a lot
[12:49:52] <jnuche>	 s/mered/merged/
[12:50:01] <MatmaRex>	 thanks
[12:51:02] <wikibugs>	 (PS1) Btullis: Record LDAP membership of the wmf group for renilthomas [puppet] - https://gerrit.wikimedia.org/r/1266235 (https://phabricator.wikimedia.org/T421214)
[12:51:05] <wikibugs>	 (CR) Brouberol: [C:+1] "Confirmed key out of band" [puppet] - https://gerrit.wikimedia.org/r/1265675 (owner: Dr0ptp4kt)
[12:51:11] <wikibugs>	 (CR) Brouberol: [C:+2] Update deployment key for dr0ptp4kt [puppet] - https://gerrit.wikimedia.org/r/1265675 (owner: Dr0ptp4kt)
[12:52:30] <wikibugs>	 (PS1) Muehlenhoff: use_linux612_on_bookworm: Bump kernel to 6.12.74 [puppet] - https://gerrit.wikimedia.org/r/1266236
[12:52:34] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[12:53:07] <wikibugs>	 (PS1) Daniel Kinzler: rest gateway: defined authed-user class [deployment-charts] - https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280)
[12:53:15] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1266235 (https://phabricator.wikimedia.org/T421214) (owner: Btullis)
[12:53:21] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777722 (BTullis)
[12:53:42] <taavi>	 jnuche: MatmaRex: and deployed
[12:53:47] <wikibugs>	 (CR) Btullis: [C:+2] Record LDAP membership of the wmf group for renilthomas [puppet] - https://gerrit.wikimedia.org/r/1266235 (https://phabricator.wikimedia.org/T421214) (owner: Btullis)
[12:54:28] <jnuche>	 taavi: thanks once more!
[12:54:47] <jnuche>	 I can see the changes on the disk
[12:54:51] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777742 (BTullis)
[12:55:08] <jnuche>	 kostajh: please ping me once you're done with your backports
[12:55:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1004.wikimedia.org
[12:56:31] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/1266236 (owner: Muehlenhoff)
[12:56:36] <kostajh>	 jnuche: will do
[12:56:41] <wikibugs>	 (CR) Muehlenhoff: "This is the kernel running on dse-k8s" [puppet] - https://gerrit.wikimedia.org/r/1266236 (owner: Muehlenhoff)
[12:56:47] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266227|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]], [[gerrit:1266226|hCaptcha: Retry SiteVerify API on HTTP error and adjust timeout (T421678)]] (duration: 09m 21s)
[12:56:49] <stashbot>	 T421678: hCaptcha: Retry SiteVerify API requests when http error occurs - https://phabricator.wikimedia.org/T421678
[12:56:53] <kostajh>	 jnuche: done
[12:57:04] <jnuche>	 kostajh: ty
[12:57:23] <jnuche>	 jouncebot: nowandnext
[12:57:23] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 2 minute(s)
[12:57:23] <jouncebot>	 In 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1300)
[12:57:45] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P90173 and previous config saved to /var/cache/conftool/dbconfig/20260401-125744-fceratto.json
[12:57:47] <jnuche>	 alright, train rolling out again
[12:57:53] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1266234 (https://phabricator.wikimedia.org/T422020) (owner: Tiziano Fogli)
[12:58:04] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266240 (https://phabricator.wikimedia.org/T420480)
[12:58:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266240 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
[12:58:26] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] ssh: FIDO Backup key for Tiziano Fogli [puppet] - https://gerrit.wikimedia.org/r/1266234 (https://phabricator.wikimedia.org/T422020) (owner: Tiziano Fogli)
[12:58:30] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[12:58:45] <wikibugs>	 SRE-Access-Requests, LDAP-Access-Requests, Wikimedia Enterprise, Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777753 (KMontalva-WMF) Thanks @BTul...
[12:59:11] <wikibugs>	 (CR) BPirkle: [C:+1] REST: Publish ReadingLists v0 module in REST Sandbox (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: KineticPelagic)
[12:59:53] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266240 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
[13:00:01] <wikibugs>	 (CR) Btullis: [C:+1] "Thanks. Yep, we will schedule a rolling reboot of both clusters." [puppet] - https://gerrit.wikimedia.org/r/1266236 (owner: Muehlenhoff)
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1004.wikimedia.org
[13:00:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] use_linux612_on_bookworm: Bump kernel to 6.12.74 [puppet] - 10https://gerrit.wikimedia.org/r/1266236 (owner: 10Muehlenhoff)
[13:01:09] <wikibugs>	 (03CR) 10AikoChou: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[13:02:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:02:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:04:36] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:03+1] rest gateway: defined authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler)
[13:05:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2004.wikimedia.org
[13:06:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/1266242
[13:06:39] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.22  refs T420480
[13:06:42] <stashbot>	 T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480
[13:06:47] <Lucas_WMDE>	 o/
[13:06:55] * Lucas_WMDE also sees nothing to deploy
[13:07:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:07:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:07:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:07:54] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T419635)', diff saved to https://phabricator.wikimedia.org/P90174 and previous config saved to /var/cache/conftool/dbconfig/20260401-130753-fceratto.json
[13:07:57] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:07:59] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[13:09:39] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: replace ProxyTimeout by ProxyPass ttl [puppet] - 10https://gerrit.wikimedia.org/r/1266231 (https://phabricator.wikimedia.org/T246763) (owner: 10Hashar)
[13:09:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2004.wikimedia.org
[13:11:26] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1262060 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur)
[13:12:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[13:16:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Use use_linux612_on_bookworm for ml-lab role [puppet] - 10https://gerrit.wikimedia.org/r/1266244
[13:19:52] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru - 3.2 upgrade (T421402)
[13:19:53] <wikibugs>	 (03PS1) 10Eevans: cassandra_dev: add media_analytics role & grants [puppet] - 10https://gerrit.wikimedia.org/r/1266247 (https://phabricator.wikimedia.org/T420008)
[13:19:56] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[13:20:55] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru - 3.2 upgrade (T421402)
[13:21:03] <fabfur>	 !log upgrading magru to haproxy 3.2 (T421402)
[13:21:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:22] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra_dev: add media_analytics role & grants [puppet] - 10https://gerrit.wikimedia.org/r/1266247 (https://phabricator.wikimedia.org/T420008) (owner: 10Eevans)
[13:21:43] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1254.eqiad.wmnet with reason: Maintenance
[13:21:49] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1254 (T419635)', diff saved to https://phabricator.wikimedia.org/P90176 and previous config saved to /var/cache/conftool/dbconfig/20260401-132149-fceratto.json
[13:21:52] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:22:50] <moritzm>	 !log purge prometheus-nginx-exporter from url downloaders, remnants of early hCaptcha rollout
[13:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:42] <wikibugs>	 (03PS3) 10Fabfur: hiera: upgrade haproxy to version 3.2 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1262061 (https://phabricator.wikimedia.org/T421402)
[13:23:47] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262061 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur)
[13:23:58] <wikibugs>	 (03PS1) 10Kamila Součková: shellbox-icu72: Add ClusterIP to TLS cert SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266250 (https://phabricator.wikimedia.org/T419274)
[13:24:44] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: add Cache-Control for Gitiles with mod_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1266238 (https://phabricator.wikimedia.org/T409422)
[13:24:57] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11777910 (10E.Enabulele) Hello @BTullis...
[13:25:48] <wikibugs>	 (03CR) 10Klausman: [C:03+1] Use use_linux612_on_bookworm for ml-lab role [puppet] - 10https://gerrit.wikimedia.org/r/1266244 (owner: 10Muehlenhoff)
[13:26:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Use use_linux612_on_bookworm for ml-lab role [puppet] - 10https://gerrit.wikimedia.org/r/1266244 (owner: 10Muehlenhoff)
[13:27:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:28:43] <wikibugs>	 (03CR) 10Eevans: [C:03+2] admin: add FIDO key for eevans (spare) [puppet] - 10https://gerrit.wikimedia.org/r/1265580 (owner: 10Eevans)
[13:28:46] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Nokia: BGP policy for unicast bgp sw_external outside peerings [homer/public] - 10https://gerrit.wikimedia.org/r/1262197 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney)
[13:28:46] <wikibugs>	 (03CR) 10Kamila Součková: "I am very, very, very sorry :')" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266250 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková)
[13:29:40] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:30:03] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:30:12] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:30:27] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:30:30] <wikibugs>	 (03Merged) 10jenkins-bot: Nokia: BGP policy for unicast bgp sw_external outside peerings [homer/public] - 10https://gerrit.wikimedia.org/r/1262197 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney)
[13:31:05] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/1266242 (owner: 10Muehlenhoff)
[13:31:36] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.provision: add workaround for root user on X14 supermicros [cookbooks] - 10https://gerrit.wikimedia.org/r/1266257 (https://phabricator.wikimedia.org/T418929)
[13:32:09] <wikibugs>	 (03PS1) 10Brouberol: anlytics/hadoop: remove an-worker1148 from the topology [puppet] - 10https://gerrit.wikimedia.org/r/1266259 (https://phabricator.wikimedia.org/T417213)
[13:33:20] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:33:25] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[13:34:14] <wikibugs>	 (03CR) 10Btullis: [C:03+1] anlytics/hadoop: remove an-worker1148 from the topology [puppet] - 10https://gerrit.wikimedia.org/r/1266259 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol)
[13:35:25] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] anlytics/hadoop: remove an-worker1148 from the topology [puppet] - 10https://gerrit.wikimedia.org/r/1266259 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol)
[13:36:31] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T419635)', diff saved to https://phabricator.wikimedia.org/P90177 and previous config saved to /var/cache/conftool/dbconfig/20260401-133629-fceratto.json
[13:36:34] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[13:41:34] <logmsgbot>	 !log ebysans@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply
[13:41:42] <logmsgbot>	 !log ebysans@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply
[13:42:46] <wikibugs>	 (03PS1) 10Kamila Součková: Temporarily add shellbox-icu ClusterIP endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266264 (https://phabricator.wikimedia.org/T419049)
[13:43:16] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:44:09] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-logging1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:45:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11778022 (10elukey) The workaround in the last patch needs a spicerack change for ipmi, since we assume the root user:  ` Traceback (most recent call...
[13:46:39] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P90178 and previous config saved to /var/cache/conftool/dbconfig/20260401-134638-fceratto.json
[13:46:59] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet
[13:49:49] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] "Cool, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli)
[13:50:25] <logmsgbot>	 !log klausman@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ml-lab1002.eqiad.wmnet
[13:51:32] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:51:45] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.hosts.remove-downtime for ml-lab1002.eqiad.wmnet
[13:51:46] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-lab1002.eqiad.wmnet
[13:56:47] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P90179 and previous config saved to /var/cache/conftool/dbconfig/20260401-135646-fceratto.json
[13:57:29] <wikibugs>	 (03CR) 10Jforrester: REST: Publish ReadingLists v0 module in REST Sandbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: 10KineticPelagic)
[13:59:03] <wikibugs>	 (03Merged) 10jenkins-bot: Update fixtures and remove mw-parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266216 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli)
[13:59:12] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Slim down staging resources, and fix main staging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester)
[13:59:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wcqs1003.eqiad.wmnet with OS bullseye
[14:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1400)
[14:00:05] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host wcqs1003
[14:00:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[14:00:54] <James_F>	 jouncebot: nowandnext
[14:00:54] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1400)
[14:00:54] <jouncebot>	 In 0 hour(s) and 29 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1430)
[14:01:03] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Slim down staging resources, and fix main staging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester)
[14:01:04] <James_F>	 Eek, let's get moving.
[14:01:15] <wikibugs>	 (03PS3) 10JMeybohm: CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185
[14:02:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190 (owner: 10Jforrester)
[14:02:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266219 (https://phabricator.wikimedia.org/T421581) (owner: 10Jforrester)
[14:02:33] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:02:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:02:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:02:56] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11778138 (10lanebecker) @BTullis approved for @HShaikh! Thanks.
[14:03:01] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:03:02] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11778139 (10AWesterinen) I still have the error, "Service access denied due to missing privileges." I think that I need "wm...
[14:03:20] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker1148.eqiad.wmnet
[14:04:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wcqs1003 - bking@cumin2002"
[14:04:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wcqs1003 - bking@cumin2002"
[14:04:15] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:04:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache wcqs1003.eqiad.wmnet 9.32.64.10.in-addr.arpa 9.0.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:04:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wcqs1003.eqiad.wmnet 9.32.64.10.in-addr.arpa 9.0.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:04:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wcqs1003
[14:04:21] <wikibugs>	 (03PS6) 10Jforrester: wikifunctions: Bump up orchestrator resources + 2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey)
[14:04:48] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Bump up orchestrator resources + 2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey)
[14:04:59] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:05:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wcqs1003
[14:05:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wcqs1003
[14:05:15] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:05:22] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:05:28] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:06:24] <wikibugs>	 (03PS4) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892)
[14:06:49] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Bump up orchestrator resources + 2->4/4->6 CPU for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261344 (https://phabricator.wikimedia.org/T415067) (owner: 10Elukey)
[14:06:55] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T419635)', diff saved to https://phabricator.wikimedia.org/P90181 and previous config saved to /var/cache/conftool/dbconfig/20260401-140654-fceratto.json
[14:06:58] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:07:00] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1259.eqiad.wmnet with reason: Maintenance
[14:07:08] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1259 (T419635)', diff saved to https://phabricator.wikimedia.org/P90182 and previous config saved to /var/cache/conftool/dbconfig/20260401-140707-fceratto.json
[14:07:14] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:07:30] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:07:47] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:07:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:07:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:07:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:07:57] <wikibugs>	 (03Merged) 10jenkins-bot: MemcachedWrapper: Hash key when longer than 250 characters [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190 (owner: 10Jforrester)
[14:07:58] <wikibugs>	 (03Merged) 10jenkins-bot: Extend queue processing times for abstract fragments [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266219 (https://phabricator.wikimedia.org/T421581) (owner: 10Jforrester)
[14:08:28] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1266190|MemcachedWrapper: Hash key when longer than 250 characters]], [[gerrit:1266219|Extend queue processing times for abstract fragments (T421581)]]
[14:08:31] <stashbot>	 T421581: Abstract Wikipedia is not compatible with new API rate limits - https://phabricator.wikimedia.org/T421581
[14:08:46] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:09:17] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.dns.netbox
[14:09:49] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:10:26] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1266190|MemcachedWrapper: Hash key when longer than 250 characters]], [[gerrit:1266219|Extend queue processing times for abstract fragments (T421581)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:10:30] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:11:37] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru - 3.2 upgrade (T421402)
[14:11:39] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[14:11:44] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru - 3.2 upgrade (T421402)
[14:11:51] <volans>	 !log uploaded cumin_6.0.0 to apt.wikimedia.org bookworm-wikimedia,trixie-wikimedia
[14:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:29] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[14:12:51] <wikibugs>	 (03PS5) 10Jforrester: wikifunctions: Replace check-wf-services.sh with a Python version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243)
[14:12:57] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Replace check-wf-services.sh with a Python version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243) (owner: 10Jforrester)
[14:13:02] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1148.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003"
[14:13:19] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1148.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003"
[14:13:19] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:13:20] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker1148.eqiad.wmnet
[14:13:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11778207 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brouberol@cumin1003 for hosts: `an-worker1148.eqiad.wmnet` - an-worker1148...
[14:13:40] <wikibugs>	 (03CR) 10Jforrester: "Tested with https://www.wikifunctions.org/wiki/Special:RunFunction?call=%7B%22Z1K1%22%3A%22Z7%22%2C%22Z7K1%22%3A%22Z19661%22%2C%22Z19661K1" [extensions/WikiLambda] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266190 (owner: 10Jforrester)
[14:14:39] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-03-25-132409 to 2026-04-01-092119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266272 (https://phabricator.wikimedia.org/T412768)
[14:14:53] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-31-162258 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266273 (https://phabricator.wikimedia.org/T413839)
[14:15:13] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-03-25-132409 to 2026-04-01-092119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266272 (https://phabricator.wikimedia.org/T412768) (owner: 10Jforrester)
[14:15:15] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Replace check-wf-services.sh with a Python version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243) (owner: 10Jforrester)
[14:16:43] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266190|MemcachedWrapper: Hash key when longer than 250 characters]], [[gerrit:1266219|Extend queue processing times for abstract fragments (T421581)]] (duration: 08m 14s)
[14:16:46] <stashbot>	 T421581: Abstract Wikipedia is not compatible with new API rate limits - https://phabricator.wikimedia.org/T421581
[14:16:50] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Make old Bash check script call the Python one [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261456 (https://phabricator.wikimedia.org/T421243) (owner: 10Jforrester)
[14:16:59] <wikibugs>	 (03PS3) 10Jforrester: wikifunctions: Make old Bash check script call the Python one [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261456 (https://phabricator.wikimedia.org/T421243)
[14:17:04] <wikibugs>	 (03CR) 10Jforrester: "…" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261456 (https://phabricator.wikimedia.org/T421243) (owner: 10Jforrester)
[14:17:12] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-03-25-132409 to 2026-04-01-092119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266272 (https://phabricator.wikimedia.org/T412768) (owner: 10Jforrester)
[14:17:18] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11778287 (10BTullis) >>! In T420053#11776109, @AWesterinen wrote: > I believe that the problem is my two different accounts...
[14:18:06] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:19:03] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:19:05] <wikibugs>	 (03PS5) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892)
[14:19:06] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Make old Bash check script call the Python one [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261456 (https://phabricator.wikimedia.org/T421243) (owner: 10Jforrester)
[14:19:36] <wikibugs>	 (03PS6) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892)
[14:19:59] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:20:56] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:21:06] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:21:47] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:21:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11778325 (10BTullis) I have created the kerberos principal. ` btullis@krb1002:~$ sudo manage_principals.py create andreawes...
[14:22:21] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-31-162258 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266273 (https://phabricator.wikimedia.org/T413839) (owner: 10Jforrester)
[14:22:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1003.eqiad.wmnet with reason: host reimage
[14:22:32] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T419635)', diff saved to https://phabricator.wikimedia.org/P90184 and previous config saved to /var/cache/conftool/dbconfig/20260401-142231-fceratto.json
[14:22:35] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:24:20] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-25-132654 to 2026-03-31-162258 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266273 (https://phabricator.wikimedia.org/T413839) (owner: 10Jforrester)
[14:25:54] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:26:07] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 202271256 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:26:18] <logmsgbot>	 !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:26:40] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:27:44] <logmsgbot>	 !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:28:09] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3256960 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:28:10] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:28:18] <wikibugs>	 (03PS1) 10Atsuko: admin/data: promoted atsuko to ops [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860)
[14:28:39] <logmsgbot>	 !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:28:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1003.eqiad.wmnet with reason: host reimage
[14:29:33] <wikibugs>	 (03CR) 10Effie Mouzeli: "woohoo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 (owner: 10JMeybohm)
[14:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1400)
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1430)
[14:32:40] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P90186 and previous config saved to /var/cache/conftool/dbconfig/20260401-143239-fceratto.json
[14:36:11] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07ci-test-error, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Kubernetes: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values - https://phabricator.wikimedia.org/T421362#11778449 (10JMeybohm) 05Open→03Invalid This might as well have be...
[14:37:32] <wikibugs>	 (03CR) 10AikoChou: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[14:37:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 07Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11778469 (10brouberol) @Jclark-ctr an-worker1148 is now in decommissioning status (https://netbox.wikimedia.org/dcim/devices/3661/). Over to you, with...
[14:38:52] <wikibugs>	 (03CR) 10Muehlenhoff: admin/data: promoted atsuko to ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko)
[14:39:39] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah)
[14:40:43] <wikibugs>	 (03PS1) 10Brouberol: anlytics/hadoop: fix typo in the yarn queue mapping [puppet] - 10https://gerrit.wikimedia.org/r/1266282 (https://phabricator.wikimedia.org/T417213)
[14:41:38] <wikibugs>	 (03CR) 10Btullis: [C:03+1] anlytics/hadoop: fix typo in the yarn queue mapping [puppet] - 10https://gerrit.wikimedia.org/r/1266282 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol)
[14:41:52] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: upgrade haproxy to version 3.2 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1262061 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur)
[14:42:48] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P90187 and previous config saved to /var/cache/conftool/dbconfig/20260401-144247-fceratto.json
[14:43:25] <wikibugs>	 (03PS7) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892)
[14:43:48] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1262061 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur)
[14:43:53] <wikibugs>	 (03CR) 10Gkyziridis: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[14:44:17] <fabfur>	 !log upgrading ulsfo to haproxy 3.2 (T421402) 
[14:44:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:20] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[14:44:56] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo - 3.2 upgrade (T421402)
[14:44:57] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo - 3.2 upgrade (T421402)
[14:46:41] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] anlytics/hadoop: fix typo in the yarn queue mapping [puppet] - 10https://gerrit.wikimedia.org/r/1266282 (https://phabricator.wikimedia.org/T417213) (owner: 10Brouberol)
[14:47:53] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker2004.codfw.wmnet
[14:48:27] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host db1208.eqiad.wmnet
[14:50:18] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker2004.codfw.wmnet
[14:50:43] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker2005.codfw.wmnet
[14:52:14] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 34968
[14:52:44] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 34968
[14:52:57] <logmsgbot>	 !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T419635)', diff saved to https://phabricator.wikimedia.org/P90188 and previous config saved to /var/cache/conftool/dbconfig/20260401-145256-fceratto.json
[14:52:59] <stashbot>	 T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
[14:53:01] <wikibugs>	 (03PS1) 10Majavah: Revert "dumps: web: Trust X-Client-IP from edge caches" [puppet] - 10https://gerrit.wikimedia.org/r/1266287
[14:53:13] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:54:23] <wikibugs>	 (03PS1) 10Majavah: Revert "Add dumps-http.discovery.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/1266288
[14:54:24] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 (owner: 10JMeybohm)
[14:55:47] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[14:55:50] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[14:56:06] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker2005.codfw.wmnet
[14:57:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Revert "dumps: web: Trust X-Client-IP from edge caches" [puppet] - 10https://gerrit.wikimedia.org/r/1266287 (owner: 10Majavah)
[14:57:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Revert "Add dumps-http.discovery.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/1266288 (owner: 10Majavah)
[14:57:39] <wikibugs>	 (03CR) 10Majavah: [C:03+2] Revert "Add dumps-http.discovery.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/1266288 (owner: 10Majavah)
[14:57:42] <logmsgbot>	 !log taavi@dns1004 START - running authdns-update
[14:58:05] <wikibugs>	 (03CR) 10Majavah: [C:03+2] Revert "dumps: web: Trust X-Client-IP from edge caches" [puppet] - 10https://gerrit.wikimedia.org/r/1266287 (owner: 10Majavah)
[14:58:21] <James_F>	 jouncebot: nowandnext
[14:58:21] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1400)
[14:58:21] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1430)
[14:58:21] <jouncebot>	 In 2 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700)
[14:59:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wcqs1003.eqiad.wmnet with OS bullseye
[14:59:04] <wikibugs>	 (03PS1) 10Jforrester: Wikifunctions: Switch cache from mcrouter-wikifunctions to special access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266290 (https://phabricator.wikimedia.org/T411807)
[14:59:23] <logmsgbot>	 !log taavi@dns1004 END - running authdns-update
[14:59:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266290 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester)
[15:00:25] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db1208.eqiad.wmnet
[15:00:33] <wikibugs>	 (03Merged) 10jenkins-bot: Wikifunctions: Switch cache from mcrouter-wikifunctions to special access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266290 (https://phabricator.wikimedia.org/T411807) (owner: 10Jforrester)
[15:00:53] <icinga-wm>	 PROBLEM - MariaDB Replica IO: matomo on db1208 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:00:53] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: matomo on db1208 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:00:55] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: matomo on db1208 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:00:55] <icinga-wm>	 PROBLEM - MariaDB read only matomo on db1208 is CRITICAL: Could not connect to localhost:3351 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[15:00:55] <icinga-wm>	 PROBLEM - mysqld processes on db1208 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:01:00] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1266290|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T411807)]]
[15:01:03] <stashbot>	 T411807: WF memcached service is dc-local but used for dc-global content - https://phabricator.wikimedia.org/T411807
[15:02:55] <icinga-wm>	 RECOVERY - mysqld processes on db1208 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:02:59] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1266290|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T411807)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:03:49] <btullis>	 Sorry for the blip on db1208. That was me restarting it.
[15:03:55] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: matomo on db1208 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:03:55] <icinga-wm>	 RECOVERY - MariaDB read only matomo on db1208 is OK: Version 10.6.18-MariaDB-log, Uptime 60s, read_only: True, event_scheduler: True, 11.22 QPS, connection latency: 0.032044s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[15:04:53] <icinga-wm>	 RECOVERY - MariaDB Replica IO: matomo on db1208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:04:53] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: matomo on db1208 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:07:50] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis)
[15:08:12] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11778686 (10RThomas-WMF) Thanks @BTullis, here is my pub key...
[15:09:19] <wikibugs>	 (03PS1) 10Majavah: P:dumps::distribution::web: Rsync logs from all servers [puppet] - 10https://gerrit.wikimedia.org/r/1266291 (https://phabricator.wikimedia.org/T422042)
[15:09:35] <wikibugs>	 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Wikimedia Enterprise, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11778690 (10LDlulisa-WMF) Thanks @BTullis! Here is my public...
[15:09:42] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[15:10:19] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8365/co" [puppet] - 10https://gerrit.wikimedia.org/r/1266291 (https://phabricator.wikimedia.org/T422042) (owner: 10Majavah)
[15:10:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[15:11:27] * Lucas_WMDE tries to test https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1218858 on mw-experimental
[15:11:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[15:11:50] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
[15:12:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
[15:13:04] <wikibugs>	 (03CR) 10Atsuko: "checking with brouberol and btullis about it (maybe we should clean it up)" [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko)
[15:13:54] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266290|Wikifunctions: Switch cache from mcrouter-wikifunctions to special access (T411807)]] (duration: 12m 53s)
[15:13:55] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: update upstream_response_timeout for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1266181 (https://phabricator.wikimedia.org/T421827)
[15:13:57] <stashbot>	 T411807: WF memcached service is dc-local but used for dc-global content - https://phabricator.wikimedia.org/T411807
[15:18:14] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: update upstream_response_timeout for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1266181 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb)
[15:19:10] <Lucas_WMDE>	 claime: if you’re around – is there a way to run maintenance scripts on mw-experimental?
[15:19:35] <Lucas_WMDE>	 (the latest comments in https://phabricator.wikimedia.org/T341560 sound like that might not be possible yet :/)
[15:20:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T421714, prepare newly-reimaged host) xfer wikidata from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards
[15:20:59] <stashbot>	 T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421714
[15:21:17] <Lucas_WMDE>	 ah, sounds like T405688 is the more specific task for it
[15:21:18] <stashbot>	 T405688: Support shell to mw-experimental pod - https://phabricator.wikimedia.org/T405688
[15:21:34] <wikibugs>	 (03Merged) 10jenkins-bot: CI: Send User-Agent when fetching data from gitiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266185 (owner: 10JMeybohm)
[15:21:46] <cdanis>	 Lucas_WMDE: there's a script in a paste on that one :)
[15:22:13] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T421714, prepare newly-reimaged host) xfer wikidata from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards
[15:22:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards
[15:22:13] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards
[15:22:35] * Lucas_WMDE tries that script
[15:23:38] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards
[15:23:39] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards
[15:24:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards
[15:24:45] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[15:24:48] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[15:26:15] <Lucas_WMDE>	 claime: no mwscript-k8s in that script’s shell either :/
[15:26:18] <Lucas_WMDE>	 (also no foreachwiki)
[15:26:40] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[15:26:43] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[15:26:47] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo - 3.2 upgrade (T421402)
[15:26:50] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[15:32:17] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo - 3.2 upgrade (T421402)
[15:32:27] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[15:32:35] <wikibugs>	 (03CR) 10Btullis: admin/data: promoted atsuko to ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko)
[15:34:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko)
[15:38:37] <Lucas_WMDE>	 (left a comment on the task to that effect)
[15:38:56] <wikibugs>	 (03PS1) 10Fabfur: cache::haproxy: rename deprecated instructions in haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030)
[15:39:22] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:42:39] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur)
[15:44:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:45:31] <wikibugs>	 (03CR) 10Btullis: [C:03+2] "Merging on Atsuko's behalf." [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko)
[15:45:41] <wikibugs>	 (03CR) 10Btullis: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1266275 (https://phabricator.wikimedia.org/T421860) (owner: 10Atsuko)
[15:46:11] <wikibugs>	 (03PS2) 10Fabfur: cache::haproxy: rename deprecated instructions in haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030)
[15:46:23] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur)
[15:47:40] <wikibugs>	 (03CR) 10Vgutierrez: cache::haproxy: rename deprecated instructions in haproxy 3.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur)
[15:48:31] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: rename deprecated instructions in haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur)
[15:48:57] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] cache::haproxy: rename deprecated instructions in haproxy 3.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur)
[15:56:12] <wikibugs>	 (03PS3) 10Fabfur: cache::haproxy: rename deprecated instructions in haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030)
[15:56:29] <wikibugs>	 (03CR) 10Fabfur: cache::haproxy: rename deprecated instructions in haproxy 3.2 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur)
[15:58:25] <urbanecm>	 Lucas_WMDE: foreachwikiindblist is basically `for wiki in $(php /srv/mediawiki/multiversion/bin/expanddblist private); do echo -e "------\n$wiki\n-----"; php /srv/mediawiki/multiversion/MWScript.php Version.php --wiki="$wiki"; done`
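The loop urbanecm describes can be sketched as a dry run in Python. This is only an illustration under stated assumptions, not how foreachwikiindblist is actually implemented: the helper name `build_commands` and the wiki names are hypothetical, and only the MWScript.php path and the Version.php script are taken from the message above. It builds the per-wiki command strings without executing anything.

```python
import shlex

# Path quoted in the chat message; treat it as an assumption elsewhere.
MWSCRIPT = "/srv/mediawiki/multiversion/MWScript.php"


def build_commands(wikis, script="Version.php"):
    """Return one shell command string per wiki, mirroring the loop body.

    Hypothetical helper for illustration: nothing is executed here.
    """
    commands = []
    for wiki in wikis:
        argv = ["php", MWSCRIPT, script, f"--wiki={wiki}"]
        # shlex.join quotes each argument so the string is safe to paste.
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Hypothetical wiki names; in the quoted one-liner the list comes from
    # `php /srv/mediawiki/multiversion/bin/expanddblist private`.
    for cmd in build_commands(["examplewiki", "otherwiki"]):
        print(cmd)
```

Generating the commands as strings first makes it possible to review the list before running anything against a real wiki.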
[16:00:13] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur)
[16:00:16] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur)
[16:01:46] <urbanecm>	 jouncebot: nowandnext
[16:01:47] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 58 minute(s)
[16:01:47] <jouncebot>	 In 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700)
[16:02:09] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache::haproxy: rename deprecated instructions in haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1266301 (https://phabricator.wikimedia.org/T422030) (owner: 10Fabfur)
[16:02:43] <wikibugs>	 (03PS1) 10Urbanecm: Set the default for UserEmailConfirmationUseHTML to true [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266309 (https://phabricator.wikimedia.org/T411147)
[16:02:48] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Set the default for UserEmailConfirmationUseHTML to true [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266309 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm)
[16:03:08] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: FY2526 Q3:rack/setup/install restbase2039 - https://phabricator.wikimedia.org/T416538#11778935 (10Jhancock.wm)
[16:05:04] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: FY2526 Q3:rack/setup/install restbase2039 - https://phabricator.wikimedia.org/T416538#11778957 (10Jhancock.wm) this server is having the issue found in T418929 where we can't add the root user because of hardware changes
[16:07:35] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#11778969 (10Jhancock.wm)
[16:08:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#11778972 (10Jhancock.wm) turns out these servers are also having the same issue as these servers https://phabricator.wikimedia.org/T418929 so got a little to figure out if...
[16:09:13] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:09:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266309 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm)
[16:09:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260011 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm)
[16:09:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[16:10:38] <wikibugs>	 06SRE, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Configure dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T421465#11778979 (10Jclark-ctr)
[16:10:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11778980 (10BTullis) 05Resolved→03Open a:05VRiley-WMF→03BTullis Hi @VRiley-WMF - Apologies for the delay in getting back to you. We haven't had a c...
[16:11:41] <wikibugs>	 (03PS1) 10Mmartorana: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266314 (https://phabricator.wikimedia.org/T421366)
[16:13:30] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266314 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana)
[16:13:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding conf2007 to codfw - jhancock@cumin2002"
[16:13:46] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding conf2007 to codfw - jhancock@cumin2002"
[16:13:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:14:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host conf2007
[16:14:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host conf2007
[16:14:47] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host conf2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:15:54] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host conf2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:17:58] <wikibugs>	 (03Merged) 10jenkins-bot: Set the default for UserEmailConfirmationUseHTML to true [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266309 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm)
[16:18:14] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260011 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm)
[16:18:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11779025 (10Jhancock.wm)
[16:19:11] <wikibugs>	 (03Merged) 10jenkins-bot: cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260011 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm)
[16:19:38] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1266309|Set the default for UserEmailConfirmationUseHTML to true (T411147)]], [[gerrit:1260011|cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) (T411147)]]
[16:19:41] <stashbot>	 T411147: Remove emailability code from GrowthExperiments - https://phabricator.wikimedia.org/T411147
[16:19:49] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q3:rack/setup/install conf200[7-9] - https://phabricator.wikimedia.org/T418914#11779030 (10Jhancock.wm) we're having the issue that was documented in https://phabricator.wikimedia.org/T418929 with these servers. still working...
[16:20:34] * Lucas_WMDE is done with mw-experimental btw
[16:21:09] <Lucas_WMDE>	 urbanecm: thanks, I went with something similar at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1218858/7#message-f8a5d7c9d14007456a8e87d96daa41266b158f3f (though I didn’t know expanddblist is available in the repo so I just got the list from https://noc.wikimedia.org/conf/dblists/all.dblist)
[16:21:36] <logmsgbot>	 !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1266309|Set the default for UserEmailConfirmationUseHTML to true (T411147)]], [[gerrit:1260011|cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) (T411147)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:22:22] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] Update Cumin alias for contint to also cover the spun-off Trixie role [puppet] - 10https://gerrit.wikimedia.org/r/1266215 (owner: 10Muehlenhoff)
[16:22:23] <Dreamy_Jazz>	 jouncebot: nowandnext
[16:22:23] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 37 minute(s)
[16:22:23] <jouncebot>	 In 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700)
[16:22:33] <Dreamy_Jazz>	 Anyone after you?
[16:22:42] <urbanecm>	 Dreamy_Jazz: probably you? :D
[16:22:46] <Dreamy_Jazz>	 :D
[16:23:17] <wikibugs>	 (03PS1) 10Dreamy Jazz: hCaptcha: Add log and counter when all SiteVerify attempts fail [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266316 (https://phabricator.wikimedia.org/T421678)
[16:23:30] <wikibugs>	 (03PS1) 10Dreamy Jazz: hCaptcha: Add log and counter when all SiteVerify attempts fail [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266317 (https://phabricator.wikimedia.org/T421678)
[16:24:54] <logmsgbot>	 !log urbanecm@deploy1003 urbanecm: Continuing with sync
[16:25:13] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 139423488 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:26:33] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:26:37] <logmsgbot>	 !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[16:27:13] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 58696 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:28:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hCaptcha: Add log and counter when all SiteVerify attempts fail [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266317 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz)
[16:29:08] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266309|Set the default for UserEmailConfirmationUseHTML to true (T411147)]], [[gerrit:1260011|cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) (T411147)]] (duration: 09m 31s)
[16:29:12] <stashbot>	 T411147: Remove emailability code from GrowthExperiments - https://phabricator.wikimedia.org/T411147
[16:30:05] <urbanecm>	 Dreamy_Jazz: over to you
[16:30:08] <Dreamy_Jazz>	 Thanks
[16:30:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266317 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz)
[16:30:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266316 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz)
[16:32:55] <wikibugs>	 (03PS2) 10Daniel Kinzler: rest gateway: define authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280)
[16:33:39] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-restart-haproxy rolling restart of HAProxy on A:cp-drmrs - New configuration/test (T421402)
[16:33:42] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[16:34:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266317 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz)
[16:34:13] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:34:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266316 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz)
[16:34:58] <wikibugs>	 (03PS2) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581)
[16:36:16] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T421714, prepare newly-reimaged host) xfer commons from wcqs1001.eqiad.wmnet -> wcqs1003.eqiad.wmnet, repooling both afterwards
[16:36:19] <stashbot>	 T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421714
[16:37:30] <wikibugs>	 (03PS3) 10Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - 10https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213)
[16:39:07] <Dreamy_Jazz>	 jouncebot: nowandnext
[16:39:07] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 20 minute(s)
[16:39:07] <jouncebot>	 In 0 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700)
[16:39:13] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:41:43] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Add log and counter when all SiteVerify attempts fail [extensions/ConfirmEdit] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266317 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz)
[16:41:46] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Add log and counter when all SiteVerify attempts fail [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266316 (https://phabricator.wikimedia.org/T421678) (owner: 10Dreamy Jazz)
[16:41:53] <wikibugs>	 (03PS3) 10Daniel Kinzler: rest gateway: define authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280)
[16:42:13] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1266317|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]], [[gerrit:1266316|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]]
[16:42:16] <stashbot>	 T421678: hCaptcha: Retry SiteVerify API requests when http error occurs - https://phabricator.wikimedia.org/T421678
[16:43:35] <wikibugs>	 10ops-codfw, 06DC-Ops: Alert for device lsw1-c7-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T422058 (10phaultfinder) 03NEW
[16:44:13] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1266317|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]], [[gerrit:1266316|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:49:33] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[16:52:38] <wikibugs>	 10ops-codfw, 06DC-Ops: Alert for device lsw1-b4-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T422061 (10phaultfinder) 03NEW
[16:53:44] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266317|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]], [[gerrit:1266316|hCaptcha: Add log and counter when all SiteVerify attempts fail (T421678)]] (duration: 11m 30s)
[16:53:47] <stashbot>	 T421678: hCaptcha: Retry SiteVerify API requests when http error occurs - https://phabricator.wikimedia.org/T421678
[16:54:35] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:03+1] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler)
[16:57:35] <Amir1>	 jouncebot: nowandnext
[16:57:35] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 2 minute(s)
[16:57:35] <jouncebot>	 In 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700)
[16:58:45] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[16:59:42] <urbanecm>	 Amir1: i might have a backport
[16:59:46] <urbanecm>	 but also infra...
[17:00:00] <Amir1>	 I asked in -sre to see if people are using it 
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T1700)
[17:00:08] <wikibugs>	 06SRE, 10SRE-Access-Requests: Yubikey-SSH-FIDO for Tiziano Fogli (tappof / BACKUP) - https://phabricator.wikimedia.org/T422020#11779328 (10hnowlan) 05Open→03Resolved
[17:00:26] <Amir1>	 urbanecm: I'd say go for it
[17:00:38] <swfrench-wmf>	 o/
[17:00:50] <urbanecm>	 Amir1: swfrench-wmf just said they're using it?
[17:00:57] <urbanecm>	 (I'm happy to wait)
[17:01:09] <swfrench-wmf>	 so, I do have some work planned, but as long as it's not l10n update you have planned, please go ahead :)
[17:01:09] <Amir1>	 ah okay, right. So let's wait
[17:01:27] <swfrench-wmf>	 (I have a bit of prep to do first that can happen in parallel)
[17:02:59] <urbanecm>	 not an i18n update, but it appears to be conflicting
[17:03:03] * urbanecm is disappearing
[17:03:13] <swfrench-wmf>	 ah, got it
[17:03:22] <swfrench-wmf>	 alright, I'll continue with my plans then
[17:03:32] <wikibugs>	 (03PS1) 10Snwachukwu: Media Aanlytics Production Image Version Change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266323 (https://phabricator.wikimedia.org/T415202)
[17:03:46] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1178657 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[17:03:48] <wikibugs>	 (03CR) 10Scott French: [C:03+2] hieradata: disable and remove unused image-suggestion listener [puppet] - 10https://gerrit.wikimedia.org/r/1178657 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[17:07:18] <Amir1>	 swfrench-wmf: when you're done, would you mind giving me a ping? Gods of thumbnails will be grateful 
[17:08:19] <swfrench-wmf>	 Amir1: yes, can do
[17:08:20] <urbanecm>	 (they have a god? i should be more afraid of them now...)
[17:08:27] <swfrench-wmf>	 :)
[17:09:44] <logmsgbot>	 !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-restart-haproxy (exit_code=0) rolling restart of HAProxy on A:cp-drmrs - New configuration/test (T421402)
[17:09:47] <stashbot>	 T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[17:10:20] <Amir1>	 Yes, it's Lorax, the guardian of the trees. Looking disapprovingly at all of the CPU cycles being wasted due to cache fragmentation
[17:12:35] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: helmfile-only deployment to remove unused image-suggestion listener - T368096
[17:12:39] <stashbot>	 T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096
[17:18:06] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: helmfile-only deployment to remove unused image-suggestion listener - T368096 (duration: 07m 25s)
[17:18:09] <stashbot>	 T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096
[17:21:12] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[17:21:23] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[17:21:29] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[17:21:38] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[17:21:59] <wikibugs>	 06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11779456 (10BCornwall) 05In progress→03Resolved a:03BCornwall
[17:23:35] <SandraEbele_>	 !Deploying Refinery at fa28ad8 for change 1250005 / T415202 Extend mediarequest Cassandra loads with poster/plays for video-requests API
[17:23:36] <stashbot>	 T415202: Introduce a new AQS endpoint to expose video plays - https://phabricator.wikimedia.org/T415202
[17:23:58] <swfrench-wmf>	 Amir1: I think all of mediawiki-touching is complete. all yours!
[17:24:19] <Amir1>	 Thank you <3
[17:24:22] <jinxer-wm>	 FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:26:57] <wikibugs>	 (03PS1) 10Ladsgroup: Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266326 (https://phabricator.wikimedia.org/T414805)
[17:27:09] <wikibugs>	 (03PS1) 10Ladsgroup: Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266327 (https://phabricator.wikimedia.org/T414805)
[17:27:15] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266326 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup)
[17:27:18] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266327 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup)
[17:30:34] <logmsgbot>	 !log ebysans@deploy1003 Started deploy [analytics/refinery@fa28ad8] (hadoop-test): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 TEST [analytics/refinery@fa28ad83]
[17:30:39] <stashbot>	 T415202: Introduce a new AQS endpoint to expose video plays - https://phabricator.wikimedia.org/T415202
[17:31:30] <wikibugs>	 (03CR) 10Scott French: [C:03+2] service: move image-suggestion to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1198575 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[17:31:50] <wikibugs>	 (03CR) 10Mforns: [C:03+1] Media Aanlytics Production Image Version Change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266323 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu)
[17:32:25] <wikibugs>	 06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11779507 (10ssingh) Thanks for taking care of this @BCornwall!
[17:32:26] <logmsgbot>	 !log ebysans@deploy1003 Finished deploy [analytics/refinery@fa28ad8] (hadoop-test): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 TEST [analytics/refinery@fa28ad83] (duration: 01m 52s)
[17:32:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.eqiad.wmnet with OS bullseye
[17:32:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cloudelastic1009
[17:33:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:33:23] <logmsgbot>	 !log ebysans@deploy1003 Started deploy [analytics/refinery@fa28ad8]: Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83]
[17:37:38] <logmsgbot>	 !log ebysans@deploy1003 Finished deploy [analytics/refinery@fa28ad8]: Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83] (duration: 04m 15s)
[17:37:43] <stashbot>	 T415202: Introduce a new AQS endpoint to expose video plays - https://phabricator.wikimedia.org/T415202
[17:38:01] <logmsgbot>	 !log ebysans@deploy1003 Started deploy [analytics/refinery@fa28ad8] (thin): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83]
[17:38:41] <logmsgbot>	 bking@cumin2002 reimage (PID 3684606) is awaiting input
[17:39:55] <logmsgbot>	 !log ebysans@deploy1003 Finished deploy [analytics/refinery@fa28ad8] (thin): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83] (duration: 01m 53s)
[17:41:20] <wikibugs>	 (03Merged) 10jenkins-bot: Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266326 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup)
[17:41:22] <wikibugs>	 (03Merged) 10jenkins-bot: Refix thumb steps for the poster image of videos [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266327 (https://phabricator.wikimedia.org/T414805) (owner: 10Ladsgroup)
[17:42:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cloudelastic1009 - bking@cumin2002"
[17:42:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cloudelastic1009 - bking@cumin2002"
[17:42:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:42:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cloudelastic1009.eqiad.wmnet 30.32.64.10.in-addr.arpa 0.3.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[17:42:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudelastic1009.eqiad.wmnet 30.32.64.10.in-addr.arpa 0.3.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
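
Note on the sre.dns.wipe-cache entries above: the `in-addr.arpa` and `ip6.arpa` names are reverse-DNS (PTR) names, formed by reversing the address's octets (IPv4) or hex nibbles (IPv6). A minimal sketch of that mapping, using only Python's standard `ipaddress` module, follows. The helper names are hypothetical, and the addresses in the usage note are derived from the PTR names in the log rather than stated in it.

```python
import ipaddress


def reverse_pointer(addr: str) -> str:
    """Return the reverse-DNS (PTR) name for an IPv4 or IPv6 address."""
    return ipaddress.ip_address(addr).reverse_pointer


def address_from_pointer(name: str) -> str:
    """Recover the address from an in-addr.arpa or ip6.arpa name (hypothetical helper)."""
    name = name.rstrip(".").lower()
    if name.endswith(".in-addr.arpa"):
        # IPv4: labels are the four octets in reverse order.
        octets = name[: -len(".in-addr.arpa")].split(".")
        return str(ipaddress.IPv4Address(".".join(reversed(octets))))
    if name.endswith(".ip6.arpa"):
        # IPv6: labels are the 32 hex nibbles in reverse order.
        nibbles = name[: -len(".ip6.arpa")].split(".")
        if len(nibbles) != 32:
            raise ValueError("expected 32 nibbles in an ip6.arpa name")
        return str(ipaddress.IPv6Address(int("".join(reversed(nibbles)), 16)))
    raise ValueError("not a reverse-DNS name: " + name)


if __name__ == "__main__":
    # PTR names copied from the cloudelastic1009 wipe-cache log entry.
    print(address_from_pointer("30.32.64.10.in-addr.arpa"))
    print(address_from_pointer(
        "0.3.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa"
    ))
```

Applied to the cloudelastic1009 entry, the two names decode to 10.64.32.30 and 2620:0:861:103:10:64:32:30, and `reverse_pointer` maps those addresses back to the same names.
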
[17:42:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1009
[17:43:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1009
[17:43:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cloudelastic1009
[17:46:35] <wikibugs>	 (03CR) 10ArielGlenn: rest gateway: define authed-user class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler)
[17:47:59] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1266326|Refix thumb steps for the poster image of videos (T414805)]], [[gerrit:1266327|Refix thumb steps for the poster image of videos (T414805)]]
[17:48:01] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[17:50:02] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1266326|Refix thumb steps for the poster image of videos (T414805)]], [[gerrit:1266327|Refix thumb steps for the poster image of videos (T414805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:51:16] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] Media Aanlytics Production Image Version Change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266323 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu)
[17:51:32] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:52:05] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[17:53:18] <wikibugs>	 (03Merged) 10jenkins-bot: Media Aanlytics Production Image Version Change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266323 (https://phabricator.wikimedia.org/T415202) (owner: 10Snwachukwu)
[17:55:19] <wikibugs>	 06SRE, 10Infrastructure Security: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#11779615 (10herron)
[17:55:42] <wikibugs>	 06SRE, 10Infrastructure Security: Consider "inner" and "outer" ssh keys to reduce taps through the day - https://phabricator.wikimedia.org/T422068#11779619 (10herron)
[17:56:17] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266326|Refix thumb steps for the poster image of videos (T414805)]], [[gerrit:1266327|Refix thumb steps for the poster image of videos (T414805)]] (duration: 08m 18s)
[17:56:20] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[17:57:13] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 264965128 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:58:13] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3628800 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:59:49] <wikibugs>	 (03PS3) 10Scott French: wmnet: remove image-suggestion k8s ingress CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096)
[18:01:33] <SandraEbele_>	 !log Deployed refinery using scap, then deployed onto hdfs
[18:01:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:26] <logmsgbot>	 !log aokoth@cumin1003 START - Cookbook sre.vrts.upgrade  on VRTS host vrts1003.eqiad.wmnet
[18:03:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage
[18:04:17] <logmsgbot>	 !log aokoth@cumin1003 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0)  on VRTS host vrts1003.eqiad.wmnet
[18:05:56] <jinxer-wm>	 FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
[18:08:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11779670 (10VRiley-WMF) As requested, I went into supermicro support to create a service ticket with supermicro. It seems that John has created a ticket for th...
[18:10:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1009.eqiad.wmnet with reason: host reimage
[18:10:08] <logmsgbot>	 !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/media-analytics: apply
[18:10:21] <logmsgbot>	 !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply
[18:10:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11779683 (10Jgreen) a:05Jgreen→03Jclark-ctr @Jclark-ctr can you check the network cables for fransw1002? The first network interface doesn't appear to have link.
[18:10:46] <logmsgbot>	 !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/media-analytics: apply
[18:11:00] <logmsgbot>	 !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply
[18:14:01] <wikibugs>	 (03CR) 10ArielGlenn: rest gateway: introduce policy for abstractwiki/wikifunctions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler)
[18:16:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11779697 (10VRiley-WMF) Dell has come back with the following on this ticket  "If the issue is configuration or firmware related, the drive should format normally once correct...
[18:19:27] <swfrench-wmf>	 FYI, I'll continue with some further image-suggestion cleanup in the background (no impact expected)
[18:42:27] <wikibugs>	 (03PS3) 10Scott French: image-suggestion: remove service configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096)
[18:42:36] <wikibugs>	 (03PS3) 10Scott French: deployment_server: absent image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096)
[18:42:43] <wikibugs>	 (03PS3) 10Scott French: deployment_server: remove absented image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096)
[18:48:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11779792 (10VRiley-WMF) a:03VRiley-WMF
[18:54:22] <jinxer-wm>	 FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:59:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1009.eqiad.wmnet with OS bullseye
[19:02:05] <wikibugs>	 (03CR) 10Blake: [C:03+1] deployment_server: absent image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[19:02:30] <wikibugs>	 (03CR) 10Blake: [C:03+1] deployment_server: remove absented image-suggestion k8s creds config [puppet] - 10https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[19:03:55] <wikibugs>	 (03CR) 10Blake: [C:03+1] image-suggestion: remove service configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French)
[19:05:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 543552448 and 47 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:13:24] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "This resolved T412520." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1263878 (owner: 10Daniel Kinzler)
[19:14:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 28128 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:31:24] <wikibugs>	 (03PS1) 10Bking: opensearch: handle IP changes for software firewall [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714)
[19:31:50] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking)
[19:43:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:48:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[19:49:45] <swfrench-wmf>	 Amir1: are you around? I think something's not quite right with your TMH patches
[19:49:58] <dancy>	 I filed https://phabricator.wikimedia.org/T422074
[19:50:11] <swfrench-wmf>	 thanks, dancy!
[19:52:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.eqiad.wmnet with OS bullseye
[19:53:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cloudelastic1010
[19:53:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:58:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cloudelastic1010 - bking@cumin2002"
[19:58:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cloudelastic1010 - bking@cumin2002"
[19:58:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:58:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cloudelastic1010.eqiad.wmnet 24.48.64.10.in-addr.arpa 4.2.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:58:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudelastic1010.eqiad.wmnet 24.48.64.10.in-addr.arpa 4.2.0.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:58:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1010
[20:00:01] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1010
[20:00:01] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cloudelastic1010
[20:00:02] <wikibugs>	 (03CR) 10Muehlenhoff: opensearch: handle IP changes for software firewall (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2000).
[20:00:04] <jouncebot>	 manfredi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:21] <manfredi>	 I am here
[20:01:01] <cjming>	 hi manfredi! do you need a deployer?
[20:01:10] <manfredi>	 yes please
[20:01:18] <cjming>	 i can deploy for you - 1 sec
[20:01:24] <manfredi>	 thanks!
[20:01:45] <wikibugs>	 (03PS2) 10Mmartorana: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266314 (https://phabricator.wikimedia.org/T421366)
[20:02:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266314 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana)
[20:03:04] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 281 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1258, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 268, delayed_unassigned_shards
[20:03:04] <icinga-wm>	 ber_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.74139051332034 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:03:04] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 281 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1258, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 268, delayed_unassigned_shards
[20:03:04] <icinga-wm>	 ber_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.74139051332034 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:03:06] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 280 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1259, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 267, delayed_unassigned_shards
[20:03:06] <icinga-wm>	 ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.80636777128005 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:03:08] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 280 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1259, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 267, delayed_unassigned_shards
[20:03:08] <icinga-wm>	 ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.80636777128005 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:04:27] <wikibugs>	 (03Merged) 10jenkins-bot: config: Enable EmailConfirmationBanner on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1266314 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana)
[20:04:52] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1266314|config: Enable EmailConfirmationBanner on mediawikiwiki (T421366)]]
[20:04:55] <stashbot>	 T421366: Test Kitchen Experiment setup to measure the impact of the banner - https://phabricator.wikimedia.org/T421366
[20:06:53] <logmsgbot>	 !log cjming@deploy1003 mmartorana, cjming: Backport for [[gerrit:1266314|config: Enable EmailConfirmationBanner on mediawikiwiki (T421366)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:07:05] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1313, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_t
[20:07:05] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.31513970110461 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:07:05] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1313, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_t
[20:07:05] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.31513970110461 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:07:07] <cjming>	 manfredi: on mwdebug if you want to test
[20:07:07] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1313, relocating_shards: 1, initializing_shards: 13, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_t
[20:07:07] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.31513970110461 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:07:09] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 768, active_shards: 1314, relocating_shards: 1, initializing_shards: 12, unassigned_shards: 213, delayed_unassigned_shards: 0, number_of_pending_t
[20:07:09] <icinga-wm>	  number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.38011695906432 https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:07:21] <manfredi>	 ok
[20:07:28] <cjming>	 manfredi: lmk when i can sync - standing by
[20:09:18] <manfredi>	 All good. Go ahead please
[20:09:22] <cjming>	 great!
[20:09:26] <logmsgbot>	 !log cjming@deploy1003 mmartorana, cjming: Continuing with sync
[20:13:40] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266314|config: Enable EmailConfirmationBanner on mediawikiwiki (T421366)]] (duration: 08m 47s)
[20:13:43] <stashbot>	 T421366: Test Kitchen Experiment setup to measure the impact of the banner - https://phabricator.wikimedia.org/T421366
[20:13:51] <cjming>	 manfredi: should be live!
[20:14:01] <wikibugs>	 (03PS4) 10Daniel Kinzler: rest gateway: define authed-user class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280)
[20:14:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:14:15] <wikibugs>	 (03CR) 10Daniel Kinzler: rest gateway: define authed-user class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler)
[20:14:25] <wikibugs>	 (03PS3) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581)
[20:14:55] <manfredi>	 cjming: thank you! I appreciate it 
[20:15:08] <cjming>	 np :)
[20:16:17] <cjming>	 that was it for the queue - i'll hang around for a few minutes in case anyone else shows up for the window
[20:19:13] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:19:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage
[20:21:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:23:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage
[20:26:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:26:25] <wikibugs>	 (03PS1) 10Eevans: restbase: upgrade to Cassandra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1266387 (https://phabricator.wikimedia.org/T418417)
[20:26:28] <wikibugs>	 (03PS1) 10Eevans: aqs: upgrade to Cassandra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1266388 (https://phabricator.wikimedia.org/T418417)
[20:26:31] <wikibugs>	 (03PS1) 10Eevans: sessionstore: upgrade to Cassandra 4.1.11 [puppet] - 10https://gerrit.wikimedia.org/r/1266389 (https://phabricator.wikimedia.org/T418417)
[20:27:22] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266387 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans)
[20:27:28] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266388 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans)
[20:27:33] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266389 (https://phabricator.wikimedia.org/T418417) (owner: 10Eevans)
[20:29:55] <wikibugs>	 (03PS4) 10Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581)
[20:31:30] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:31:45] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:36:30] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:38:41] <wikibugs>	 (03PS1) 10Ottomata: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216)
[20:39:46] <wikibugs>	 (03PS2) 10Ottomata: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216)
[20:40:00] <wikibugs>	 (03PS3) 10Ottomata: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216)
[20:40:25] <wikibugs>	 (03PS4) 10Ottomata: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216)
[20:42:18] <wikibugs>	 (03PS5) 10Ottomata: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216)
[20:43:56] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3010.esams.wmnet} and A:liberica
[20:44:43] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[20:46:52] <wikibugs>	 (03Merged) 10jenkins-bot: mw-page-html-content-change-enrich - apply some tuning configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266393 (https://phabricator.wikimedia.org/T421216) (owner: 10Ottomata)
[20:47:47] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3010.esams.wmnet} and A:liberica
[20:49:05] <wikibugs>	 (03CR) 10Cwhite: opensearch: handle IP changes for software firewall (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking)
[20:49:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:50:24] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[20:50:29] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[20:52:24] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[20:52:28] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[20:53:28] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[20:53:33] <logmsgbot>	 !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[20:54:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:54:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421970#11780185 (10phaultfinder)
[20:57:15] <wikibugs>	 (03PS1) 10Ahmon Dancy: buildkitd: Bump buildkit image to wmf-v0.29.0 [puppet] - 10https://gerrit.wikimedia.org/r/1266395 (https://phabricator.wikimedia.org/T415284)
[20:58:46] <jinxer-wm>	 FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[20:59:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2100)
[21:04:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:07:29] <wikibugs>	 (03PS2) 10Bking: opensearch: handle IP changes for software firewall [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714)
[21:07:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1010.eqiad.wmnet with OS bullseye
[21:08:49] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking)
[21:11:50] <wikibugs>	 (03CR) 10Bking: opensearch: handle IP changes for software firewall (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking)
[21:14:00] <brett>	 !log Reboot lvs1013, lvs1014, lvs1015, and lvs1017 for kernel updates
[21:14:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:21:25] <wikibugs>	 (03PS1) 10Scott French: Only set the thumb step when width is given [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266406 (https://phabricator.wikimedia.org/T422074)
[21:21:54] <wikibugs>	 (03PS1) 10Scott French: Only set the thumb step when width is given [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266407 (https://phabricator.wikimedia.org/T422074)
[21:23:43] <swfrench-wmf>	 jouncebot: nowandnext
[21:23:43] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2100)
[21:23:43] <jouncebot>	 In 0 hour(s) and 36 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2200)
[21:24:22] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:28:32] <swfrench-wmf>	 FYI, CI permitting, I'll be deploying backports of https://gerrit.wikimedia.org/r/1266382 shortly for T422074
[21:28:33] <stashbot>	 T422074: PHP Warning: Undefined array key "physicalWidth" - https://phabricator.wikimedia.org/T422074
[21:30:31] <wikibugs>	 (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: 10Bking)
[21:30:38] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Add BGP sessions from mr1-eqiad to cr1/2.eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1265533 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul)
[21:32:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266406 (https://phabricator.wikimedia.org/T422074) (owner: 10Scott French)
[21:32:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266407 (https://phabricator.wikimedia.org/T422074) (owner: 10Scott French)
[21:33:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye
[21:34:18] <wikibugs>	 (03Merged) 10jenkins-bot: Only set the thumb step when width is given [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266406 (https://phabricator.wikimedia.org/T422074) (owner: 10Scott French)
[21:34:27] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host wdqs1027
[21:34:34] <wikibugs>	 (03Merged) 10jenkins-bot: Only set the thumb step when width is given [extensions/TimedMediaHandler] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266407 (https://phabricator.wikimedia.org/T422074) (owner: 10Scott French)
[21:34:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:35:00] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1266406|Only set the thumb step when width is given (T422074)]], [[gerrit:1266407|Only set the thumb step when width is given (T422074)]]
[21:35:03] <stashbot>	 T422074: PHP Warning: Undefined array key "physicalWidth" - https://phabricator.wikimedia.org/T422074
[21:36:56] <logmsgbot>	 !log swfrench@deploy1003 swfrench: Backport for [[gerrit:1266406|Only set the thumb step when width is given (T422074)]], [[gerrit:1266407|Only set the thumb step when width is given (T422074)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:38:04] <logmsgbot>	 !log swfrench@deploy1003 swfrench: Continuing with sync
[21:38:17] <swfrench-wmf>	 that'll do it
[21:40:18] <wikibugs>	 06SRE: my phab-cli test task - https://phabricator.wikimedia.org/T422088#11780341 (10jijiki)
[21:40:40] <logmsgbot>	 bking@cumin2002 reimage (PID 3772654) is awaiting input
[21:42:15] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266406|Only set the thumb step when width is given (T422074)]], [[gerrit:1266407|Only set the thumb step when width is given (T422074)]] (duration: 07m 15s)
[21:42:18] <stashbot>	 T422074: PHP Warning: Undefined array key "physicalWidth" - https://phabricator.wikimedia.org/T422074
[21:51:32] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:53:56] <wikibugs>	 (03PS1) 10Papaul: Remove "emporary replace ospf" to test bgp [homer/public] - 10https://gerrit.wikimedia.org/r/1266429 (https://phabricator.wikimedia.org/T421238)
[21:57:45] <wikibugs>	 (03PS2) 10Papaul: Remove temporary "replace ospf" to test bgp [homer/public] - 10https://gerrit.wikimedia.org/r/1266429 (https://phabricator.wikimedia.org/T421238)
[21:59:33] <wikibugs>	 (03CR) 10Clare Ming: [C:04-1] "punting on this for now until we think through implications of this some more" [puppet] - 10https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T408186) (owner: 10Clare Ming)
[21:59:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421970#11780384 (10phaultfinder)
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2200)
[22:00:54] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Remove temporary "replace ospf" to test bgp [homer/public] - 10https://gerrit.wikimedia.org/r/1266429 (https://phabricator.wikimedia.org/T421238) (owner: 10Papaul)
[22:01:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs1027 - bking@cumin2002"
[22:01:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wdqs1027 - bking@cumin2002"
[22:01:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:01:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache wdqs1027.eqiad.wmnet 98.32.64.10.in-addr.arpa 8.9.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[22:01:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wdqs1027.eqiad.wmnet 98.32.64.10.in-addr.arpa 8.9.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[22:01:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1027
[22:03:54] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:04:06] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal-scholarly_443: Servers wdqs1027.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:05:15] <inflatador>	 ^^ expected
[22:05:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1027
[22:05:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wdqs1027
[22:06:11] <jinxer-wm>	 FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
[22:07:12] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal-scholarly,name=codfw
[22:09:10] <wikibugs>	 (03PS1) 10Ladsgroup: Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266442 (https://phabricator.wikimedia.org/T421914)
[22:09:24] <wikibugs>	 (03PS1) 10Ladsgroup: Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266443 (https://phabricator.wikimedia.org/T421914)
[22:09:37] <Amir1>	 jouncebot: nowandnext
[22:09:37] <jouncebot>	 For the next 0 hour(s) and 50 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260401T2200)
[22:09:37] <jouncebot>	 In 7 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600)
[22:09:37] <jouncebot>	 In 7 hour(s) and 50 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600)
[22:09:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-eqiad and  (2620:0:861:fe04::1) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr1-eqiad:9804&var-bgp_group=Management&var-bgp_neighbor= - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:09:53] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266442 (https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup)
[22:09:57] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266443 (https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup)
[22:14:39] <jinxer-wm>	 FIRING: [6x] CoreBGPDown: Core BGP session down between cr1-eqiad and  (2620:0:861:fe04::1) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:15:01] <papaul>	 ^that is me
[22:24:13] <wikibugs>	 (03Merged) 10jenkins-bot: Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1266442 (https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup)
[22:24:22] <wikibugs>	 (03Merged) 10jenkins-bot: Deferred: Fix function to get virtual domain [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1266443 (https://phabricator.wikimedia.org/T421914) (owner: 10Ladsgroup)
[22:26:24] <icinga-wm>	 PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[22:27:22] <icinga-wm>	 PROBLEM - Host ps1-603-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[22:27:24] <icinga-wm>	 PROBLEM - Host ps1-604-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[22:27:32] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1266443|Deferred: Fix function to get virtual domain (T421914 T398709)]], [[gerrit:1266442|Deferred: Fix function to get virtual domain (T421914 T398709)]]
[22:27:36] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[22:27:37] <stashbot>	 T398709: FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster - https://phabricator.wikimedia.org/T398709
[22:28:52] <icinga-wm>	 PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[22:29:05] <papaul>	 me ^
[22:29:29] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1266443|Deferred: Fix function to get virtual domain (T421914 T398709)]], [[gerrit:1266442|Deferred: Fix function to get virtual domain (T421914 T398709)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:29:39] <jinxer-wm>	 FIRING: [10x] CoreBGPDown: Core BGP session down between cr1-eqiad and  (2620:0:861:fe04::1) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:29:56] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[22:31:00] <Jdlrobson>	 Amir1: let me know when you are done deploying! thanks in advance!
[22:31:19] <Amir1>	 sure! almost done
[22:32:08] <icinga-wm>	 RECOVERY - Host ps1-603-eqsin is UP: PING OK - Packet loss = 0%, RTA = 254.30 ms
[22:32:08] <icinga-wm>	 RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 250.63 ms
[22:32:08] <icinga-wm>	 RECOVERY - Host ps1-604-eqsin is UP: PING OK - Packet loss = 0%, RTA = 246.82 ms
[22:33:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1027.eqiad.wmnet with reason: host reimage
[22:33:54] <icinga-wm>	 RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 249.39 ms
[22:34:08] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266443|Deferred: Fix function to get virtual domain (T421914 T398709)]], [[gerrit:1266442|Deferred: Fix function to get virtual domain (T421914 T398709)]] (duration: 06m 37s)
[22:34:13] <stashbot>	 T421914: Test links virtual domain split on testcommonswiki - https://phabricator.wikimedia.org/T421914
[22:34:13] <stashbot>	 T398709: FY2025-26 WE 6.4.1: Move links tables of commons to a dedicated cluster - https://phabricator.wikimedia.org/T398709
[22:34:39] <jinxer-wm>	 FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-eqiad and mr1-eqiad (208.80.154.204) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:35:16] <Amir1>	 Jdlrobson: I'm done. The floor is yours!
[22:38:12] <Jdlrobson>	 thanks Amir1 
[22:38:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1265482 (https://phabricator.wikimedia.org/T420348) (owner: LorenMora)
[22:38:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1027.eqiad.wmnet with reason: host reimage
[22:39:12] <wikibugs>	 (Merged) jenkins-bot: Legal Footer Link Deploys [mediawiki-config] - https://gerrit.wikimedia.org/r/1265482 (https://phabricator.wikimedia.org/T420348) (owner: LorenMora)
[22:39:37] <logmsgbot>	 !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1265482|Legal Footer Link Deploys (T420348)]]
[22:39:40] <stashbot>	 T420348: Footer link deployments - arwiki, cawiki,  fawiki, rowiki, ruwiki, trwiki - https://phabricator.wikimedia.org/T420348
[22:41:37] <logmsgbot>	 !log jdlrobson@deploy1003 lmora, jdlrobson: Backport for [[gerrit:1265482|Legal Footer Link Deploys (T420348)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:43:53] <logmsgbot>	 !log jdlrobson@deploy1003 lmora, jdlrobson: Continuing with sync
[22:48:02] <logmsgbot>	 !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1265482|Legal Footer Link Deploys (T420348)]] (duration: 08m 25s)
[22:48:06] <stashbot>	 T420348: Footer link deployments - arwiki, cawiki,  fawiki, rowiki, ruwiki, trwiki - https://phabricator.wikimedia.org/T420348
[22:48:11] <swfrench-wmf>	 !log removed unused image-suggestion service in codfw - T368096
[22:48:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:14] <stashbot>	 T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096
[22:48:15] <Jdlrobson>	 all done!
[22:58:20] <swfrench-wmf>	 !log removed unused image-suggestion service in eqiad - T368096
[22:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:24] <stashbot>	 T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096
[22:59:03] <wikibugs>	 (PS3) KineticPelagic: REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619)
[23:03:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1027.eqiad.wmnet with OS bullseye
[23:03:01] <wikibugs>	 (PS4) KineticPelagic: REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619)
[23:19:39] <jinxer-wm>	 RESOLVED: [5x] CoreBGPDown: Core BGP session down between cr1-eqiad and mr1-eqiad (208.80.154.204) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:20:11] <wikibugs>	 (PS1) Papaul: The peering IP's were wrong update all IP's [homer/public] - https://gerrit.wikimedia.org/r/1266476 (https://phabricator.wikimedia.org/T421238)
[23:23:07] <wikibugs>	 (CR) Papaul: [C:+2] The peering IP's were wrong update all IP's [homer/public] - https://gerrit.wikimedia.org/r/1266476 (https://phabricator.wikimedia.org/T421238) (owner: Papaul)
[23:28:31] <jinxer-wm>	 RESOLVED: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards   - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[23:41:57] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1266482
[23:42:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1266482 (owner: TrainBranchBot)
[23:54:10] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1266482 (owner: TrainBranchBot)