[00:00:06] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [00:00:20] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [00:00:35] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [00:00:51] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [00:01:12] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [00:01:40] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [00:01:59] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [00:02:09] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [00:02:31] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [00:02:42] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [00:03:17] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [00:03:29] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [00:03:57] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [00:03:59] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [00:04:08] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [00:04:17] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [00:04:21] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [00:04:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host dse-k8s-worker1021.eqiad.wmnet with OS bookworm [00:04:29] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [00:04:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11747229 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host dse-k8s-work... [00:04:43] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [00:04:55] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [00:05:06] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [00:05:26] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [00:05:38] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [00:05:50] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [00:05:55] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [00:06:05] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [00:06:07] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [00:06:12] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [00:06:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [00:06:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1022.eqiad.wmnet with OS bookworm [00:06:24] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [00:06:30] !log rzl@deploy2002 
helmfile [staging] START helmfile.d/services/kartotherian: apply [00:06:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11747231 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host dse-k8s-work... [00:06:43] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [00:06:49] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [00:07:01] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [00:07:29] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [00:09:59] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [00:10:21] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/media-analytics: apply [00:10:58] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [00:11:29] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [00:12:30] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:12:39] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [00:13:00] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [00:13:16] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [00:13:23] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [00:13:33] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [00:13:45] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [00:14:08] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [00:14:12] !log jclark@cumin1003 START 
- Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:15:00] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply
[00:15:06] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply
[00:15:16] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[00:15:17] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply
[00:15:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm
[00:15:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11747243 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host dse-k8s-work...
[00:15:40] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [00:15:48] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [00:15:55] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [00:16:05] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [00:16:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11747244 (10Jclark-ctr) [00:16:25] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [00:16:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11747245 (10Jclark-ctr) 05Open→03Resolved [00:16:36] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [00:16:44] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [00:17:12] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [00:17:21] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [00:17:33] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [00:17:47] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [00:17:59] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [00:18:11] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [00:18:23] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [00:18:46] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [00:18:57] !log rzl@deploy2002 
helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [00:19:25] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [00:19:51] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [00:20:00] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [00:20:11] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [00:20:23] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [00:20:34] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [00:20:43] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [00:21:07] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [00:21:27] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [00:21:44] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [00:22:09] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [00:22:29] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [00:22:37] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [00:23:11] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [00:23:16] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [00:23:27] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [00:33:09] !log rzl@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [00:36:25] hm, the miscweb update finished in aux-k8s-eqiad, and also started and finished in codfw [00:36:36] all's well except logmsgbot might be out to lunch [00:38:12] ah [00:42:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
10https://gerrit.wikimedia.org/r/1260179 [00:42:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1260179 (owner: 10TrainBranchBot) [00:55:37] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1260179 (owner: 10TrainBranchBot) [00:57:27] (03PS2) 10Scardenasmolinar: PersonalDashboard: Add config for Active Discussions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256498 (https://phabricator.wikimedia.org/T420785) [01:12:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1260186 [01:12:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1260186 (owner: 10TrainBranchBot) [01:18:07] (03PS3) 10Scott French: mw-*: Use envoy drain configuration everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260096 (https://phabricator.wikimedia.org/T364245) [01:18:07] (03CR) 10Scott French: "Image bumped in I31a6e808ba3e02a0773ce9b8f89938356a6a6964." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260096 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [01:24:09] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1260186 (owner: 10TrainBranchBot) [02:02:55] 06SRE, 07Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? 
- https://phabricator.wikimedia.org/T421168#11747359 (10Peachey88) [02:08:26] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:26] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:04:03] (03PS1) 10RLazarus: mw-on-k8s: Add 95% paging alert on php-fpm worker saturation [alerts] - 10https://gerrit.wikimedia.org/r/1260231 (https://phabricator.wikimedia.org/T420679) [03:10:36] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4581.99 ms [03:11:00] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 212.96 ms [03:13:03] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:18:03] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:51:12] FIRING: [2x] 
CertAlmostExpired: Certificate for service fasw2-c8a-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:57:40] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:00:05] arnaudb : I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Gerrit / CDN deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T0500).
[05:23:46] (03CR) 10Kevin Bazira: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259985 (https://phabricator.wikimedia.org/T421105) (owner: 10Klausman)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T0600)
[06:00:04] arnaudb : #bothumor My software never has bugs. It just develops random features. Rise for Gerrit / CDN. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T0500).
[06:14:05] (03CR) 10Arnaudb: [C:03+2] gerrit: use Envoy on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259944 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [06:21:18] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-parsoid_4452: Servers wikikube-worker2170.codfw.wmnet, wikikube-worker2262.codfw.wmnet, wikikube-worker2278.codfw.wmnet, wikikube-worker2148.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2282.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2191.codfw.wmnet, wikikube-worker2185.codfw.wmnet, wikikube-worker2091.codfw.wmne [06:21:18] ube-worker2071.codfw.wmnet, wikikube-worker2165.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2190.codfw.wmnet, wikikube-worker2215.codfw.wmnet, wikikube-worker2287.codfw.wmnet, wikikube-worker2059.codfw.wmnet, wikikube-worker2248.codfw.wmnet, wikikube-worker2273.codfw.wmnet, wikikube-worker2139.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, wikikube-worker2281.codfw.wmnet, wikikube-worker204 [06:21:18] wmnet, wikikube-worker2159.codfw.wmnet, wikikube-worker2313.codfw.wmnet, wikikube-worker2206.codfw.wmnet, wikikube-worker2251.codfw.wmnet, wikikube-worker2062.codfw.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [06:22:46] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-parsoid_4452: Servers wikikube-worker2262.codfw.wmnet, wikikube-worker2141.codfw.wmnet, wikikube-worker2174.codfw.wmnet, wikikube-worker2311.codfw.wmnet, wikikube-worker2046.codfw.wmnet, wikikube-worker2191.codfw.wmnet, wikikube-worker2172.codfw.wmnet, wikikube-worker2036.codfw.wmnet, wikikube-worker2252.codfw.wmnet, wikikube-worker2150.codfw.wmne [06:22:46] ube-worker2113.codfw.wmnet, wikikube-worker2136.codfw.wmnet, wikikube-worker2179.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2320.codfw.wmnet, wikikube-worker2108.codfw.wmnet, 
wikikube-worker2274.codfw.wmnet, wikikube-worker2176.codfw.wmnet, wikikube-worker2165.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2190.codfw.wmnet, wikikube-worker2177.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker228 [06:22:46] wmnet, wikikube-worker2161.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube-worker2273.codfw.wmnet, wikikube-worker2213.codfw.wmnet, wikikube-worker2297.codfw.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [06:33:28] (03CR) 10Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb) [06:34:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=restbase.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:37:23] what's up restbase? 
[06:39:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:43:20] (03PS1) 10Ryan Kemper: query_service: add prom metrics for auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/1260558 (https://phabricator.wikimedia.org/T242453)
[06:44:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:57:57] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260558 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper)
[07:00:04] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:52] (03PS1) 10MVernon: Revert "trafficserver: Add api.w.o to gateway-check.lua.conf" [puppet] - 10https://gerrit.wikimedia.org/r/1260559 [07:01:32] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Revert "trafficserver: Add api.w.o to gateway-check.lua.conf" [puppet] - 10https://gerrit.wikimedia.org/r/1260559 (owner: 10MVernon) [07:01:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:01:40] (03CR) 10Ayounsi: [C:03+1] "lgtm as soon as CI is happy" [puppet] - 10https://gerrit.wikimedia.org/r/1260559 (owner: 10MVernon) [07:11:48] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:12:18] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:14:51] RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from restbase.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:16:20] (03PS2) 10Ryan Kemper: query_service: add prom metrics for auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/1260558 (https://phabricator.wikimedia.org/T242453) [07:16:28] (03PS3) 10Ryan Kemper: query_service: add prom metrics for auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/1260558 (https://phabricator.wikimedia.org/T242453) [07:16:47] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260558 (https://phabricator.wikimedia.org/T242453) (owner: 10Ryan Kemper) [07:41:37] (03PS4) 10Jdlrobson: Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) 
(owner: 10STran) [07:41:52] (03CR) 10CI reject: [V:04-1] Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [07:42:20] (03CR) 10Kosta Harlan: [C:03+1] Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [07:45:08] (03CR) 10Kosta Harlan: [C:03+1] Deploy temporary accounts to ruwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [07:48:17] (03PS5) 10Kosta Harlan: Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [07:48:19] (03CR) 10Kosta Harlan: Deploy temporary accounts to ruwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [07:48:33] (03CR) 10jenkins-bot: Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [07:51:12] FIRING: [2x] CertAlmostExpired: Certificate for service fasw2-c8a-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:57:40] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:05] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T0800) [08:00:27] o/ [08:02:45] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260604 
(https://phabricator.wikimedia.org/T420479) [08:02:48] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260604 (https://phabricator.wikimedia.org/T420479) (owner: 10TrainBranchBot) [08:03:47] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260604 (https://phabricator.wikimedia.org/T420479) (owner: 10TrainBranchBot) [08:10:20] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.21 refs T420479 [08:10:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260045 (owner: 10DCausse) [08:10:28] T420479: 1.46.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T420479 [08:11:44] there is some `[{reqId}] {exception_url} PHP Deprecated: Use of MediaWiki\Parser\ParserOutput::setOutputFlag with non-standard flag was deprecated in MediaWiki 1.45. 
[Called from MediaWiki\Parser\ParserOutput::initFromJson]` [08:11:51] I'll let it flow and file a task for it [08:14:21] (03CR) 10Elukey: [C:03+1] Add support for Python 3.14 [software/cumin] - 10https://gerrit.wikimedia.org/r/1260081 (owner: 10Volans) [08:15:14] (03CR) 10Elukey: [C:03+1] openstack backend: add support for a proxy [software/cumin] - 10https://gerrit.wikimedia.org/r/1260082 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [08:21:16] (03CR) 10Hashar: "Note: the CI image does not have Python 3.14:" [software/cumin] - 10https://gerrit.wikimedia.org/r/1260081 (owner: 10Volans) [08:28:09] filed as https://phabricator.wikimedia.org/T421206 [08:28:24] it was only ~ 85 logs [08:28:56] 06SRE, 06Traffic, 07Wikimedia-Incident: Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count - https://phabricator.wikimedia.org/T421207 (10MatthewVernon) 03NEW [08:29:31] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device fasw2-c8a-codfw [08:29:38] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-c8a-codfw [08:29:46] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device fasw2-c8b-codfw [08:29:53] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-c8b-codfw [08:30:32] (03PS1) 10Awight: Remove unused feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260611 (https://phabricator.wikimedia.org/T385666) [08:30:57] FIRING: [2x] CertAlmostExpired: Certificate for service fasw2-c8a-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:32:18] 06SRE, 10RESTBase, 07Wikimedia-Incident: Dead links at https://wikitech.wikimedia.org/wiki/RESTBase#Analytics_and_metrics - https://phabricator.wikimedia.org/T421208 (10MatthewVernon) 03NEW [08:32:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on 
netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:32:41] (03PS1) 10Elukey: Add new aux-k8s-worker200[6-9] in service [puppet] - 10https://gerrit.wikimedia.org/r/1260613 (https://phabricator.wikimedia.org/T393054) [08:33:32] (03PS1) 10Awight: [beta] Kill synthetic refs with feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260614 (https://phabricator.wikimedia.org/T421055) [08:34:08] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker1006.eqiad.wmnet,cluster=kubernetes,service=kubesvc [08:34:22] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker1007.eqiad.wmnet,cluster=kubernetes,service=kubesvc [08:34:27] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker1008.eqiad.wmnet,cluster=kubernetes,service=kubesvc [08:34:32] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker1009.eqiad.wmnet,cluster=kubernetes,service=kubesvc [08:34:40] (03CR) 10Brouberol: [C:03+1] Add new aux-k8s-worker200[6-9] in service [puppet] - 10https://gerrit.wikimedia.org/r/1260613 (https://phabricator.wikimedia.org/T393054) (owner: 10Elukey) [08:34:50] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=aux-k8s-worker1006.eqiad.wmnet,cluster=kubernetes,service=kubesvc [08:34:55] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=aux-k8s-worker1007.eqiad.wmnet,cluster=kubernetes,service=kubesvc [08:35:00] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=aux-k8s-worker1008.eqiad.wmnet,cluster=kubernetes,service=kubesvc [08:35:05] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=aux-k8s-worker1009.eqiad.wmnet,cluster=kubernetes,service=kubesvc [08:35:26] 
(03CR) 10Volans: "Python 3.14 is required for upstream Debian where experimental has already 3.14 and most likely that is the version that will be pushed to" [software/cumin] - 10https://gerrit.wikimedia.org/r/1260081 (owner: 10Volans) [08:35:57] RESOLVED: [2x] CertAlmostExpired: Certificate for service fasw2-c8a-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:38:29] (03CR) 10Elukey: [C:03+2] Add new aux-k8s-worker200[6-9] in service [puppet] - 10https://gerrit.wikimedia.org/r/1260613 (https://phabricator.wikimedia.org/T393054) (owner: 10Elukey) [08:43:14] (03PS1) 10Mszwarc: Allow for demoting 2FA-less members of further 6 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260617 (https://phabricator.wikimedia.org/T418580) [08:45:12] (03CR) 10Hashar: "What I implies is that I will had python3.14 to the image the same way I did to add python 3.13 last year with I9e64399cf8dad57ce6e40b33ec" [software/cumin] - 10https://gerrit.wikimedia.org/r/1260081 (owner: 10Volans) [08:47:22] PROBLEM - MariaDB Replica Lag: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 557.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:49:41] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Remove unused feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260611 (https://phabricator.wikimedia.org/T385666) (owner: 10Awight) [08:50:39] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] [beta] Kill synthetic refs with feature flag (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260614 (https://phabricator.wikimedia.org/T421055) (owner: 10Awight) [08:51:19] (03CR) 10Volans: "That's great! Thanks a lot." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/1260081 (owner: 10Volans) [08:54:20] (03CR) 10Rubah Hitam Vukova: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260617 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc) [08:55:47] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=aux-k8s-worker200[6-9].eqiad.wmnet,cluster=kubernetes,service=kubesvc [08:55:59] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker200[6-9].eqiad.wmnet,cluster=kubernetes,service=kubesvc [09:03:19] (03CR) 10Filippo Giunchedi: [C:03+1] "Neat" [software/cumin] - 10https://gerrit.wikimedia.org/r/1260082 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [09:03:37] (03CR) 10Zabe: dumpInterwiki: Re-generate to add Abstract Wikipedia (and others) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259993 (https://phabricator.wikimedia.org/T420654) (owner: 10Jforrester) [09:04:06] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=aux-k8s-worker100[6-9].eqiad.wmnet,cluster=aux-k8s,service=kubesvc [09:04:18] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=aux-k8s-worker200[6-9].codfw.wmnet,cluster=aux-k8s,service=kubesvc [09:04:59] (03CR) 10Mszwarc: dumpInterwiki: Re-generate to add Abstract Wikipedia (and others) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259993 (https://phabricator.wikimedia.org/T420654) (owner: 10Jforrester) [09:05:49] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=aux-k8s-worker200[2-5].codfw.wmnet,cluster=aux-k8s,service=kubesvc [09:05:55] 06SRE, 10RESTBase, 07Documentation, 07Sustainability (Incident Followup): Dead links at https://wikitech.wikimedia.org/wiki/RESTBase#Analytics_and_metrics - https://phabricator.wikimedia.org/T421208#11747921 (10Aklapper) This ticket does not cover an incident. 
[09:05:58] 06SRE, 06Traffic, 07Documentation, 07Sustainability (Incident Followup): Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count - https://phabricator.wikimedia.org/T421207#11747923 (10Aklapper) This ticket does not cover an incident. [09:07:22] PROBLEM - MariaDB Replica SQL: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:07:22] PROBLEM - MariaDB Replica IO: s3 on clouddb1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:08:26] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:10:39] 10ops-eqiad, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11747934 (10elukey) 05Open→03Resolved [09:22:29] (03CR) 10Zabe: dumpInterwiki: Re-generate to add Abstract Wikipedia (and others) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259993 (https://phabricator.wikimedia.org/T420654) (owner: 10Jforrester) [09:26:33] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11747976 (10kera_wmde) @KFrancis thank you! I just signed the NDA and sent it back.
[09:27:18] (03PS1) 10Elukey: Add missing Prometheus and Grafana configs for k8s-aux [puppet] - 10https://gerrit.wikimedia.org/r/1260628 (https://phabricator.wikimedia.org/T358189) [09:30:18] (03PS1) 10Brouberol: Revert^2 "kafka-main-codfw: disable mirroring to kafka-main-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1260629 [09:30:22] (03PS1) 10Brouberol: Revert^2 "kafka-main-eqiad: disable mirroring to kafka-main-codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1260630 [09:30:32] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:30:34] (03PS1) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1260624 (https://phabricator.wikimedia.org/T418145) [09:30:51] (03CR) 10Elukey: [C:03+1] Revert^2 "kafka-main-eqiad: disable mirroring to kafka-main-codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1260630 (owner: 10Brouberol) [09:31:02] (03CR) 10Elukey: [C:03+1] Revert^2 "kafka-main-codfw: disable mirroring to kafka-main-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1260629 (owner: 10Brouberol) [09:34:49] (03PS7) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) [09:35:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=kartotherian.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:36:00] uh oh [09:36:18] That's not me this time [09:38:31] 
(03PS2) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1260624 (https://phabricator.wikimedia.org/T418145) [09:38:31] (03PS9) 10Clément Goubert: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146) [09:38:31] (03PS4) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146) [09:38:50] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260637 (https://phabricator.wikimedia.org/T360794) [09:38:54] (03CR) 10Brouberol: [C:03+2] Revert^2 "kafka-main-eqiad: disable mirroring to kafka-main-codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1260630 (owner: 10Brouberol) [09:40:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [09:42:11] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260637 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [09:43:16] (03CR) 10JavierMonton: [C:03+1] Temporarily suspend the flink applications running in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259973 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis) [09:43:38] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:43:38] PROBLEM - Kafka MirrorMaker 
main-codfw_to_main-eqiad@0 on kafka-main1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:43:38] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:43:38] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:44:11] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260637 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [09:44:38] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:44:56] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:45:09] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:45:49] !log brouberol@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:46:09] !log brouberol@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:47:13] !log javiermonton@deploy2002 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:47:39] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:48:13] (03CR) 10Brouberol: [C:03+2] Revert^2 "kafka-main-codfw: disable mirroring to kafka-main-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1260629 (owner: 10Brouberol) [09:51:33] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:51:42] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:52:26] !log brouberol@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:52:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:52:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:52:46] !log brouberol@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:53:24] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:53:24] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2010 is CRITICAL: PROCS CRITICAL: 0 processes with command 
name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:53:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:53:45] (03PS8) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) [09:54:12] (03CR) 10Brouberol: [C:03+2] kafka-jumbo-eqiad: disable mirroring from kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255658 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [09:54:48] (03CR) 10JMeybohm: [C:03+1] trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1260624 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [09:55:15] (03CR) 10JMeybohm: [C:03+1] trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert) [09:57:50] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:58:01] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:58:26] FIRING: JobUnavailable: Reduced availability for job jmx_kafka_mirrormaker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:59:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1011 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:59:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:59:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:59:39] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:59:40] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:59:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:59:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, 
regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [09:59:43] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T1000) [10:00:16] looking at the kafka mirrormaker alerts [10:00:24] this is me, I've disabled them on the kafka hosts [10:01:16] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [10:03:26] FIRING: [2x] JobUnavailable: Reduced availability for job jmx_kafka_mirrormaker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:03:52] ok I think these alerts are being phased out in icinga [10:07:21] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:07:33] (03CR) 10Hashar: "recheck after having added Python 3.14 to the images ( Ia49d266c5b0bc59968c29d4ad57ebcc36fec525a ) and switching the jobs to them ( I2b6d8" [software/cumin] - 10https://gerrit.wikimedia.org/r/1260081 (owner: 10Volans) [10:10:39] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:10:51] RESOLVED: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging -
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:11:24] (03PS1) 10Dpogorzelski: Weekly rebuild of cert-manager - 20260322 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260640 [10:12:03] (03Abandoned) 10Dpogorzelski: Weekly rebuild of cert-manager - 20260322 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260640 (owner: 10Dpogorzelski) [10:12:21] RECOVERY - haproxy failover on dbproxy1028 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:13:34] (03CR) 10Klausman: [V:03+2 C:03+2] admin_ng/knative-serving: enable emptyDir feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259985 (https://phabricator.wikimedia.org/T421105) (owner: 10Klausman) [10:15:14] (03CR) 10Klausman: [C:03+1] knative: update images to 1.21.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [10:15:43] (03PS2) 10Dpogorzelski: knative: update images to 1.21.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722) [10:17:02] (03CR) 10Dpogorzelski: "removed domain-mapping and domain-mapping-webhook since these were merged into webhook and controller cmds" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [10:17:41] RECOVERY - haproxy failover on dbproxy1029 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:19:55] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:20:30] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:20:31] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[10:20:32] RESOLVED: [2x] JobUnavailable: Reduced availability for job jmx_kafka_mirrormaker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:21:27] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:22:22] (03Merged) 10jenkins-bot: admin_ng/knative-serving: enable emptyDir feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259985 (https://phabricator.wikimedia.org/T421105) (owner: 10Klausman) [10:25:07] (03CR) 10Dpogorzelski: [C:03+1] admin_ng/knative-serving: enable emptyDir feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259985 (https://phabricator.wikimedia.org/T421105) (owner: 10Klausman) [10:25:39] (03CR) 10Volans: [C:03+2] Add support for Python 3.14 [software/cumin] - 10https://gerrit.wikimedia.org/r/1260081 (owner: 10Volans) [10:26:02] (03CR) 10Volans: [C:03+2] openstack backend: add support for a proxy [software/cumin] - 10https://gerrit.wikimedia.org/r/1260082 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [10:26:34] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:26:44] (03CR) 10Elukey: [C:03+1] "Please make sure you build this locally before merging :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [10:27:33] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:30:50] (03PS1) 10Vgutierrez: lvs::realserver: Monitor IPIP realservers [puppet] - 10https://gerrit.wikimedia.org/r/1260644 (https://phabricator.wikimedia.org/T419873) [10:31:11] PS1? 
that's wrong [10:31:17] (03CR) 10Dpogorzelski: "all builds" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [10:31:25] (03CR) 10CI reject: [V:04-1] lvs::realserver: Monitor IPIP realservers [puppet] - 10https://gerrit.wikimedia.org/r/1260644 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [10:31:28] (03CR) 10Dpogorzelski: [C:03+2] knative: update images to 1.21.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [10:31:35] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] knative: update images to 1.21.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [10:32:30] (03PS1) 10MVernon: tegola/kartotherian - double replicas for single-DC use [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260646 [10:33:39] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:34:29] (03CR) 10Rubah Hitam Vukova: [C:03+1] Allow for demoting 2FA-less members of further 6 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260617 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc) [10:36:14] (03Abandoned) 10Vgutierrez: lvs::realserver: Monitor IPIP realservers [puppet] - 10https://gerrit.wikimedia.org/r/1260644 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [10:40:26] (03CR) 10David Caro: "Though it has less features than stern, it has enough to be useful." 
[puppet] - 10https://gerrit.wikimedia.org/r/1260072 (owner: 10David Caro) [10:44:23] RECOVERY - MariaDB Replica IO: s3 on clouddb1022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:44:23] RECOVERY - MariaDB Replica SQL: s3 on clouddb1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:45:20] (03CR) 10Rubah Hitam Vukova: [C:03+1] idwiki: Remove unused user groups on Indonesian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [10:45:30] (03Merged) 10jenkins-bot: Add support for Python 3.14 [software/cumin] - 10https://gerrit.wikimedia.org/r/1260081 (owner: 10Volans) [10:45:31] (03Merged) 10jenkins-bot: openstack backend: add support for a proxy [software/cumin] - 10https://gerrit.wikimedia.org/r/1260082 (https://phabricator.wikimedia.org/T420360) (owner: 10Volans) [10:45:47] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1022.eqiad.wmnet,service=s3 [10:45:49] (03CR) 10Clément Goubert: [C:03+1] mw-on-k8s: Add 95% paging alert on php-fpm worker saturation [alerts] - 10https://gerrit.wikimedia.org/r/1260231 (https://phabricator.wikimedia.org/T420679) (owner: 10RLazarus) [10:46:25] RECOVERY - MariaDB Replica Lag: s3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:53:26] !log fceratto@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis abstractwiki in section s5 [10:55:44] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:56:53] (03CR) 10Vgutierrez: [C:03+1] trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1260624 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [10:58:16] fceratto@cumin1003 sanitize-wiki (PID 1928642) is awaiting input [10:58:22] (03PS2) 10MVernon: tegola/kartotherian - increase replicas for single-DC use [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260646 [10:59:24] (03PS1) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [10:59:26] (03PS1) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [10:59:47] (03CR) 10Clément Goubert: [C:03+1] tegola/kartotherian - increase replicas for single-DC use [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260646 (owner: 10MVernon) [11:00:05] mvolz: Time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T1100). 
[11:00:53] (03CR) 10MVernon: [C:03+2] tegola/kartotherian - increase replicas for single-DC use [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260646 (owner: 10MVernon) [11:01:29] (03PS2) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [11:01:29] (03PS2) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [11:01:32] (03CR) 10CI reject: [V:04-1] thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [11:01:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:01:54] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1260624 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [11:03:15] (03PS3) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [11:03:15] (03PS3) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [11:05:34] !log mvernon@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [11:07:32] !log mvernon@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply [11:07:41] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis 
abstractwiki in section s5 [11:07:55] !log mvernon@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: apply [11:09:59] (03PS4) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [11:09:59] (03PS4) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [11:10:19] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [11:11:43] (03CR) 10JMeybohm: [C:03+2] wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [11:14:49] !log mvernon@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [11:15:36] !log mvernon@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [11:15:53] !log mvernon@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [11:16:24] !log mvernon@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [11:18:35] !log mvernon@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply [11:19:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260614 (https://phabricator.wikimedia.org/T421055) (owner: 10Awight) [11:21:03] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1022.eqiad.wmnet,service=s3 [11:23:30] (03PS4) 10JMeybohm: wikikube: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1259996 
(https://phabricator.wikimedia.org/T420436) [11:23:30] (03PS1) 10JMeybohm: wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1260654 (https://phabricator.wikimedia.org/T420436) [11:24:23] PROBLEM - MariaDB Replica Lag: s3 on clouddb1023 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 566.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:24:45] (03CR) 10JMeybohm: [C:03+2] "FTR: I managed to forget to actually include the ipip profile in the role. Follow up at I52af5c2a03368e3b943152bebe7e09dafb82ff14" [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [11:24:56] (03CR) 10JMeybohm: [C:03+2] wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1260654 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [11:36:26] (03PS1) 10Dpogorzelski: knative: update changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260656 (https://phabricator.wikimedia.org/T419722) [11:36:35] !log mvernon@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: apply [11:38:09] (03PS2) 10Dpogorzelski: knative: update changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260656 (https://phabricator.wikimedia.org/T419722) [11:38:16] !log mvernon@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply [11:40:22] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1002.eqiad.wmnet [11:40:23] (03CR) 10JMeybohm: [C:03+2] wikikube: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1259996 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [11:41:41] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: wikikube-master-eqiad@eqiad [11:43:53] !log jayme@cumin1003 START - Cookbook 
sre.loadbalancer.migrate-service-ipip for alias: wikikube-master-codfw@codfw [11:45:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet [11:45:35] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:46:35] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:46:35] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: wikikube-master-eqiad@eqiad [11:48:03] (03CR) 10A smart kitten: Phabricator: Remove unused fixed_settings.yaml stuff; update README (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130323 (https://phabricator.wikimedia.org/T239355) (owner: 10Aklapper) [11:48:26] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:48:44] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:49:51] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:49:51] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: wikikube-master-codfw@codfw [11:51:14] !log migrated wikikube apiservers (eqiad and codfw) to IPIP - T420436 [11:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:20] T420436: Migrate 
Wikikube k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420436 [11:51:26] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl2001.codfw.wmnet [11:52:59] !log Restart clouddb1022:s3 to enable error_log T420177 [11:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:06] T420177: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177 [11:56:10] (03PS1) 10Hashar: ci: use docker.io package starting with Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) [11:56:26] 06SRE, 10Maps, 07Sustainability (Incident Followup): Kartotherian dashboard links don't work - https://phabricator.wikimedia.org/T421226 (10MatthewVernon) 03NEW [11:56:28] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl2001.codfw.wmnet [11:56:42] (03PS1) 10Kevin Bazira: ml-services: add shared memory volume to gpt isvc for NCCL communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260660 (https://phabricator.wikimedia.org/T421105) [11:59:16] (03CR) 10Dpogorzelski: [C:03+2] knative: update changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260656 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [11:59:19] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] knative: update changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260656 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [12:01:24] (03CR) 10JMeybohm: [C:03+1] "looks reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/1260628 (https://phabricator.wikimedia.org/T358189) (owner: 10Elukey) [12:02:39] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl2002.codfw.wmnet [12:05:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260617 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc) [12:06:48] (03CR) 10Klausman: [C:03+1] ml-services: add shared memory volume to gpt isvc for NCCL communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260660 (https://phabricator.wikimedia.org/T421105) (owner: 10Kevin Bazira) [12:07:14] (03Merged) 10jenkins-bot: Allow for demoting 2FA-less members of further 6 groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260617 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc) [12:07:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl2002.codfw.wmnet [12:07:45] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host wdqs2009.codfw.wmnet [12:07:52] (03PS10) 10Majavah: wmflib: hosts2ips: Allow passing in IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1211650 [12:08:59] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1260617|Allow for demoting 2FA-less members of further 6 groups (T418580)]] [12:09:05] T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580 [12:11:07] (03CR) 10Majavah: [C:03+2] wmflib: hosts2ips: Allow passing in IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1211650 (owner: 10Majavah) [12:11:19] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1260617|Allow for demoting 2FA-less members of further 6 groups (T418580)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:12:28] !log mszwarc@deploy2002 mszwarc: Continuing with sync [12:13:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2009.codfw.wmnet [12:14:25] PROBLEM - MariaDB Replica Lag: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 559.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:14:44] (03PS15) 10JHathaway: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [12:15:06] (03PS2) 10Kevin Bazira: ml-services: add shared memory volume to gpt isvc for NCCL communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260660 (https://phabricator.wikimedia.org/T421105) [12:15:53] (03PS4) 10Btullis: Route dse-k8s API blackbox checks to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) [12:16:16] (03CR) 10Majavah: [C:03+2] nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [12:16:19] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: 10Btullis) [12:19:22] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1260617|Allow for demoting 2FA-less members of further 6 groups (T418580)]] (duration: 10m 23s) [12:19:29] T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580 [12:19:40] (03PS10) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [12:19:40] (03PS10) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 
10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [12:20:23] PROBLEM - MariaDB Replica SQL: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:20:23] PROBLEM - MariaDB Replica IO: s3 on clouddb1022 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:21:23] RECOVERY - MariaDB Replica SQL: s3 on clouddb1022 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:21:23] RECOVERY - MariaDB Replica IO: s3 on clouddb1022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:21:25] RECOVERY - MariaDB Replica Lag: s3 on clouddb1022 is OK: OK slave_sql_lag Replication lag: 0.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:21:28] (03PS6) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) [12:22:02] (03CR) 10CI reject: [V:04-1] P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [12:22:18] (03CR) 10CI reject: [V:04-1] Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) (owner: 10Kgraessle) [12:22:33] (03PS11) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [12:22:33] (03PS11) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 
10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [12:23:17] (03PS1) 10Vgutierrez: trafficserver: Validate .lua.conf files [puppet] - 10https://gerrit.wikimedia.org/r/1260667 (https://phabricator.wikimedia.org/T421203) [12:23:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:24:53] (03CR) 10CI reject: [V:04-1] P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [12:24:58] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:25:40] (03CR) 10CI reject: [V:04-1] trafficserver: Validate .lua.conf files [puppet] - 10https://gerrit.wikimedia.org/r/1260667 (https://phabricator.wikimedia.org/T421203) (owner: 10Vgutierrez) [12:25:52] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host wdqs1028.eqiad.wmnet [12:26:18] (03PS7) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) [12:27:17] (03PS12) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [12:27:18] (03PS12) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [12:27:18] (03PS1) 10Majavah: ferm::client: Make signature compatible with firewall::client [puppet] - 10https://gerrit.wikimedia.org/r/1260670 [12:27:22] (03CR) 10Klausman: [C:03+1] ml-services: add shared memory volume to gpt isvc for NCCL communication [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1260660 (https://phabricator.wikimedia.org/T421105) (owner: 10Kevin Bazira) [12:27:47] (03CR) 10Kevin Bazira: [C:03+2] ml-services: add shared memory volume to gpt isvc for NCCL communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260660 (https://phabricator.wikimedia.org/T421105) (owner: 10Kevin Bazira) [12:29:52] (03Merged) 10jenkins-bot: ml-services: add shared memory volume to gpt isvc for NCCL communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260660 (https://phabricator.wikimedia.org/T421105) (owner: 10Kevin Bazira) [12:32:00] (03PS2) 10Vgutierrez: trafficserver: Validate .lua.conf files [puppet] - 10https://gerrit.wikimedia.org/r/1260667 (https://phabricator.wikimedia.org/T421203) [12:32:09] (03CR) 10Majavah: firewall: Declare resources for both providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [12:32:11] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [12:32:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1028.eqiad.wmnet [12:33:55] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:34:07] (03CR) 10CI reject: [V:04-1] trafficserver: Validate .lua.conf files [puppet] - 10https://gerrit.wikimedia.org/r/1260667 (https://phabricator.wikimedia.org/T421203) (owner: 10Vgutierrez) [12:34:25] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-mariadb1002.eqiad.wmnet [12:34:58] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-test-coord1002.eqiad.wmnet [12:37:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [12:37:55] (03PS3) 10Vgutierrez: trafficserver: Validate .lua.conf files [puppet] - 10https://gerrit.wikimedia.org/r/1260667 (https://phabricator.wikimedia.org/T421203) [12:38:03] (03CR) 10Vgutierrez: "CI output on PS2 shows how a syntax error is detected and triggers a CI failure: https://integration.wikimedia.org/ci/job/operations-puppe" [puppet] - 10https://gerrit.wikimedia.org/r/1260667 (https://phabricator.wikimedia.org/T421203) (owner: 10Vgutierrez) [12:38:19] !log mszwarc@deploy2002 mwscript-k8s job started: foreachwikiindblist all demoteIneligibleUsers.php --relay-log checkuser=metawiki --relay-log suppress=metawiki # T418580 [12:38:25] T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580 [12:38:27] jouncebot: nowandnext [12:38:27] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [12:38:27] In 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T1300) [12:38:37] That window is supposed to be cancelled [12:39:06] The switchover got removed again [12:39:13] (03CR) 10Vgutierrez: haproxy: canary CIDERGRINDER 🍎🐤 (034 comments) [puppet] 
- 10https://gerrit.wikimedia.org/r/1260076 (owner: 10CDanis) [12:39:40] (03PS5) 10Btullis: Route dse-k8s API blackbox checks to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) [12:39:57] Ah no it's an hour before we lock, ok [12:40:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-coord1002.eqiad.wmnet [12:40:39] (03PS8) 10Kgraessle: Enable revert risk filters for first batch of wikis: < 1000 monthly edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247065 (https://phabricator.wikimedia.org/T411485) [12:40:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [12:40:51] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1002.eqiad.wmnet [12:41:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256748 (https://phabricator.wikimedia.org/T420704) (owner: 10Codename Noreste) [12:41:13] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: 10Btullis) [12:41:34] Lucas_WMDE: urbanecm TheresNoTime just a heads-up since you're marked as deployers for the UTC backport window, please ping either me or bjensen if you see it's gonna go a little long. We'll be locking deployments starting at 1400UTC, we can wait a few minutes if you're not done, but not much more. 
[12:41:38] !log marostegui@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1022.eqiad.wmnet,service=s3 [12:42:47] And another three patches just got added... We'll be thinking about canceling it next time, because I thought we did. [12:43:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1022.eqiad.wmnet with reason: Downgrade clouddb1022 to 10.11.15 [12:45:02] (03CR) 10Clément Goubert: [C:03+2] trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert) [12:45:23] (03PS9) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) [12:46:12] Lucas_WMDE and TheresNoTime [12:46:25] I have three Gerrit patches to deploy [12:46:53] (03PS1) 10Majavah: nftables::client: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1260676 (https://phabricator.wikimedia.org/T411102) [12:47:07] Do they all need to be deployed (interdependency)? The backport window has a hard stop around 1400UTC because we'll be locking deployments to prepare for the DC switchover. [12:47:09] pcc latest auto [12:47:13] .. 
that's the wrong window [12:47:23] (03CR) 10CI reject: [V:04-1] nftables::client: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1260676 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [12:47:27] bah [12:47:39] taavi: zsh: command not found: bah [12:48:10] (03PS2) 10Majavah: nftables::client: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1260676 (https://phabricator.wikimedia.org/T411102) [12:48:27] (03CR) 10Clément Goubert: [C:03+2] trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert) [12:49:00] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8336/console" [puppet] - 10https://gerrit.wikimedia.org/r/1260676 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [12:51:23] so I'll have to reschedule them for today's late backport window [12:51:27] right? [12:51:50] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1022.eqiad.wmnet,service=s3 [12:52:26] codenamenoreste: They may fit, but I kind of need to know if some can be moved or not in case we're short on time [12:54:17] it's almost 8 AM where I live (almost 13:00 UTC) [12:54:44] I am ready to follow instructions from any backport member to test and deploy any patches [12:55:15] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:56:00] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:57:27] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:57:49] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T1300). [13:00:04] dcausse, awight, and codenamenoreste: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] o/ [13:00:24] Here [13:00:47] (03PS5) 10CDanis: haproxy: canary CIDERGRINDER 🍎🐤 [puppet] - 10https://gerrit.wikimedia.org/r/1260076 [13:00:52] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260076 (owner: 10CDanis) [13:00:54] (03CR) 10CDanis: haproxy: canary CIDERGRINDER 🍎🐤 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1260076 (owner: 10CDanis) [13:01:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:03] !log Inter.Link - DDoS - Activation of automatic reroute [13:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:23] dcausse: I could deploy your patch, unless you're already in there? 
[13:03:17] awight: haven't started anything yet, but mine did not go well last time so if shipped with others might be annoying [13:03:29] Reposting since it may have gotten buried in botspam: please ping either me or bjensen if you see the window's gonna go a little long. We'll be locking deployments starting at 1400UTC, we can wait a few minutes if you're not done, but not much more. [13:04:00] awight: in short I'm fine going last if there's some time left [13:04:06] same [13:04:12] <_< me too :-) [13:04:23] I would say, go ahead dcausse [13:04:27] ok [13:04:34] (03PS1) 10Ayounsi: Inter.link remove export BGP community [homer/public] - 10https://gerrit.wikimedia.org/r/1260681 [13:05:14] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, makes sense not to remove the community def in case we ever want to use it." [homer/public] - 10https://gerrit.wikimedia.org/r/1260681 (owner: 10Ayounsi) [13:05:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260045 (owner: 10DCausse) [13:06:16] (03Merged) 10jenkins-bot: Revert^2 "search: use the discovery ns record for the semanticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260045 (owner: 10DCausse) [13:06:36] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:06:49] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1260045|Revert^2 "search: use the discovery ns record for the semanticsearch cluster"]] [13:07:23] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:08:41] awight I might deploy three patches but if it's not possible, I can deploy only one if that's okay [13:09:02] codenamenoreste8: What if I bundle them together with mine, which is also a config change? 
[13:09:05] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1260045|Revert^2 "search: use the discovery ns record for the semanticsearch cluster"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:09:11] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:09:51] awight it's an idwiki user group change patch, and the last two are for ptwiki: enabling the block action for the abuse filter, and adding a user right for two user groups [13:09:58] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:10:04] codenamenoreste8: ack, sounds mostly safe! [13:10:24] safe? [13:10:55] meaning, known quantities that I don't expect to explode in spectacular new ways :-) [13:11:08] so they should be fine to deploy all at once IMHO [13:11:32] Sure, go ahead and deploy all three ;) [13:11:38] we would just deploy all the config changes, then review on the debug server and roll back if anything looks wrong. [13:11:46] ok! [13:12:25] when I first made the idwiki patch, I was very new to using php and Gerrit, and had to make some fixes - we all make mistakes as newbies [13:12:35] ok seems to work this time, shipping [13:12:47] !log dcausse@deploy2002 dcausse: Continuing with sync [13:17:10] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1260045|Revert^2 "search: use the discovery ns record for the semanticsearch cluster"]] (duration: 10m 20s) [13:17:40] awight: I'm done [13:17:48] tt! [13:17:50] ty [13:17:51] (03PS2) 10Ayounsi: Inter.link remove export BGP community [homer/public] - 10https://gerrit.wikimedia.org/r/1260681 [13:18:51] claime: Seems likely that we'll finish backports within the normal window. 
[13:18:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260614 (https://phabricator.wikimedia.org/T421055) (owner: 10Awight) [13:18:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [13:19:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [13:19:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256748 (https://phabricator.wikimedia.org/T420704) (owner: 10Codename Noreste) [13:19:08] (03CR) 10CI reject: [V:04-1] Inter.link remove export BGP community [homer/public] - 10https://gerrit.wikimedia.org/r/1260681 (owner: 10Ayounsi) [13:19:08] awight: awesome thanks [13:19:52] (03PS1) 10Elukey: admin_ng: move dse-k8s' cfssl-issuer to pki1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260685 (https://phabricator.wikimedia.org/T416664) [13:20:08] (03Merged) 10jenkins-bot: [beta] Kill synthetic refs with feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260614 (https://phabricator.wikimedia.org/T421055) (owner: 10Awight) [13:20:11] (03Merged) 10jenkins-bot: idwiki: Remove unused user groups on Indonesian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [13:20:14] (03Merged) 10jenkins-bot: ptwiki: Enable block action for the abuse filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [13:20:18] (03Merged) 10jenkins-bot: ptwiki: 
Add suppressredirect to autoreviewer and rollbacker user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256748 (https://phabricator.wikimedia.org/T420704) (owner: 10Codename Noreste) [13:20:51] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1260614|[beta] Kill synthetic refs with feature flag (T421055)]], [[gerrit:1251193|idwiki: Remove unused user groups on Indonesian Wikipedia (T419105)]], [[gerrit:1251200|ptwiki: Enable block action for the abuse filter (T419312)]], [[gerrit:1256748|ptwiki: Add suppressredirect to autoreviewer and rollbacker user groups (T420704)]] [13:21:02] T421055: Disable synthetic list-defined ref logic on the beta cluster - https://phabricator.wikimedia.org/T421055 [13:21:03] T419105: Remove old user groups on Indonesian Wikipedia - https://phabricator.wikimedia.org/T419105 [13:21:03] T419312: Addition of AbuseFilter blocking for the Portuguese Wikipedia - https://phabricator.wikimedia.org/T419312 [13:21:03] T420704: PTWIKI: Give autoreviewers and rollbackers the supressredirect permission - https://phabricator.wikimedia.org/T420704 [13:21:56] (03PS3) 10Ayounsi: Inter.link remove export BGP community [homer/public] - 10https://gerrit.wikimedia.org/r/1260681 [13:23:05] !log awight@deploy2002 codenamenoreste, awight, gerrit-patch-uploader: Backport for [[gerrit:1260614|[beta] Kill synthetic refs with feature flag (T421055)]], [[gerrit:1251193|idwiki: Remove unused user groups on Indonesian Wikipedia (T419105)]], [[gerrit:1251200|ptwiki: Enable block action for the abuse filter (T419312)]], [[gerrit:1256748|ptwiki: Add suppressredirect to autoreviewer and rollbacker user groups (T420704)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:23:22] codenamenoreste: ^ yes what that says. 
ready to test [13:23:29] (03PS4) 10Ayounsi: Inter.link remove export BGP community [homer/public] - 10https://gerrit.wikimedia.org/r/1260681 [13:23:41] awight I have WikimediaDebug [13:24:03] (03PS1) 10Hashar: puppetserver: emit info before deploying code [puppet] - 10https://gerrit.wikimedia.org/r/1260686 [13:24:50] .46 [13:25:23] codenamenoreste: great. If you're unsure whether the responses are coming from a debug server, look in the response headers for a line like: server mw-debug.codfw.pinkunicorn-6dc8b46479-2jm6q [13:26:38] hi awight etc! could you ping me when the backport window is done? I have a small config patch to deploy [13:26:42] idwiki and ptwiki seem to still work [13:26:46] ottomata: sure [13:26:50] ty [13:27:29] awight: both idwiki and ptwiki patches have worked! [13:27:35] !log awight@deploy2002 codenamenoreste, awight, gerrit-patch-uploader: Continuing with sync [13:27:37] great [13:27:55] (03CR) 10Cathal Mooney: [C:03+1] Inter.link remove export BGP community [homer/public] - 10https://gerrit.wikimedia.org/r/1260681 (owner: 10Ayounsi) [13:28:06] I checked with WikimediaDebug and they are available - get ready [13:28:20] (03CR) 10Ayounsi: [C:03+2] Inter.link remove export BGP community [homer/public] - 10https://gerrit.wikimedia.org/r/1260681 (owner: 10Ayounsi) [13:28:49] (03PS3) 10Hashar: ci: use docker.io package starting with Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) [13:28:49] (03CR) 10Hashar: "PCC output:" [puppet] - 10https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: 10Hashar) [13:29:03] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:29:20] (03CR) 10Elukey: "Hey folks! I know that you are probably wondering "really? Before the upgrade?". 
But for me it would be a great testbed since you'll have " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260685 (https://phabricator.wikimedia.org/T416664) (owner: 10Elukey) [13:29:21] (03CR) 10MVernon: [C:04-1] "Probably worth linking to both bugs, to be honest." [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T419663) (owner: 10Neriah) [13:29:41] (03Merged) 10jenkins-bot: Inter.link remove export BGP community [homer/public] - 10https://gerrit.wikimedia.org/r/1260681 (owner: 10Ayounsi) [13:29:56] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:32:24] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1260614|[beta] Kill synthetic refs with feature flag (T421055)]], [[gerrit:1251193|idwiki: Remove unused user groups on Indonesian Wikipedia (T419105)]], [[gerrit:1251200|ptwiki: Enable block action for the abuse filter (T419312)]], [[gerrit:1256748|ptwiki: Add suppressredirect to autoreviewer and rollbacker user groups (T420704)]] (duration: 11m 33s) [13:32:35] T421055: Disable synthetic list-defined ref logic on the beta cluster - https://phabricator.wikimedia.org/T421055 [13:32:35] T419105: Remove old user groups on Indonesian Wikipedia - https://phabricator.wikimedia.org/T419105 [13:32:36] T419312: Addition of AbuseFilter blocking for the Portuguese Wikipedia - https://phabricator.wikimedia.org/T419312 [13:32:36] T420704: PTWIKI: Give autoreviewers and rollbackers the supressredirect permission - https://phabricator.wikimedia.org/T420704 [13:33:27] (03PS1) 10Clément Goubert: rest-gateway: Add wikifeeds_catchall from api.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260688 (https://phabricator.wikimedia.org/T421233) [13:33:37] (03CR) 10Slyngshede: [C:03+1] "Clever" [puppet] - 10https://gerrit.wikimedia.org/r/1260667 (https://phabricator.wikimedia.org/T421203) (owner: 10Vgutierrez) [13:34:46] (03PS1) 10Clément Goubert: trafficserver: 100% of 
/feed/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1260690 (https://phabricator.wikimedia.org/T421233) [13:35:38] (03PS1) 10Kevin Bazira: ml-services: use containers instead of container in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260691 (https://phabricator.wikimedia.org/T421105) [13:38:08] ottomata: all yours! [13:38:10] thank you! [13:39:12] (03CR) 10Ottomata: stream: mw-page-html-content-change-enrich (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260637 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [13:40:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260091 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata) [13:40:59] (03CR) 10Jforrester: dumpInterwiki: Re-generate to add Abstract Wikipedia (and others) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259993 (https://phabricator.wikimedia.org/T420654) (owner: 10Jforrester) [13:40:59] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1260667 (https://phabricator.wikimedia.org/T421203) (owner: 10Vgutierrez) [13:41:28] (03Merged) 10jenkins-bot: EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260091 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata) [13:42:02] !log otto@deploy2002 Started scap sync-world: Backport for [[gerrit:1260091|EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams (T360794 T351225)]] [13:42:08] (03CR) 10Klausman: [C:03+1] ml-services: use containers instead of container in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260691 (https://phabricator.wikimedia.org/T421105) (owner: 10Kevin Bazira) [13:42:09] T360794: Event stream with latest revision HTML & parent revision HTML diff - https://phabricator.wikimedia.org/T360794 [13:42:10] T351225: Productionized Edit Types - https://phabricator.wikimedia.org/T351225 [13:42:33] (03CR) 10Kevin Bazira: [C:03+2] ml-services: use containers instead of container in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260691 (https://phabricator.wikimedia.org/T421105) (owner: 10Kevin Bazira) [13:44:21] !log otto@deploy2002 otto: Backport for [[gerrit:1260091|EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams (T360794 T351225)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:44:31] (03Merged) 10jenkins-bot: ml-services: use containers instead of container in gpt isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260691 (https://phabricator.wikimedia.org/T421105) (owner: 10Kevin Bazira) [13:45:21] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:45:26] !log otto@deploy2002 otto: Continuing with sync
[13:45:53] (CR) Vgutierrez: [C:+2] trafficserver: Validate .lua.conf files [puppet] - https://gerrit.wikimedia.org/r/1260667 (https://phabricator.wikimedia.org/T421203) (owner: Vgutierrez)
[13:46:12] (PS6) CDanis: haproxy: canary CIDERGRINDER 🍎🐤 [puppet] - https://gerrit.wikimedia.org/r/1260076
[13:46:12] (PS1) CDanis: aptrepo: add cidergrinder [puppet] - https://gerrit.wikimedia.org/r/1260693
[13:46:51] (PS2) CDanis: aptrepo: add cidergrinder [puppet] - https://gerrit.wikimedia.org/r/1260693
[13:46:51] (PS7) CDanis: haproxy: canary CIDERGRINDER 🍎🐤 [puppet] - https://gerrit.wikimedia.org/r/1260076
[13:46:59] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260693 (owner: CDanis)
[13:49:49] !log otto@deploy2002 Finished scap sync-world: Backport for [[gerrit:1260091|EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams (T360794 T351225)]] (duration: 07m 48s)
[13:49:57] T360794: Event stream with latest revision HTML & parent revision HTML diff - https://phabricator.wikimedia.org/T360794
[13:49:58] T351225: Productionized Edit Types - https://phabricator.wikimedia.org/T351225
[13:50:16] (PS2) Clément Goubert: rest-gateway: Add wikifeeds from api.w.o [deployment-charts] - https://gerrit.wikimedia.org/r/1260688 (https://phabricator.wikimedia.org/T421233)
[13:51:17] (PS3) CDanis: aptrepo: add cidergrinder [puppet] - https://gerrit.wikimedia.org/r/1260693
[13:51:17] (PS8) CDanis: haproxy: canary CIDERGRINDER 🍎🐤 [puppet] - https://gerrit.wikimedia.org/r/1260076
[13:51:21] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260693 (owner: CDanis)
[13:52:39] (CR) Vgutierrez: haproxy: canary CIDERGRINDER 🍎🐤 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260076 (owner: CDanis)
[13:52:58] hey folks, scap will be locked for the switchover today at 14:00 UTC,
and the read only part of the switchover is targeted for 15:00 UTC (T413974)
[13:52:59] T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974
[13:54:17] (PS2) Neriah: upload: Return 400 instead of 429 for non-standard thumbnail sizes [puppet] - https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805)
[13:54:27] (PS9) CDanis: haproxy: canary CIDERGRINDER 🍎🐤 [puppet] - https://gerrit.wikimedia.org/r/1260076
[13:54:33] (CR) CDanis: haproxy: canary CIDERGRINDER 🍎🐤 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260076 (owner: CDanis)
[13:55:03] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11749076 (ayounsi) As a side note we will need to manually change the IPs of the routed ganeti nodes in rack 23 to the 10.128.1.0/24 subnet. Normal operation...
[13:57:36] (PS1) Majavah: nftables: Fix handling of undefs [puppet] - https://gerrit.wikimedia.org/r/1260694
[13:57:41] (CR) CDanis: "> The deb package needs to be uploaded to the apt repo (right now it's just on staging I think)."
[puppet] - https://gerrit.wikimedia.org/r/1260076 (owner: CDanis)
[13:59:36] (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8337/console" [puppet] - https://gerrit.wikimedia.org/r/1260694 (owner: Majavah)
[14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T1400)
[14:00:10] (PS1) Jforrester: wikifunctions: Upgrade evaluators from 2026-03-16-124858 to 2026-03-25-132409 [deployment-charts] - https://gerrit.wikimedia.org/r/1260697 (https://phabricator.wikimedia.org/T418150)
[14:00:17] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-lock-scap for datacenter switchover from codfw to eqiad
[14:00:18] (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-23-124102 to 2026-03-25-132654 [deployment-charts] - https://gerrit.wikimedia.org/r/1260698 (https://phabricator.wikimedia.org/T420039)
[14:00:20] !log root@deploy2002 Locking from deployment [ALL REPOSITORIES]: Datacenter switchover from codfw to eqiad -
[14:00:20] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-lock-scap (exit_code=0) for datacenter switchover from codfw to eqiad
[14:01:57] (PS1) Ayounsi: ospf: remove old noop config [homer/public] - https://gerrit.wikimedia.org/r/1260699
[14:02:17] (CR) Genoveva Galarza: [C:+2] wikifunctions: Upgrade evaluators from 2026-03-16-124858 to 2026-03-25-132409 [deployment-charts] - https://gerrit.wikimedia.org/r/1260697 (https://phabricator.wikimedia.org/T418150) (owner: Jforrester)
[14:04:37] (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2026-03-16-124858 to 2026-03-25-132409 [deployment-charts] - https://gerrit.wikimedia.org/r/1260697 (https://phabricator.wikimedia.org/T418150) (owner: Jforrester)
[14:05:22] !log gengh@deploy2002 helmfile [staging] START
helmfile.d/services/wikifunctions: apply
[14:06:03] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:06:05] (CR) Vgutierrez: [C:+1] haproxy: canary CIDERGRINDER 🍎🐤 [puppet] - https://gerrit.wikimedia.org/r/1260076 (owner: CDanis)
[14:06:25] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:06:34] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:07:13] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:07:14] !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:07:20] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:08:10] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:09:03] (CR) Genoveva Galarza: [C:+2] wikifunctions: Upgrade orchestrator from 2026-03-23-124102 to 2026-03-25-132654 [deployment-charts] - https://gerrit.wikimedia.org/r/1260698 (https://phabricator.wikimedia.org/T420039) (owner: Jforrester)
[14:09:04] (PS1) Esanders: Add wss://*.toolforge.org to CSP [puppet] - https://gerrit.wikimedia.org/r/1260700 (https://phabricator.wikimedia.org/T420631)
[14:10:06] (CR) Cathal Mooney: [C:+1] "Nice to see this go :)" [homer/public] - https://gerrit.wikimedia.org/r/1260699 (owner: Ayounsi)
[14:10:11] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:10:13] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:10:34] (CR) JHathaway: [C:+1] firewall: Declare resources for both providers [puppet] - https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: Majavah)
[14:10:57] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-23-124102 to 2026-03-25-132654 [deployment-charts] - https://gerrit.wikimedia.org/r/1260698 (https://phabricator.wikimedia.org/T420039) (owner: Jforrester)
[14:11:20] !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:11:22] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:11:24] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:11:40] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:12:11] (CR) Ayounsi: [C:+2] ospf: remove old noop config [homer/public] - https://gerrit.wikimedia.org/r/1260699 (owner: Ayounsi)
[14:12:54] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:13:18] !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:13:25] RECOVERY - MariaDB Replica Lag: s3 on clouddb1023 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:13:38] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:13:50] (Merged) jenkins-bot: ospf: remove old noop config [homer/public] - https://gerrit.wikimedia.org/r/1260699 (owner: Ayounsi)
[14:14:11] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:15:25] (CR) Fabfur: [C:+1] aptrepo: add cidergrinder [puppet] - https://gerrit.wikimedia.org/r/1260693 (owner: CDanis)
[14:17:18] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['phab2002']
[14:17:35] (PS4) CDanis:
aptrepo: add cidergrinder [puppet] - https://gerrit.wikimedia.org/r/1260693
[14:17:35] (PS10) CDanis: haproxy: canary CIDERGRINDER 🍎🐤 [puppet] - https://gerrit.wikimedia.org/r/1260076
[14:17:46] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260693 (owner: CDanis)
[14:17:52] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260076 (owner: CDanis)
[14:18:26] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:20:38] (CR) Majavah: aptrepo: add cidergrinder (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260693 (owner: CDanis)
[14:21:42] (CR) CDanis: aptrepo: add cidergrinder (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260693 (owner: CDanis)
[14:22:43] (CR) CDanis: [C:+2] aptrepo: add cidergrinder [puppet] - https://gerrit.wikimedia.org/r/1260693 (owner: CDanis)
[14:23:17] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11749168 (Scott_French)
[14:23:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:24:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:24:25] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11749170 (Scott_French)
[14:24:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['phab2002']
[14:25:29] (PS1) Jforrester: check-wf-services: Add a test case from T420021 [deployment-charts] - https://gerrit.wikimedia.org/r/1260705 (https://phabricator.wikimedia.org/T418887)
[14:26:17] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11749181 (Scott_French) Great, thanks @Daria-WMDE! Once @KFrancis confirms everything is all set, I believe that should be everythin...
[14:26:58] (PS2) JHathaway: nftables: Fix handling of undefs [puppet] - https://gerrit.wikimedia.org/r/1260694 (owner: Majavah)
[14:27:05] (CR) Kamila Součková: "> First, you might have this in another patch I've not seen yet, but just in case: I think you're going to need to add TCP 4081 to mediawi" [mediawiki-config] - https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková)
[14:27:16] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková)
[14:28:21] (CR) Cory Massaro: check-wf-services: Add a test case from T420021 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1260705 (https://phabricator.wikimedia.org/T418887) (owner: Jforrester)
[14:28:29] !log 💙cdanis@apt1002.wikimedia.org ~ 🕥☕ sudo -i reprepro --component main --restrict cidergrinder update bullseye-wikimedia
[14:28:33]
!log 💙cdanis@apt1002.wikimedia.org ~ 🕥☕ sudo -i reprepro --component main --restrict cidergrinder update trixie-wikimedia
[14:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:40] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11749190 (Scott_French) @Alice.moutinho - Did you receive a new NDA link or are you still awaiting that?
[14:28:44] (PS2) Jforrester: check-wf-services: Add a test case from T420021 [deployment-charts] - https://gerrit.wikimedia.org/r/1260705 (https://phabricator.wikimedia.org/T418887)
[14:29:13] (CR) Herron: [C:+1] Add missing Prometheus and Grafana configs for k8s-aux [puppet] - https://gerrit.wikimedia.org/r/1260628 (https://phabricator.wikimedia.org/T358189) (owner: Elukey)
[14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T1400)
[14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T1430)
[14:30:36] (CR) CDanis: [C:+2] haproxy: canary CIDERGRINDER 🍎🐤 [puppet] - https://gerrit.wikimedia.org/r/1260076 (owner: CDanis)
[14:30:39] (CR) Elukey: [C:+2] Add missing Prometheus and Grafana configs for k8s-aux [puppet] - https://gerrit.wikimedia.org/r/1260628 (https://phabricator.wikimedia.org/T358189) (owner: Elukey)
[14:31:20] elukey: ok to merge yours?
[14:31:24] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11749201 (Alice.moutinho) Hi @Scott_French , i did, and signed, this monday!
[14:31:36] cdanis: yep!
[14:31:48] 🚀
[14:31:58] (CR) Jforrester: check-wf-services: Add a test case from T420021 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1260705 (https://phabricator.wikimedia.org/T418887) (owner: Jforrester)
[14:33:27] SRE, ServiceOps new, Datacenter-Switchover: Increased rate of badtoken errors / session store issues due to datacenter switchover? - https://phabricator.wikimedia.org/T421168#11749211 (MLechvien-WMF)
[14:33:52] (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8338/console" [puppet] - https://gerrit.wikimedia.org/r/1260694 (owner: Majavah)
[14:33:58] (CR) Majavah: [V:+1 C:+2] nftables: Fix handling of undefs [puppet] - https://gerrit.wikimedia.org/r/1260694 (owner: Majavah)
[14:37:14] (CR) JHathaway: [C:+1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/1260670 (owner: Majavah)
[14:37:41] (PS1) CDanis: cidergrinder: mmdb lua: typo fix [puppet] - https://gerrit.wikimedia.org/r/1260711
[14:38:54] (CR) Majavah: [C:+2] ferm::client: Make signature compatible with firewall::client [puppet] - https://gerrit.wikimedia.org/r/1260670 (owner: Majavah)
[14:39:35] (CR) JMeybohm: rest-gateway: Add wikifeeds from api.w.o (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1260688 (https://phabricator.wikimedia.org/T421233) (owner: Clément Goubert)
[14:39:42] (CR) CDanis: [C:+2] cidergrinder: mmdb lua: typo fix [puppet] - https://gerrit.wikimedia.org/r/1260711 (owner: CDanis)
[14:40:11] (PS3) Clément Goubert: rest-gateway: Add wikifeeds from api.w.o [deployment-charts] - https://gerrit.wikimedia.org/r/1260688 (https://phabricator.wikimedia.org/T421233)
[14:40:35] (CR) Clément Goubert: rest-gateway: Add wikifeeds from api.w.o (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1260688
(https://phabricator.wikimedia.org/T421233) (owner: Clément Goubert)
[14:40:38] (PS5) Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911)
[14:40:38] (PS5) Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911)
[14:41:39] (CR) JHathaway: puppetserver: emit info before deploying code (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260686 (owner: Hashar)
[14:43:05] (CR) Jforrester: [C:+2] check-wf-services: Add a test case from T420021 [deployment-charts] - https://gerrit.wikimedia.org/r/1260705 (https://phabricator.wikimedia.org/T418887) (owner: Jforrester)
[14:43:11] (CR) Cory Massaro: [C:+2] check-wf-services: Add a test case from T420021 [deployment-charts] - https://gerrit.wikimedia.org/r/1260705 (https://phabricator.wikimedia.org/T418887) (owner: Jforrester)
[14:43:56] (PS1) Ayounsi: Add Nokia POPs BGP policies [homer/public] - https://gerrit.wikimedia.org/r/1260715 (https://phabricator.wikimedia.org/T408892)
[14:45:13] (Merged) jenkins-bot: check-wf-services: Add a test case from T420021 [deployment-charts] - https://gerrit.wikimedia.org/r/1260705 (https://phabricator.wikimedia.org/T418887) (owner: Jforrester)
[14:46:18] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from codfw to eqiad
[14:46:44] (PS13) Majavah: firewall: Declare resources for both providers [puppet] - https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089)
[14:46:44] (PS13) Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089)
[14:46:44] (PS1) Majavah: cephadm::target: Convert port to an integer [puppet] -
https://gerrit.wikimedia.org/r/1260716
[14:46:45] (PS1) Majavah: P:mariadb::ferm: Fix typing around port variable [puppet] - https://gerrit.wikimedia.org/r/1260717
[14:46:47] (PS1) Majavah: P:zuul::zuul_web: Make srange an array [puppet] - https://gerrit.wikimedia.org/r/1260718
[14:46:51] (PS1) Majavah: P:opensearch::cirrus::test: Convert port to an integer [puppet] - https://gerrit.wikimedia.org/r/1260719
[14:46:55] (PS1) Majavah: P:openstack: pdns::auth: Convert port to integer [puppet] - https://gerrit.wikimedia.org/r/1260720
[14:46:59] (PS1) Majavah: nftables: Fix issues around virtual resource dependencies [puppet] - https://gerrit.wikimedia.org/r/1260721
[14:47:07] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: Majavah)
[14:47:33] ops-codfw, SRE, DC-Ops, observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#11749284 (Jhancock.wm) @herron we've received these and will try to get them installed by EoW. please update site.pp to include the newest nodes. ty!
[14:47:54] We're starting with the prep steps leading to read-only
[14:50:11] (CR) Elukey: "Great work! I've replied in the task, we should reach an agreement about vulture/pyroma/prospector, after that I think we are ready to go."
[cookbooks] - https://gerrit.wikimedia.org/r/1240635 (owner: Federico Ceratto)
[14:51:28] (PS1) JMeybohm: wikikube: Switch to IPIP mode on workers [puppet] - https://gerrit.wikimedia.org/r/1260723 (https://phabricator.wikimedia.org/T420436)
[14:51:57] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) for datacenter switchover from codfw to eqiad
[14:52:12] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from codfw to eqiad
[14:52:14] (CR) Marostegui: [C:+1] wmnet: update CNAME records for DB masters for dc switchover [dns] - https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) (owner: Gerrit maintenance bot)
[14:52:29] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) for datacenter switchover from codfw to eqiad
[14:53:18] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from codfw to eqiad
[14:53:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:55:19] (PS1) CDanis: cidergrinder: mmdb lua: more late-breaking fixes [puppet] - https://gerrit.wikimedia.org/r/1260725
[14:55:37] ops-codfw, collaboration-services, DC-Ops, Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11749332 (Jhancock.wm) @Dzahn looks like that worked. it rebooted after the update and there...
[14:55:42] (CR) CDanis: [C:+2] cidergrinder: mmdb lua: more late-breaking fixes [puppet] - https://gerrit.wikimedia.org/r/1260725 (owner: CDanis)
[14:55:56] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) for datacenter switchover from codfw to eqiad
[14:56:11] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260723 (https://phabricator.wikimedia.org/T420436) (owner: JMeybohm)
[14:56:27] (PS1) Jforrester: check-wf-services: Disable new test case for now [deployment-charts] - https://gerrit.wikimedia.org/r/1260726
[14:56:31] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11749333 (Scott_French) Great, thank you! @KFrancis - If you could confirm when the NDA is accepted / complete, that would be greatly appreciated.
[14:56:36] (CR) Jforrester: [C:+2] check-wf-services: Disable new test case for now [deployment-charts] - https://gerrit.wikimedia.org/r/1260726 (owner: Jforrester)
[14:57:32] (CR) CI reject: [V:-1] cidergrinder: mmdb lua: more late-breaking fixes [puppet] - https://gerrit.wikimedia.org/r/1260725 (owner: CDanis)
[14:57:53] (PS2) CDanis: cidergrinder: mmdb lua: more late-breaking fixes [puppet] - https://gerrit.wikimedia.org/r/1260725
[14:58:53] (Merged) jenkins-bot: check-wf-services: Disable new test case for now [deployment-charts] - https://gerrit.wikimedia.org/r/1260726 (owner: Jforrester)
[14:59:08] (CR) Cathal Mooney: [C:+1] "LGTM! Validated all the prefixes against the current JunOS ones they all match."
[homer/public] - https://gerrit.wikimedia.org/r/1260715 (https://phabricator.wikimedia.org/T408892) (owner: Ayounsi)
[14:59:32] (CR) JMeybohm: [C:+1] rest-gateway: Add wikifeeds from api.w.o [deployment-charts] - https://gerrit.wikimedia.org/r/1260688 (https://phabricator.wikimedia.org/T421233) (owner: Clément Goubert)
[14:59:38] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1260727
[15:00:04] Deploy window Datacenter Switchover (T413974) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T1500)
[15:00:05] T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974
[15:00:10] (CR) CDanis: [C:+2] cidergrinder: mmdb lua: more late-breaking fixes [puppet] - https://gerrit.wikimedia.org/r/1260725 (owner: CDanis)
[15:00:11] good to go for the switchover?
[15:00:14] gogogogo
[15:00:15] go
[15:00:15] ep
[15:00:15] go :)
[15:00:16] yep
[15:00:21] <_joe_> go
[15:00:24] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from codfw to eqiad
[15:00:24] !log blake@cumin1003 MediaWiki read-only period starts at: 2026-03-25 15:00:24.089398
[15:00:25] <3
[15:00:26] \i/
[15:00:27] blake@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[15:00:28] blake@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[15:00:33] expected
[15:00:46] epic
[15:00:48] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) for datacenter switchover from codfw to eqiad
[15:00:50] blake@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[15:00:54] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from codfw to eqiad
[15:00:56] blake@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[15:01:04] we are silent
[15:01:20] "vandalism rates plummet after wikipedia deploys new anti-spam update"
[15:01:27] (yeah I know I made this joke last switchover)
[15:01:48] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) for datacenter switchover from codfw to eqiad
[15:01:49] blake@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[15:01:51] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from codfw to eqiad
[15:01:52] blake@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[15:02:32] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) for datacenter switchover from codfw to eqiad
[15:02:34] blake@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[15:02:34] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from codfw to eqiad
[15:02:36] blake@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[15:02:39] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) for datacenter switchover from codfw to eqiad
[15:02:40] blake@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[15:02:42] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from codfw to eqiad
[15:02:43] blake@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[15:02:49] sound
[15:02:50] And we're back
[15:02:53] !log blake@cumin1003 MediaWiki read-only period ends at: 2026-03-25 15:02:52.921926
[15:02:53] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) for datacenter switchover from codfw to eqiad
[15:02:53] sound
[15:03:05] 'grats y'all
[15:03:23] Impressive as ever.
[15:03:40] That's amazingly fast
[15:03:56] well folks, from the flight deck, welcome to eqiad, where the local time is 11:03 AM
[15:04:20] so glad my edits will have slightly lower ping now
[15:04:25] lol
[15:05:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:05:56] ^ might just be delayed edit errors, we're monitoring
[15:06:41] (PS1) Btullis: Temporarily disable the deployment of mediawiki to dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1260729 (https://phabricator.wikimedia.org/T414484)
[15:06:48] yeah 5xxs are back down, that alert should recover
[15:07:24] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from codfw to eqiad
[15:07:25] !log root@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: sync
[15:07:46] !log root@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: sync
[15:07:47] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=0) for datacenter switchover from codfw to eqiad
[15:07:56] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from codfw to eqiad
[15:07:57] !log root@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[15:08:01] (PS1) Slyngshede: C:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446)
[15:08:03] (PS1) Elukey: wikifunctions: increase resources assigned [deployment-charts] - https://gerrit.wikimedia.org/r/1260731 (https://phabricator.wikimedia.org/T418160)
[15:08:10] (PS2) Btullis: Temporarily disable the deployment of mediawiki to dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1260729
(https://phabricator.wikimedia.org/T414484)
[15:08:34] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260729 (https://phabricator.wikimedia.org/T414484) (owner: Btullis)
[15:08:43] !log root@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[15:09:34] (Abandoned) JHathaway: WIP [puppet] - https://gerrit.wikimedia.org/r/1260113 (owner: JHathaway)
[15:09:39] !log root@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[15:09:41] This is a silly question... are there a decent number of disk drives in eqiad/codfw, or is it mostly solid state? I wonder if you could actually hear stuff spin down/up during the switchover.
[15:09:51] (CR) CI reject: [V:-1] C:tofurkey Add tofurkey [puppet] - https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: Slyngshede)
[15:10:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:10:20] from the logged timestamps, total RO time of 02:28.832528
[15:10:39] !log root@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[15:11:27] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) for datacenter switchover from codfw to eqiad
[15:13:13] (CR) Herron: [C:+1] "LGTM overall!
When it comes to sharding the compactors, could we hold on that until the current compactor has been happy for a reasonable" [puppet] - https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: Tiziano Fogli)
[15:13:28] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from codfw to eqiad
[15:14:03] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) for datacenter switchover from codfw to eqiad
[15:15:28] (CR) Blake: [C:+2] wmnet: update CNAME records for DB masters for dc switchover [dns] - https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) (owner: Gerrit maintenance bot)
[15:16:08] (CR) Elukey: [C:+2] admin_ng: move dse-k8s' cfssl-issuer to pki1002 [deployment-charts] - https://gerrit.wikimedia.org/r/1260685 (https://phabricator.wikimedia.org/T416664) (owner: Elukey)
[15:16:31] !log blake@dns1004 START - running authdns-update
[15:18:07] !log blake@dns1004 END - running authdns-update
[15:18:10] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260723 (https://phabricator.wikimedia.org/T420436) (owner: JMeybohm)
[15:18:18] !log blake@cumin1003 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from codfw to eqiad
[15:20:29] (PS2) Elukey: wikifunctions: increase resources assigned [deployment-charts] - https://gerrit.wikimedia.org/r/1260731 (https://phabricator.wikimedia.org/T415067)
[15:22:23] SRE-SLO, observability, Traffic: Factor in pooled status for SLO measurements - https://phabricator.wikimedia.org/T420498#11749443 (hnowlan)
[15:23:59] !log elukey@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[15:24:00] (PS4) Daniel Kinzler: rest gateway: add second Lua filter for header handling [deployment-charts] - https://gerrit.wikimedia.org/r/1250675 (https://phabricator.wikimedia.org/T418969)
[15:24:05] !log elukey@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[15:24:10] (CR) Blake: [C:+2] geo-maps: update map default to list eqiad first [dns] - https://gerrit.wikimedia.org/r/1244621 (https://phabricator.wikimedia.org/T413974) (owner: Blake)
[15:24:17] (PS1) Cathal Mooney: Add policy 'transport-in' to apply as import on transport circuits [homer/public] - https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821)
[15:24:25] !log blake@dns1004 START - running authdns-update
[15:25:25] (PS1) CDanis: haproxy: CIDERGRINDER 🍎🐤 canary also on text📚 [puppet] - https://gerrit.wikimedia.org/r/1260735
[15:25:39] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260735 (owner: CDanis)
[15:25:47] (CR) Hashar: puppetserver: emit info before deploying code (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260686 (owner: Hashar)
[15:25:55] (PS2) Hashar: puppetserver: emit info before deploying code [puppet] - https://gerrit.wikimedia.org/r/1260686
[15:26:20] !log blake@dns1004 END - running authdns-update
[15:26:35] (PS2) Cathal Mooney: Add policy 'transport_in' to apply as import on transport circuits [homer/public] - https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821)
[15:30:14] (PS1) Jforrester: wikifunctions: Replace check-wf-services.sh with a Python version [deployment-charts] - https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243)
[15:31:49] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) for datacenter switchover from codfw to eqiad
[15:32:03] !log blake@cumin1003 START - Cookbook
sre.switchdc.mediawiki.09-unlock-scap for datacenter switchover from codfw to eqiad
[15:32:04] !log root@deploy2002 Forcefully removing global lock: Datacenter switchover from codfw to eqiad -
[15:32:05] !log root@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter switchover from codfw to eqiad - (duration: 91m 45s)
[15:32:05] !log blake@cumin1003 END (PASS) - Cookbook sre.switchdc.mediawiki.09-unlock-scap (exit_code=0) for datacenter switchover from codfw to eqiad
[15:32:49] (PS1) Kevin Bazira: kserve-inference: add support for rendering predictor volumes [deployment-charts] - https://gerrit.wikimedia.org/r/1260739 (https://phabricator.wikimedia.org/T421105)
[15:33:10] (CR) TrainBranchBot: [C:+2] "Approved by blake@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1244628 (https://phabricator.wikimedia.org/T413974) (owner: Blake)
[15:33:22] (CR) Jforrester: wikifunctions: Replace check-wf-services.sh with a Python version (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1260738 (https://phabricator.wikimedia.org/T421243) (owner: Jforrester)
[15:33:30] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:34:02] (Merged) jenkins-bot: debug: reorder debug backends for eqiad switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/1244628 (https://phabricator.wikimedia.org/T413974) (owner: Blake)
[15:34:20] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:34:39] !log blake@deploy2002 Started scap sync-world: Backport for [[gerrit:1244628|debug: reorder debug backends for eqiad switchover (T413974)]]
[15:34:43] T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974
[15:37:00] !log blake@deploy2002 blake: Backport for [[gerrit:1244628|debug: reorder debug backends for eqiad switchover (T413974)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:37:07] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:37:33] (CR) Klausman: [C:+1] "Excellent work!" [deployment-charts] - https://gerrit.wikimedia.org/r/1260739 (https://phabricator.wikimedia.org/T421105) (owner: Kevin Bazira)
[15:37:47] !log blake@deploy2002 blake: Continuing with sync
[15:38:48] (CR) Kevin Bazira: [C:+2] kserve-inference: add support for rendering predictor volumes [deployment-charts] - https://gerrit.wikimedia.org/r/1260739 (https://phabricator.wikimedia.org/T421105) (owner: Kevin Bazira)
[15:39:01] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:39:09] (CR) Jforrester: "This seems reasonable. Want me to deploy and see what happens?" [deployment-charts] - https://gerrit.wikimedia.org/r/1260731 (https://phabricator.wikimedia.org/T415067) (owner: Elukey)
[15:41:06] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:41:23] (CR) Brouberol: [C:+1] Temporarily disable the deployment of mediawiki to dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1260729 (https://phabricator.wikimedia.org/T414484) (owner: Btullis)
[15:41:25] ops-codfw, SRE, DC-Ops, observability: Q3:rack/setup/install kafka-logging200[6-8] - https://phabricator.wikimedia.org/T418931#11749558 (colewhite) I'd like to propose a rename to `logging-kafka` so that these hosts follow the other `logging-*` hosts indicating its role in the larger cluster.
[15:41:32] ops-eqiad, SRE, DC-Ops, observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11749560 (colewhite) I'd like to propose a rename to `logging-kafka` so that these hosts follow the other `logging-*` hosts indicating its role in the larger cluster.
[15:42:20] !log blake@deploy2002 Finished scap sync-world: Backport for [[gerrit:1244628|debug: reorder debug backends for eqiad switchover (T413974)]] (duration: 07m 41s)
[15:42:25] T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974
[15:43:18] (Merged) jenkins-bot: kserve-inference: add support for rendering predictor volumes [deployment-charts] - https://gerrit.wikimedia.org/r/1260739 (https://phabricator.wikimedia.org/T421105) (owner: Kevin Bazira)
[15:43:50] (CR) Elukey: "Yes please :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1260731 (https://phabricator.wikimedia.org/T415067) (owner: Elukey)
[15:44:00] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:44:25] (CR) CDanis: [C:+2] haproxy: CIDERGRINDER 🍎🐤 canary also on text📚 [puppet] - https://gerrit.wikimedia.org/r/1260735 (owner: CDanis)
[15:44:34] ops-codfw,
collaboration-services, DC-Ops, Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11749564 (Dzahn) In progress→Resolved a:Dzahn Great! Thank you, @Jhancock.wm...
[15:45:01] switchover activities are complete, thanks!
[15:45:24] well done!
[15:47:48] nice one :) do we know the total time RO?
[15:48:35] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:48:41] just under 2.5 minutes, it looks like :)
[15:49:18] very impressive, but still not as fast as june 2021.
[15:49:22] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:49:47] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:50:37] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:50:51] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[15:51:26] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:51:40] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:54:30] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:55:00] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:56:58] Well done bjensen and team!
Thanks for the smooth process and clear communication along the way, it was great to watch :)
[15:58:33] (CR) Jforrester: [C:+2] wikifunctions: increase resources assigned [deployment-charts] - https://gerrit.wikimedia.org/r/1260731 (https://phabricator.wikimedia.org/T415067) (owner: Elukey)
[16:00:32] (CR) Dzahn: "I think the real fix here is to simplify this all to just "include profile::docker::engine" which is meant to handle this in a unified man" [puppet] - https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: Hashar)
[16:00:35] (Merged) jenkins-bot: wikifunctions: increase resources assigned [deployment-charts] - https://gerrit.wikimedia.org/r/1260731 (https://phabricator.wikimedia.org/T415067) (owner: Elukey)
[16:01:24] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:02:28] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:02:32] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[16:02:55] (CR) Dzahn: [C:+2] "ah, yes. that makes sense.
thank you" [puppet] - https://gerrit.wikimedia.org/r/1260718 (owner: Majavah)
[16:03:18] (PS2) Majavah: P:zuul::zuul_web: Make srange an array [puppet] - https://gerrit.wikimedia.org/r/1260718
[16:03:32] (CR) Dzahn: [C:+2] P:zuul::zuul_web: Make srange an array [puppet] - https://gerrit.wikimedia.org/r/1260718 (owner: Majavah)
[16:03:33] (CR) Dzahn: [V:+2 C:+2] P:zuul::zuul_web: Make srange an array [puppet] - https://gerrit.wikimedia.org/r/1260718 (owner: Majavah)
[16:03:45] SRE, Traffic, Documentation, Sustainability (Incident Followup): Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count - https://phabricator.wikimedia.org/T421207#11749639 (BCornwall) Open→In progress a:BCornwall
[16:03:51] (PS1) JMeybohm: kubernetes: Remove docker related hiera settings from nodes [puppet] - https://gerrit.wikimedia.org/r/1260742 (https://phabricator.wikimedia.org/T395870)
[16:03:51] SRE, Traffic, Documentation, Sustainability (Incident Followup): Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count - https://phabricator.wikimedia.org/T421207#11749643 (BCornwall) p:Triage→Low
[16:03:57] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:04:35] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:05:00] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:05:29] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:05:32] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:05:57] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:05:59] dpogorzelski, klausman --^ I guess test-related
[16:06:48] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:06:54] (CR) Dzahn: "are we removing httpd though?" [puppet] - https://gerrit.wikimedia.org/r/1256446 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[16:08:25] elukey: yes, Dawid is playing shuffle-a-version with knative and "friends"
[16:08:26] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:10:26] SRE-SLO, Abstract Wikipedia team, ServiceOps new, Essential-Work, Patch-For-Review: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026 - https://phabricator.wikimedia.org/T418160#11749706 (ecarg) Looks like the 5xx debug info are in the log fields now h...
[16:10:28] SRE, Traffic, Documentation, Sustainability (Incident Followup): Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count - https://phabricator.wikimedia.org/T421207#11749707 (BCornwall) In progress→Resolved The information on the page was not accurate and...
[16:11:41] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260742 (https://phabricator.wikimedia.org/T395870) (owner: JMeybohm)
[16:11:54] (PS10) Clément Goubert: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146)
[16:12:06] (PS5) Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146)
[16:12:51] (CR) Clément Goubert: [C:+2] rest-gateway: Add wikifeeds from api.w.o [deployment-charts] - https://gerrit.wikimedia.org/r/1260688 (https://phabricator.wikimedia.org/T421233) (owner: Clément Goubert)
[16:13:50] jouncebot: nowandnext
[16:13:50] For the next 0 hour(s) and 46 minute(s): Datacenter Switchover (T413974) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T1500)
[16:13:50] In 3 hour(s) and 46 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T2000)
[16:13:51] T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974
[16:14:02] (Will wait)
[16:15:13] (Merged) jenkins-bot: rest-gateway: Add wikifeeds from api.w.o [deployment-charts] - https://gerrit.wikimedia.org/r/1260688 (https://phabricator.wikimedia.org/T421233) (owner: Clément Goubert)
[16:18:47] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[16:18:55] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[16:19:08] !log
cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[16:20:20] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[16:20:29] !log ebysans@deploy2002 Started deploy [analytics/refinery@80c527b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@80c527b6]
[16:20:30] Dreamy_Jazz: we're done you can go ahead
[16:20:36] Thanks
[16:20:49] Using scap to deploy private code, will say when done
[16:21:45] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[16:21:46] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:21:59] (CR) Tiziano Fogli: "Currently, whenever the compactor is halted due to overlapping blocks related to the k8s instance that need to be deleted, the backlog inc" [puppet] - https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: Tiziano Fogli)
[16:22:09] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[16:22:28] !log ebysans@deploy2002 Finished deploy [analytics/refinery@80c527b] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@80c527b6] (duration: 01m 58s)
[16:22:32] SRE-SLO, observability, Traffic: Factor in pooled status for SLO measurements - https://phabricator.wikimedia.org/T420498#11749812 (RLazarus) We were actually just talking about this in the SLOs group last week (adding @Vgutierrez and @CDanis). In theory this is supposed to "Just Work": all our SLOs...
[16:23:44] (PS1) Dzahn: Revert^2 "releases: upgrade Java version from 17 to 21" [puppet] - https://gerrit.wikimedia.org/r/1260750
[16:28:18] Scap started
[16:30:32] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:32:56] claime: It seems scap fails at the sync canaries stage
[16:33:26] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:34:12] Apparently the eventrouter chart.metadata.version is missing?
[16:34:38] What are you deploying? Wanna move to -security maybe?
[16:34:59] Just changes to code in the private folder
[16:35:13] Sure can move to -security if you want
[16:35:32] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:36:29] (CR) Elukey: [C:+2] sre.hosts.provision: refactor bios if/else branches (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1253412 (https://phabricator.wikimedia.org/T414216) (owner: Elukey)
[16:37:05] !log ebysans@deploy2002 Started deploy [analytics/refinery@80c527b]: Regular analytics weekly train [analytics/refinery@80c527b6]
[16:37:33] (CR) JHathaway: [C:+1] sre.hosts.provision: add sys-112c-tn-configg to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - https://gerrit.wikimedia.org/r/1253448 (https://phabricator.wikimedia.org/T414216) (owner: Elukey)
[16:37:45] (CR) Elukey: [C:+2] sre.hosts.provision: add
sys-112c-tn-configg to SUPERMICRO_NO_FQDN_MANAGEMENT [cookbooks] - https://gerrit.wikimedia.org/r/1253448 (https://phabricator.wikimedia.org/T414216) (owner: Elukey)
[16:38:01] (CR) Phuedx: Test Kitchen SLOs: Renaming slos because of the Test Kitchen renaming (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1238312 (https://phabricator.wikimedia.org/T414381) (owner: Santiago Faci)
[16:38:06] (PS15) Elukey: sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216)
[16:41:10] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11749923 (KFrancis) @Scott_French I'm waiting on legal counsel. I pinged him again!
[16:41:37] !log ebysans@deploy2002 Finished deploy [analytics/refinery@80c527b]: Regular analytics weekly train [analytics/refinery@80c527b6] (duration: 04m 32s)
[16:42:18] !log Deploying Refinery as part of weekly deployment train
[16:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:49] !log ebysans@deploy2002 Started deploy [analytics/refinery@80c527b] (thin): Regular analytics weekly train THIN [analytics/refinery@80c527b6]
[16:44:48] !log ebysans@deploy2002 Finished deploy [analytics/refinery@80c527b] (thin): Regular analytics weekly train THIN [analytics/refinery@80c527b6] (duration: 01m 59s)
[16:48:41] SRE-SLO, observability, Traffic: Factor in pooled status for SLO measurements - https://phabricator.wikimedia.org/T420498#11749956 (Vgutierrez) > For instance, a recent decommissioning of codfw cp nodes (T419753) left the legacy ats-be service unavailable and caused the (depooled!) nodes to increment...
[16:50:33] (CR) SBassett: [C:+1] Add wss://*.toolforge.org to CSP [puppet] - https://gerrit.wikimedia.org/r/1260700 (https://phabricator.wikimedia.org/T420631) (owner: Esanders)
[16:50:47] Scap finished
[16:53:26] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:55:08] (CR) Gmodena: [C:+1] "Ack. WDP apps run wikikube, and we won't be impacted by this change." [deployment-charts] - https://gerrit.wikimedia.org/r/1259973 (https://phabricator.wikimedia.org/T414484) (owner: Btullis)
[16:57:36] (CR) RLazarus: [C:+2] mw-on-k8s: Add 95% paging alert on php-fpm worker saturation [alerts] - https://gerrit.wikimedia.org/r/1260231 (https://phabricator.wikimedia.org/T420679) (owner: RLazarus)
[16:59:33] (Merged) jenkins-bot: mw-on-k8s: Add 95% paging alert on php-fpm worker saturation [alerts] - https://gerrit.wikimedia.org/r/1260231 (https://phabricator.wikimedia.org/T420679) (owner: RLazarus)
[17:03:38] (CR) Jasmine: "Note on the reassignment: Reassigning 10.2.[1-2].41/32 to sophroid since wqds-internal VIPs have since been removed following [2].
This is" [dns] - https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748) (owner: Jasmine)
[17:05:58] (CR) Hnowlan: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146) (owner: Clément Goubert)
[17:06:04] (CR) Hnowlan: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146) (owner: Clément Goubert)
[17:06:21] (CR) Hnowlan: [C:+1] trafficserver: 100% of /feed/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1260690 (https://phabricator.wikimedia.org/T421233) (owner: Clément Goubert)
[17:09:19] (PS2) D3r1ck01: Enable JWTs for OAuth1 consumers and OAuth2 owner-only consumers [mediawiki-config] - https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833)
[17:17:24] (CR) Dzahn: [C:+2] Revert^2 "releases: upgrade Java version from 17 to 21" [puppet] - https://gerrit.wikimedia.org/r/1260750 (owner: Dzahn)
[17:21:57] (CR) Clare Ming: [C:+2] Test Kitchen UI: Deploy v1.2.7 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1259851 (https://phabricator.wikimedia.org/T408186) (owner: Santiago Faci)
[17:22:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-fr-tech: apply
[17:23:31] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases1003.eqiad.wmnet with reason: debug java install
[17:23:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-fr-tech: apply
[17:23:46] (Merged) jenkins-bot: Test Kitchen UI: Deploy v1.2.7 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1259851 (https://phabricator.wikimedia.org/T408186) (owner: Santiago Faci)
[17:27:25] SRE, Traffic: Deprecate low-traffic proxoid
service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11750221 (ssingh) >>! In T411097#11742616, @MLechvien-WMF wrote: > #traffic do we know when we can do this cleanup? @BCornwall and I will be working on...
[17:29:02] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply
[17:29:42] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply
[17:31:02] (PS1) Brouberol: trafficserver: enabling access to airflow-fr-tech.w.o [puppet] - https://gerrit.wikimedia.org/r/1260762 (https://phabricator.wikimedia.org/T417213)
[17:31:32] (CR) CI reject: [V:-1] trafficserver: enabling access to airflow-fr-tech.w.o [puppet] - https://gerrit.wikimedia.org/r/1260762 (https://phabricator.wikimedia.org/T417213) (owner: Brouberol)
[17:32:58] (PS2) Brouberol: trafficserver: enabling access to airflow-fr-tech.w.o [puppet] - https://gerrit.wikimedia.org/r/1260762 (https://phabricator.wikimedia.org/T417213)
[17:33:11] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11750258 (Ottomata)
[17:33:33] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11750260 (Ottomata)
[17:33:42] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11750262 (Ottomata)
[17:43:25] (PS1) Bartosz Dziewoński: [WIP] rest-gateway: Refactor request classification for readability [deployment-charts] - https://gerrit.wikimedia.org/r/1260763
[17:43:42] SRE, Data-Engineering, Data-Engineering-Radar, Traffic: Add pageview information to turnilo's
webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11750292 (Ottomata)
[17:45:40] (PS1) Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)
[17:45:54] (CR) Ssingh: "@bcornwall@wikimedia.org can help merge this." [puppet] - https://gerrit.wikimedia.org/r/1260700 (https://phabricator.wikimedia.org/T420631) (owner: Esanders)
[17:47:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259183 (https://phabricator.wikimedia.org/T419429) (owner: Aaron Schulz)
[17:49:51] (CR) BCornwall: [V:+2 C:+2] "Policy looks sound. Varnish tests are happy." [puppet] - https://gerrit.wikimedia.org/r/1260700 (https://phabricator.wikimedia.org/T420631) (owner: Esanders)
[17:51:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1256498 (https://phabricator.wikimedia.org/T420785) (owner: Scardenasmolinar)
[17:57:41] (PS1) Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109)
[18:01:23] (PS2) Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109)
[18:02:34] (PS3) Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109)
[18:02:37] (PS1) Jasmine: service::catalog: add soprhoid service catalog entry
[puppet] - https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748)
[18:03:23] (CR) CI reject: [V:-1] service::catalog: add soprhoid service catalog entry [puppet] - https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: Jasmine)
[18:04:14] (PS2) Jasmine: service::catalog: add soprhoid service catalog entry [puppet] - https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748)
[18:04:45] (CR) CI reject: [V:-1] service::catalog: add soprhoid service catalog entry [puppet] - https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: Jasmine)
[18:07:01] (PS1) CDanis: haproxy: CIDERGRINDER 🍎🐤 to all drmrs [puppet] - https://gerrit.wikimedia.org/r/1260771
[18:07:14] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1260771 (owner: CDanis)
[18:08:09] jouncebot: nowandnext
[18:08:09] No deployments scheduled for the next 1 hour(s) and 51 minute(s)
[18:08:09] In 1 hour(s) and 51 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T2000)
[18:08:12] (PS3) Jasmine: service::catalog: add soprhoid service catalog entry [puppet] - https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748)
[18:08:43] (CR) CI reject: [V:-1] service::catalog: add soprhoid service catalog entry [puppet] - https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: Jasmine)
[18:08:44] I'm going to roll envoy updates out to wikikube services in codfw
[18:09:11] (PS4) Jasmine: service::catalog: add soprhoid service catalog entry [puppet] - https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748)
[18:09:26] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/apertium: apply
[18:09:59] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/apertium:
apply
[18:10:20] (codfw still depooled for a/a services so not much to see)
[18:11:19] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[18:11:47] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[18:11:53] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[18:11:57] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:12:14] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[18:12:40] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[18:13:03] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply
[18:13:17] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply
[18:13:39] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[18:14:11] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[18:14:26] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[18:14:38] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[18:14:52] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:15:04] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:15:32] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/device-analytics: apply
[18:15:45] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply
[18:16:03] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/echostore: apply
[18:16:59] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[18:17:39] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[18:17:53] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[18:18:08] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[18:18:20] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply
[18:18:53] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[18:19:19] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[18:19:36] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[18:20:02] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[18:20:37] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[18:20:38] !log releases1003 - apt-get upgrade - envoyproxy, python3-wmflib
[18:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:53] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[18:21:18] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[18:21:38] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[18:22:00] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[18:22:43] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[18:22:48] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[18:23:34] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[18:23:46] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply
[18:23:59] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply
[18:24:46] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply
[18:25:02] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply
[18:25:16] !log dzahn@cumin2002 DONE
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on releases1003.eqiad.wmnet with reason: debug java install [18:25:35] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7002.magru.wmnet} and A:liberica [18:25:45] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on releases2003.codfw.wmnet with reason: debug java install [18:26:14] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [18:26:34] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [18:26:43] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [18:28:31] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [18:28:43] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [18:29:30] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7002.magru.wmnet} and A:liberica [18:33:49] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [18:34:05] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [18:34:19] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [18:35:13] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:37:01] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:37:07] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7001.magru.wmnet} and A:liberica [18:37:08] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [18:37:48] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [18:38:39] 06SRE, 10Maps, 07Sustainability (Incident Followup): Kartotherian dashboard links don't work - https://phabricator.wikimedia.org/T421226#11750552 (10Scott_French) p:05Triage→03Medium 
a:Scott_French It seems the linked dashboard no longer exists, and while there is a dedicated [[ https://logstash.wiki... [18:39:42] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [18:39:46] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:40:04] (PS1) ArielGlenn: rest-gateway: add values for auth-newuser rate limiting class for feature patch [deployment-charts] - https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) [18:40:43] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [18:40:59] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [18:41:13] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7001.magru.wmnet} and A:liberica [18:41:25] 7001/7002 done, I'm going to move on to rebooting the remaining codfw lvses in a moment [18:41:38] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [18:42:09] cjd91: Want to take the LVS secondary of ulsfo and reboot that with sre.loadbalancer.admin? [18:42:22] sure! 
[18:42:24] Per https://wikitech.wikimedia.org/wiki/Global_traffic_routing#Updates/reboots we'll just do the one today [18:42:30] ack [18:42:47] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [18:42:58] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [18:43:30] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [18:43:43] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [18:43:47] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [18:44:04] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply [18:44:28] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply [18:44:37] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [18:44:49] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [18:44:55] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [18:45:29] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [18:45:36] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [18:46:04] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [18:46:10] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [18:46:22] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: Planned reboot [18:46:24] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [18:46:30] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:46:46] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:46:53] !log rzl@deploy2002 helmfile [codfw] START 
helmfile.d/services/shellbox-timeline: apply [18:47:16] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [18:47:21] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [18:47:53] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [18:47:59] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply [18:48:39] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [18:48:48] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [18:49:19] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [18:49:34] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [18:49:55] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [18:50:24] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [18:50:48] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [18:51:01] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [18:51:21] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [18:51:38] ✅ [18:53:25] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on P{sessionstore[2004-2006].codfw.wmnet} and P{P:Cassandra} [18:53:40] oooh, I bet that was a "Charlie" deploy ;) [18:54:45] yep [18:56:40] FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:57:13] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2012.codfw.wmnet [19:00:11] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2012.codfw.wmnet [19:00:32] PROBLEM - PyBal 
backends health check on lvs2012 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:00:42] PROBLEM - PyBal connections to etcd on lvs2012 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [19:00:42] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2012 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [19:00:56] PROBLEM - pybal on lvs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:00:59] brett: so it seems like the cookbook is not silencing stuff again hmmm [19:01:38] (03CR) 10Daniel Kinzler: [C:03+1] rest-gateway: add values for auth-newuser rate limiting class for feature patch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [19:01:40] RESOLVED: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:20] <3 [19:05:32] RECOVERY - PyBal backends health check on lvs2012 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:05:42] RECOVERY - PyBal connections to etcd on lvs2012 is OK: OK: 6 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [19:05:56] RECOVERY - pybal on lvs2012 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:07:58] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs4010.ulsfo.wmnet} and A:liberica 
[19:10:42] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2012 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [19:11:18] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs4010.ulsfo.wmnet} and A:liberica [19:11:47] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T421278 [19:14:13] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: Planned reboot [19:14:26] brett: the question I have is [19:14:33] does explicitly running it actually downtime it? [19:14:34] SRE, Icinga, observability, Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#11750655 (BCornwall) Unsure if this is related to this particular issue but running `sre.hosts.downtime` and then `sre.hosts.reboot-single` caus... 
[19:14:35] let's check that [19:15:45] yeah, it does [19:15:53] it's definitely once the reboot cookbook is run, it gets obliterated [19:16:02] ok good point [19:17:13] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T421278 [19:17:58] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2011.codfw.wmnet [19:19:24] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: Hide version number for the anonymous role [puppet] - 10https://gerrit.wikimedia.org/r/1259254 (https://phabricator.wikimedia.org/T402844) (owner: 10Andrea Denisse) [19:20:10] FIRING: [2x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:20:56] PROBLEM - Host lvs2011 is DOWN: PING CRITICAL - Packet loss = 100% [19:20:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2011.codfw.wmnet [19:21:16] RECOVERY - Host lvs2011 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [19:21:32] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:21:56] PROBLEM - pybal on lvs2011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:22:52] FIRING: [4x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:22:53] ^ignore [19:23:03] well, not the sessionstore one :) [19:24:08] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) 
rolling reboot on P{sessionstore[2004-2006].codfw.wmnet} and P{P:Cassandra} [19:24:56] RECOVERY - pybal on lvs2011 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:25:10] RESOLVED: [4x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:25:32] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:26:42] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T421278 [19:29:03] (03PS3) 10JHathaway: nftables::client: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1260676 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [19:29:31] (03PS4) 10JHathaway: nftables::client: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1260676 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [19:30:25] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet [19:30:52] (03CR) 10CI reject: [V:04-1] nftables::client: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1260676 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [19:31:51] (03PS5) 10JHathaway: nftables::client: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1260676 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [19:34:13] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet [19:44:16] (03PS1) 10BCornwall: Add sre.cdn.roll-restart-reboot-hcaptcha-proxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1260781 [19:48:57] (03CR) 
BCornwall: [V:+1] "`" [cookbooks] - https://gerrit.wikimedia.org/r/1260781 (owner: BCornwall) [19:53:52] (CR) Ssingh: [C:+1] "🏆" [cookbooks] - https://gerrit.wikimedia.org/r/1260781 (owner: BCornwall) [19:54:31] (CR) BCornwall: [V:+1 C:+2] Add sre.cdn.roll-restart-reboot-hcaptcha-proxy [cookbooks] - https://gerrit.wikimedia.org/r/1260781 (owner: BCornwall) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T2000) [20:00:04] jdlrobson and AaronSchulz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] o/ mind if I go first? [20:00:37] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on P{sessionstore[1004-1006].eqiad.wmnet} and P{P:Cassandra} [20:01:15] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-hcaptcha-proxy rolling reboot on P{hcaptcha-proxy7002.wikimedia.org} and A:hcaptcha-proxy [20:03:25] Jdlrobson: ok [20:03:59] SRE, Icinga, observability, Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#11750819 (herron) It could be a different thing. AFAIK `sre.hosts.reboot-single` sets downtime itself as well, maybe an edge that happens when... 
[20:04:07] FIRING: ProbeDown: Service sessionstore1004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore1004-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:05:39] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-hcaptcha-proxy (exit_code=0) rolling reboot on P{hcaptcha-proxy7002.wikimedia.org} and A:hcaptcha-proxy [20:07:59] (03PS5) 10Jdlrobson: Close the legacy-vector dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255775 (https://phabricator.wikimedia.org/T421289) [20:08:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255775 (https://phabricator.wikimedia.org/T421289) (owner: 10Jdlrobson) [20:08:52] (03PS6) 10Kosta Harlan: Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [20:09:00] aokoth@cumin1003 upgrade (PID 2063845) is awaiting input [20:09:07] (03Merged) 10jenkins-bot: Close the legacy-vector dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255775 (https://phabricator.wikimedia.org/T421289) (owner: 10Jdlrobson) [20:09:12] RESOLVED: ProbeDown: Service sessionstore1004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore1004-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:40] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1255775|Close the legacy-vector dblist (T421289)]] [20:09:45] T421289: Close the legacy-vector dblist - https://phabricator.wikimedia.org/T421289 [20:09:48] AaronSchulz: want me to deploy 
yours after? or would you prefer to do? [20:10:40] Jdlrobson: I'll do it [20:10:42] eevans@cumin1003 roll-reboot (PID 2070056) is awaiting input [20:10:52] AaronSchulz: hopefully the 2 config changes will be quick [20:12:00] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1255775|Close the legacy-vector dblist (T421289)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:12:16] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T421278 [20:12:18] (03PS7) 10Kosta Harlan: Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [20:12:41] (03PS2) 10JHathaway: wmf_styleguide: don't run the delta for local CI [puppet] - 10https://gerrit.wikimedia.org/r/1260128 [20:12:45] (03CR) 10Kosta Harlan: [C:03+1] Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [20:12:53] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync [20:13:01] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T421278 [20:13:09] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T421278 [20:13:44] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:14:43] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T421278 [20:14:50] !log aokoth@cumin1003 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: 
Security Release - T421278 [20:14:57] (03PS3) 10JHathaway: wmf_styleguide: don't run the delta for local CI [puppet] - 10https://gerrit.wikimedia.org/r/1260128 [20:16:47] 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11750869 (10VRiley-WMF) Opened up a Dell ticket 224361255 awaiting response [20:17:22] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255775|Close the legacy-vector dblist (T421289)]] (duration: 07m 42s) [20:17:28] T421289: Close the legacy-vector dblist - https://phabricator.wikimedia.org/T421289 [20:17:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [20:18:36] (03Merged) 10jenkins-bot: Deploy temporary accounts to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247073 (https://phabricator.wikimedia.org/T413771) (owner: 10STran) [20:19:08] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1247073|Deploy temporary accounts to ruwiki (T413771)]] [20:19:13] T413771: Deploy temporary accounts to Russian Wikipedia - https://phabricator.wikimedia.org/T413771 [20:20:25] (03CR) 10Bking: [V:04-1] "needs an update to charts/opensearch-cluster/CHANGELOG.md as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248865 (https://phabricator.wikimedia.org/T418175) (owner: 10Btullis) [20:21:10] FIRING: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:21:28] !log jdlrobson@deploy2002 stran, jdlrobson: Backport for [[gerrit:1247073|Deploy temporary accounts to ruwiki (T413771)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:22:43] (03Abandoned) 10Bartosz Dziewoński: [WIP] rest-gateway: Refactor request classification for readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260763 (owner: 10Bartosz Dziewoński) [20:25:45] !log jdlrobson@deploy2002 stran, jdlrobson: Continuing with sync [20:30:12] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1247073|Deploy temporary accounts to ruwiki (T413771)]] (duration: 11m 04s) [20:30:17] T413771: Deploy temporary accounts to Russian Wikipedia - https://phabricator.wikimedia.org/T413771 [20:32:09] all yours AaronSchulz ! [20:32:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259183 (https://phabricator.wikimedia.org/T419429) (owner: 10Aaron Schulz) [20:33:51] (03Merged) 10jenkins-bot: Add Analytics APIs to the RestSandbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259183 (https://phabricator.wikimedia.org/T419429) (owner: 10Aaron Schulz) [20:34:13] (03PS1) 10Bking: opensearch-cluster: Add support for service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293) [20:34:26] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1259183|Add Analytics APIs to the RestSandbox (T419429)]] [20:34:31] T419429: [SPIKE?] Create an API module for the Analytics API - https://phabricator.wikimedia.org/T419429 [20:35:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11750989 (10herron) That would be nice! Off hand there's a name convention on the Kafka side as well where clusters are named like logging-eqiad, main-eqiad. I'm not sure... 
[20:35:39] (03CR) 10JHathaway: [C:03+2] wmf_styleguide: don't run the delta for local CI [puppet] - 10https://gerrit.wikimedia.org/r/1260128 (owner: 10JHathaway) [20:36:12] PROBLEM - Host sessionstore1005 is DOWN: PING CRITICAL - Packet loss = 100% [20:36:46] !log aaron@deploy2002 aaron: Backport for [[gerrit:1259183|Add Analytics APIs to the RestSandbox (T419429)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:37:59] AaronSchulz: I have a patch to sync when you're done [20:38:09] (03PS1) 10Kosta Harlan: SuggestedInvestigations: Import session into signal matching job [extensions/CheckUser] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1260797 (https://phabricator.wikimedia.org/T421062) [20:38:36] !log aaron@deploy2002 aaron: Continuing with sync [20:39:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:40:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CheckUser] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1260797 (https://phabricator.wikimedia.org/T421062) (owner: 10Kosta Harlan) [20:42:20] (03CR) 10RLazarus: [C:03+1] "Yeah, using the div is a little generic but without any article-specific content available on the static page, I guess it's our best bet! " [puppet] - 10https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: 10Genoveva Galarza) [20:42:58] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1259183|Add Analytics APIs to the RestSandbox (T419429)]] (duration: 08m 33s) [20:43:04] T419429: [SPIKE?] 
Create an API module for the Analytics API - https://phabricator.wikimedia.org/T419429 [20:43:42] kostajh: done [20:43:46] AaronSchulz: thanks [20:43:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1260797 (https://phabricator.wikimedia.org/T421062) (owner: 10Kosta Harlan) [20:46:28] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: add values for auth-newuser rate limiting class for feature patch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260774 (https://phabricator.wikimedia.org/T419796) (owner: 10ArielGlenn) [20:46:38] PROBLEM - MariaDB Replica Lag: s3 on clouddb1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.81 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:46:44] 10ops-eqiad, 06DC-Ops: sessionstore1005.eqiad.wmnet is down - https://phabricator.wikimedia.org/T421297#11751056 (10Eevans) [20:51:35] !log eevans@cumin1003 END (ERROR) - Cookbook sre.cassandra.roll-reboot (exit_code=97) rolling reboot on P{sessionstore[1004-1006].eqiad.wmnet} and P{P:Cassandra} [20:54:12] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:57:50] (03Merged) 10jenkins-bot: SuggestedInvestigations: Import session into signal matching job [extensions/CheckUser] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1260797 (https://phabricator.wikimedia.org/T421062) (owner: 10Kosta Harlan) [20:58:28] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1260797|SuggestedInvestigations: Import session into signal matching job (T421062)]] [21:00:04] Deploy window Wikifunctions Services UTC Late 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T2100) [21:01:06] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1260797|SuggestedInvestigations: Import session into signal matching job (T421062)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:04:33] !log kharlan@deploy2002 kharlan: Continuing with sync [21:08:24] 10ops-eqiad, 06DC-Ops: sessionstore1005.eqiad.wmnet is down - https://phabricator.wikimedia.org/T421297#11751169 (10Jclark-ctr) a:03Jclark-ctr [21:08:55] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1260797|SuggestedInvestigations: Import session into signal matching job (T421062)]] (duration: 10m 26s) [21:10:13] (03PS1) 10Dreamy Jazz: SI: Enable on bnwiki, itwiki, simplewiki, and plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260802 (https://phabricator.wikimedia.org/T415529) [21:16:38] (03PS1) 10JHathaway: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1260809 [21:17:13] (03PS2) 10JHathaway: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1260809 [21:17:30] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260809 (owner: 10JHathaway) [21:19:38] 06SRE, 10Beta-Cluster-Infrastructure, 05Goal, 07Technical-Debt: Minimize infrastructure differences between Beta Cluster and production - https://phabricator.wikimedia.org/T87220#11751304 (10Pppery) [21:20:08] (03Abandoned) 10JHathaway: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1260809 (owner: 10JHathaway) [21:20:32] 06SRE, 06Infrastructure-Foundations, 10Puppet CI, 10Puppet-Infrastructure, 13Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#11751314 (10jhathaway) 05Open→03Resolved instances have been deleted. 
[21:21:10] FIRING: [3x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:21] jouncebot: nowandnext [21:23:21] For the next 0 hour(s) and 36 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T2100) [21:23:21] In 0 hour(s) and 36 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T2200) [21:23:29] !log Evening UTC backport window done [21:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:33] (CR) JHathaway: [C:+1] puppetserver: emit info before deploying code [puppet] - https://gerrit.wikimedia.org/r/1260686 (owner: Hashar) [21:27:12] (CR) JHathaway: [C:+1] sre.hosts.provision: use more URIs to set Supermicro's BIOS settings [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: Elukey) [21:27:16] !log [opensearch-k8s] T414484 Getting ready to depool `dnsdisc=k8s-ingress-dse-aa,name=eqiad`, leaving codfw pooled. This will get us ready for a full rolling-upgrade of the dse-k8s-eqiad cluster tomorrow. 
[21:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:21] T414484: Upgrade DSE clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414484 [21:27:40] (PS1) Dzahn: jenkins: include docker, add comments [puppet] - https://gerrit.wikimedia.org/r/1260816 (https://phabricator.wikimedia.org/T418521) [21:27:46] (CR) JHathaway: [C:+1] sre.hosts.provision: use more URIs to set Supermicro's BIOS settings (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1253466 (https://phabricator.wikimedia.org/T414216) (owner: Elukey) [21:29:26] (PS2) Dzahn: jenkins: include docker, add comments [puppet] - https://gerrit.wikimedia.org/r/1260816 (https://phabricator.wikimedia.org/T418521) [21:30:53] !log Created cusi_case, cusi_user, and cusi_signal on bnwiki, itwiki, simplewiki, plwiki for T415529 [21:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:58] T415529: Enable SuggestedInvestigations on bnwiki, itwiki, simplewiki, plwiki - https://phabricator.wikimedia.org/T415529 [21:31:50] (CR) Dzahn: [C:+1] gerrit: use Envoy on gerrit [puppet] - https://gerrit.wikimedia.org/r/1259945 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb) [21:32:04] (CR) Dreamy Jazz: "To deploy tomorrow (Thursday)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1260802 (https://phabricator.wikimedia.org/T415529) (owner: Dreamy Jazz) [21:34:20] (CR) Dzahn: [C:+1] gerrit: forward Gitiles traffic to gerrit-replica (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: Arnaudb) [21:35:48] RECOVERY - Host sessionstore1005 is UP: PING WARNING - Packet loss = 33%, RTA = 0.32 ms [21:35:54] !log ryankemper@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-dse-aa,name=eqiad [21:36:31] (CR) Dzahn: [C:+2] Add warning of impending Etherpad deletion [puppet] - 
https://gerrit.wikimedia.org/r/1256544 (https://phabricator.wikimedia.org/T420793) (owner: Pppery) [21:38:04] !log [opensearch-k8s] T414484 Depooled eqiad; change verified working (now when I do `host k8s-ingress-dse-aa.discovery.wmnet` from `cumin1003`, and then reverse-lookup the resulting IP, I get a codfw address); so traffic is now routing to dse-k8s-codfw [21:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:10] T414484: Upgrade DSE clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414484 [21:39:04] (PS3) Dzahn: jenkins: include docker, add comments [puppet] - https://gerrit.wikimedia.org/r/1260816 (https://phabricator.wikimedia.org/T418521) [21:39:20] (PS4) Dzahn: jenkins: include docker, add comments [puppet] - https://gerrit.wikimedia.org/r/1260816 (https://phabricator.wikimedia.org/T418521) [21:41:10] RESOLVED: [2x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:43:21] (CR) Dzahn: "let's see what the difference really is to just "include profile::docker::engine" nowadays" [puppet] - https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: Hashar) [21:44:59] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/1260816/8342/contint1003.wikimedia.org/change.contint1003.wikimedia.org.err" [puppet] - https://gerrit.wikimedia.org/r/1260816 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn) [21:51:14] ops-eqiad, SRE, DC-Ops: sessionstore1005.eqiad.wmnet is down - https://phabricator.wikimedia.org/T421297#11751691 (Jclark-ctr) Update firmwares Server booted normally Bios 1.17.2 to 1.20.2 Backplane 7.10 to 7.16 Idrac 7.20.30.50 to 7.30.10.50 [21:55:16] (PS1) Dzahn: zuul::executor: add missing dummy password [labs/private]
- https://gerrit.wikimedia.org/r/1260825 (https://phabricator.wikimedia.org/T421232) [21:56:30] (CR) Dzahn: [V:+2 C:+2] zuul::executor: add missing dummy password [labs/private] - https://gerrit.wikimedia.org/r/1260825 (https://phabricator.wikimedia.org/T421232) (owner: Dzahn) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260325T2200) [22:14:40] ops-eqiad, SRE, DC-Ops: sessionstore1005.eqiad.wmnet is down - https://phabricator.wikimedia.org/T421297#11751843 (Jclark-ctr) @eevans @wiki_willy This is a repeat failure T398225 with the same error. I’ve opened a Dell support ticket (SR224365329) to determine if any components have failed and need... [22:23:31] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [22:29:24] (PS5) Jasmine: service::catalog: add sophroid service catalog entry [puppet] - https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) [22:43:49] (PS1) Dzahn: zuul: use full chain as zookeeper TLS CA bundle [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) [22:44:05] (PS1) Catrope: Add Logstash logging for successful passwordless logins [extensions/OATHAuth] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1260834 [22:44:13] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/OATHAuth] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1260834 (owner: Catrope) [22:44:57] (CR) Tim Starling: "I think if a human runs a maintenance script on the physical deployment host, then it's buyer-beware, there's no real expectation that eve" [mediawiki-config] - https://gerrit.wikimedia.org/r/1256384
(https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková) [22:45:22] (PS2) Dzahn: zuul: use full chain as zookeeper TLS CA bundle [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) [22:48:04] (CR) Dzahn: "So, as you confirm on https://phabricator.wikimedia.org/T405119#11746081 it works with the full chain. and this variable is already set to" [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [22:49:25] (CR) Dzahn: [V:-1] "https://puppet-compiler.wmflabs.org/output/1260833/8345/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [22:51:46] (CR) Dduvall: zuul: use full chain as zookeeper TLS CA bundle (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [22:51:50] (CR) Dzahn: [V:-1] "variable is empty because this needs to be in the base class" [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [22:54:34] (PS3) Dzahn: zuul: use full chain as zookeeper TLS CA bundle [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) [22:55:10] (CR) Dzahn: zuul: use full chain as zookeeper TLS CA bundle (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [22:55:57] (CR) Dduvall: zuul: use full chain as zookeeper TLS CA bundle (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [22:57:16] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1260833/8346/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[23:05:05] (CR) Dduvall: [C:+1] zuul: use full chain as zookeeper TLS CA bundle [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [23:05:35] (CR) Dzahn: [V:+1 C:+2] zuul: use full chain as zookeeper TLS CA bundle [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [23:05:42] (PS4) Dzahn: zuul: use full chain as zookeeper TLS CA bundle [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) [23:08:01] (CR) Dzahn: [C:+2] zuul: use full chain as zookeeper TLS CA bundle [puppet] - https://gerrit.wikimedia.org/r/1260833 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [23:08:23] (CR) Cwhite: [C:+1] "Seems ok to me - should be easy to roll back if it causes issues." [puppet] - https://gerrit.wikimedia.org/r/1259382 (https://phabricator.wikimedia.org/T402844) (owner: Andrea Denisse) [23:12:19] SRE, Beta-Cluster-Infrastructure, Traffic, Release-Engineering-Team (Radar): Beta cluster seems to be extremely slow for logged in user during page navigation - https://phabricator.wikimedia.org/T267435#11752112 (bd808) Open→Declined let's stop chasing this one. [23:29:11] !log zuul1001 - installed mariadb-client - connected once to zuul db on m1-master; mysql> truncate "alembic_version"; - systemctl restart zuul-web - This fixed the zuul-web service. finally no error in systemctl status.
(T405119) [23:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:17] T405119: Set up zuul web on zuul1001/zuul2001 - https://phabricator.wikimedia.org/T405119 [23:36:22] (CR) RLazarus: [C:+1] mw-*: Use envoy drain configuration everywhere (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1260096 (https://phabricator.wikimedia.org/T364245) (owner: Scott French) [23:58:23] (PS1) Dzahn: zuul::base: ensure /var/ssh/zuul exists ? WIP [puppet] - https://gerrit.wikimedia.org/r/1260847 [23:58:38] (Restored) Bartosz Dziewoński: [WIP] rest-gateway: Refactor request classification for readability [deployment-charts] - https://gerrit.wikimedia.org/r/1260763 (owner: Bartosz Dziewoński) [23:58:38] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: T421330 [23:58:43] T421330: SystemdUnitFailed - zuul-scheduler - https://phabricator.wikimedia.org/T421330 [23:59:13] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on zuul2001.codfw.wmnet with reason: T421330