[00:00:15] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192278|Disable wmgUseMdotRouting on Wikidata (T403510)]] (duration: 13m 23s)
[00:00:22] T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510
[00:06:25] FIRING: [20x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:31] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1192646
[00:08:31] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1192646 (owner: TrainBranchBot)
[00:09:22] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1192647
[00:09:26] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1192648
[00:09:30] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1192649
[00:29:48] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1192646 (owner: TrainBranchBot)
[00:33:18] (PS2) Krinkle: Disable wmgUseMdotRouting on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1192279 (https://phabricator.wikimedia.org/T403510)
[00:37:55] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1007.eqiad.wmnet with OS bookworm
[00:38:08] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11231724 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host
dbprov1007.eqiad.wmnet with OS bookworm executed with errors: - dbprov1007...
[00:48:53] (PS1) MusikAnimal: migrateFromGadget: add a few more missing transformations [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192652 (https://phabricator.wikimedia.org/T405826)
[00:49:29] (PS1) Krinkle: Disable wmgUseMdotRouting on frwiki and dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1192653 (https://phabricator.wikimedia.org/T403510)
[00:56:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:59:07] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192652 (https://phabricator.wikimedia.org/T405826) (owner: MusikAnimal)
[01:00:16] (Merged) jenkins-bot: migrateFromGadget: add a few more missing transformations [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192652 (https://phabricator.wikimedia.org/T405826) (owner: MusikAnimal)
[01:00:46] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:01:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54973 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:01:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9309 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:08:20] so nice that SpiderPig caught the wmf/next image being built… but why was that not on the Deployments calendar? or am I blind
[01:12:05] I guess I need to try again later.
The patch was merged to wmf.21 but isn't deployed yet
[01:14:20] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 33s)
[01:17:23] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1192652|migrateFromGadget: add a few more missing transformations (T405826 T404138 T404234)]]
[01:17:32] T405826: Migration script issues - https://phabricator.wikimedia.org/T405826
[01:17:32] T404138: Update migration script to map projects to tags - https://phabricator.wikimedia.org/T404138
[01:17:33] T404234: repeated twice if it was already part of the wikitext - https://phabricator.wikimedia.org/T404234
[01:22:35] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1192652|migrateFromGadget: add a few more missing transformations (T405826 T404138 T404234)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:22:43] T405826: Migration script issues - https://phabricator.wikimedia.org/T405826
[01:22:44] T404138: Update migration script to map projects to tags - https://phabricator.wikimedia.org/T404138
[01:22:45] T404234: repeated twice if it was already part of the wikitext - https://phabricator.wikimedia.org/T404234
[01:23:21] !log musikanimal@deploy2002 musikanimal: Continuing with sync
[01:25:31] I realize now I should also check the SAL to see if something is in progress
[01:28:16] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192652|migrateFromGadget: add a few more missing transformations (T405826 T404138 T404234)]] (duration: 10m 53s)
[01:28:27] T405826: Migration script issues - https://phabricator.wikimedia.org/T405826
[01:28:28] T404138: Update migration script to map projects to tags - https://phabricator.wikimedia.org/T404138
[01:28:29] T404234: repeated twice if it was already part of the wikitext - https://phabricator.wikimedia.org/T404234
[01:29:12] (PS1) MusikAnimal: Call WikiPage::doPurge
to try and clear cache after language is set [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192655 (https://phabricator.wikimedia.org/T404748)
[01:39:20] (CR) Pppery: "I have no context for what is happening here hence no useful feedback to give." [puppet] - https://gerrit.wikimedia.org/r/1192648 (owner: Ncmonitor)
[01:39:56] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192655 (https://phabricator.wikimedia.org/T404748) (owner: MusikAnimal)
[01:40:59] (Merged) jenkins-bot: Call WikiPage::doPurge to try and clear cache after language is set [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192655 (https://phabricator.wikimedia.org/T404748) (owner: MusikAnimal)
[01:41:35] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1192655|Call WikiPage::doPurge to try and clear cache after language is set (T404748)]]
[01:41:39] T404748: Newly created wishes in non-English languages do not immediately render with correct RTL and localized labels until cache is purged - https://phabricator.wikimedia.org/T404748
[01:41:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:44:52] FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:46:25] FIRING: [21x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:46:43] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1192655|Call WikiPage::doPurge to try and clear cache after language is set (T404748)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:46:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:46:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:46:46] T404748: Newly created wishes in non-English languages do not immediately render with correct RTL and localized labels until cache is purged - https://phabricator.wikimedia.org/T404748
[01:47:08] !log musikanimal@deploy2002 musikanimal: Continuing with sync
[01:47:58] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:52:21] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192655|Call WikiPage::doPurge to try and clear cache after language is set (T404748)]] (duration: 10m 47s)
[01:52:26] T404748: Newly created wishes in non-English languages do not immediately render with correct RTL and localized labels until cache is purged - https://phabricator.wikimedia.org/T404748
[01:56:38] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54975 bytes in 3.396 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:56:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9310 bytes in 3.558 second
response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:01:25] FIRING: [22x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:05:57] (PS1) MusikAnimal: AbstractRenderer: fix extistence dependency on Votes subpage [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192657
[02:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[02:17:30] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192657 (owner: MusikAnimal)
[02:18:41] (Merged) jenkins-bot: AbstractRenderer: fix extistence dependency on Votes subpage [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192657 (owner: MusikAnimal)
[02:19:14] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1192657|AbstractRenderer: fix extistence dependency on Votes subpage]]
[02:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:26:06] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1192657|AbstractRenderer: fix extistence dependency on Votes subpage]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[02:26:30] !log musikanimal@deploy2002 musikanimal: Continuing with sync
[02:31:33] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192657|AbstractRenderer: fix extistence dependency on Votes subpage]] (duration: 12m 19s)
[02:32:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:36:40] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:39:40] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:43:40] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:43:40] (PS13) Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151)
[02:43:40] (CR) Andrea Denisse: "Hi folks, I used the envoy_cluster_upstream_rq metric instead of envoy_cluster_upstream_rq_total mostly because the envoy_cluster_upstream" [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: Andrea Denisse)
[02:46:25] FIRING: [22x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:47:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:50:38] (PS14) Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151)
[02:51:33] (CR) Andrea Denisse: "Unresolving for awareness."
[alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: Andrea Denisse)
[02:51:40] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:55:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.364s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:57:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:00:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.333s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:07:40] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:07:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:08:40] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:12:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:22:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:41:56] SRE, SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11231902 (Dzahn) @EBomani You can start by taking a look at [[ https://wikitech.wikimedia.org/wiki/Bastion | the list of bastion hosts ]]. You can pick any of the bastion hosts listed...
[03:44:52] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:45:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:35:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:41:25] FIRING: [22x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:54:27] !log on x1 metawiki creating tables for CommunityRequests
[04:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:26:50] (PS1) Tim Starling: Enable CommunityRequests on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967)
[05:31:25] FIRING: [23x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:31:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:39:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:41:55] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:44:52] FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T0600)
[06:05:52] (PS1) Kosta Harlan: CreateAccount: Track interactions with the captchaWord field [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192666 (https://phabricator.wikimedia.org/T394744)
[06:06:05] (PS1) Kosta Harlan: CreateAccount: Track interactions with the captchaWord field [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192667 (https://phabricator.wikimedia.org/T394744)
[06:06:13] (PS1) Kosta Harlan: CreateAccount: Record the CAPTCHA class used in account creation funnel [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192668 (https://phabricator.wikimedia.org/T405239)
[06:06:22] (PS2) Tim Starling: Enable CommunityRequests on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967)
[06:06:22] (PS1) Tim Starling: Configure CommunityRequests virtual domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1192669 (https://phabricator.wikimedia.org/T402967)
[06:06:25] (PS1) Kosta Harlan: CreateAccount: Record the CAPTCHA class used in account creation funnel [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192670 (https://phabricator.wikimedia.org/T405239)
[06:07:10] I'm going to backport some patches to wmf.20 and wmf.21
[06:08:42] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192670 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[06:08:43] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport"
[extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192667 (https://phabricator.wikimedia.org/T394744) (owner: Kosta Harlan)
[06:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[06:16:25] FIRING: [24x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:16:47] (Merged) jenkins-bot: CreateAccount: Record the CAPTCHA class used in account creation funnel [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192670 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[06:16:55] (Merged) jenkins-bot: CreateAccount: Track interactions with the captchaWord field [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192667 (https://phabricator.wikimedia.org/T394744) (owner: Kosta Harlan)
[06:17:35] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192670|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]], [[gerrit:1192667|CreateAccount: Track interactions with the captchaWord field (T394744)]]
[06:17:43] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[06:17:44] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[06:21:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:21:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL -
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:22:36] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192670|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]], [[gerrit:1192667|CreateAccount: Track interactions with the captchaWord field (T394744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[06:22:43] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[06:22:44] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[06:23:03] (CR) Samwilson: [C:+1] Configure CommunityRequests virtual domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1192669 (https://phabricator.wikimedia.org/T402967) (owner: Tim Starling)
[06:23:38] PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[06:24:28] RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.346 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:24:55] (CR) Samwilson: [C:+1] Enable CommunityRequests on metawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: Tim Starling)
[06:26:44] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54975 bytes in 9.334 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:26:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK:
HTTP/1.1 200 OK - 9310 bytes in 9.494 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:28:38] PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[06:28:40] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:29:28] RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.346 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:31:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:35:04] !log kharlan@deploy2002 kharlan: Continuing with sync
[06:37:40] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:40:09] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192670|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]], [[gerrit:1192667|CreateAccount: Track interactions with the captchaWord field (T394744)]] (duration: 22m 34s)
[06:40:15] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[06:40:16] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[06:41:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down -
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:43:19] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192666 (https://phabricator.wikimedia.org/T394744) (owner: Kosta Harlan)
[06:43:20] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192668 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[06:45:19] (Merged) jenkins-bot: CreateAccount: Track interactions with the captchaWord field [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192666 (https://phabricator.wikimedia.org/T394744) (owner: Kosta Harlan)
[06:48:09] (PS2) Krinkle: varnish: Enable unified mobile routing on Commons [puppet] - https://gerrit.wikimedia.org/r/1192265 (https://phabricator.wikimedia.org/T403510)
[06:48:10] (PS2) Krinkle: varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510)
[06:48:10] (PS2) Krinkle: varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510)
[06:48:10] (PS2) Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510)
[06:48:11] (PS2) Krinkle: varnish: Enable unified mobile routing on en.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192272 (https://phabricator.wikimedia.org/T403510)
[06:48:18] (PS2) Krinkle: varnish: Enable unified mobile routing on de.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192267
[06:48:33] (Abandoned) Krinkle: varnish: Enable unified mobile routing on
de.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192267 (owner: Krinkle)
[06:48:54] (PS2) Krinkle: varnish: Enable unified mobile routing on ru.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192269
[06:49:03] (Abandoned) Krinkle: varnish: Enable unified mobile routing on ru.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192269 (owner: Krinkle)
[06:49:13] (PS2) Krinkle: varnish: Enable unified mobile routing on ja.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192270
[06:49:15] (Abandoned) Krinkle: varnish: Enable unified mobile routing on ja.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192270 (owner: Krinkle)
[06:51:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:54:25] (PS3) Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510)
[06:54:25] (PS3) Krinkle: varnish: Enable unified mobile routing on en.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1192272 (https://phabricator.wikimedia.org/T403510)
[06:54:36] (PS1) Bartosz Wójtowicz: ml-services: Update image version for articletopic model on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1192782 (https://phabricator.wikimedia.org/T371021)
[06:54:40] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:54:59] (03Merged) 10jenkins-bot: CreateAccount: Record the CAPTCHA class used in account creation funnel [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192668 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [06:55:40] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:55:51] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192666|CreateAccount: Track interactions with the captchaWord field (T394744)]], [[gerrit:1192668|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]] [06:55:59] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744 [06:56:00] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [06:56:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:00:26] (03CR) 10Elukey: Wikifunctions SLO: Adjust upper bucket to 10.1s to cover slow reporting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192609 (https://phabricator.wikimedia.org/T394057) (owner: 10Jforrester) [07:01:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:02:07] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192666|CreateAccount: Track interactions with the captchaWord field (T394744)]], [[gerrit:1192668|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:02:14] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744 [07:02:14] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [07:05:00] !log kharlan@deploy2002 kharlan: Continuing with sync [07:07:13] (03CR) 10JMeybohm: [C:03+1] Update eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [07:10:00] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192666|CreateAccount: Track interactions with the captchaWord field (T394744)]], [[gerrit:1192668|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]] (duration: 14m 09s) [07:10:06] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744 [07:10:08] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [07:11:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - 
cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:12:25] 06SRE, 10DNS, 06Traffic, 06Traffic-Icebox, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11232084 (10Krinkle) [07:12:25] (03CR) 10Elukey: "@jforrester@wikimedia.org I see thanks for the explanation! So the le="10.1" bucket doesnt' exists for that metric, the only one that I se" [puppet] - 10https://gerrit.wikimedia.org/r/1192609 (https://phabricator.wikimedia.org/T394057) (owner: 10Jforrester) [07:16:12] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: Update image version for articletopic model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192782 (https://phabricator.wikimedia.org/T371021) (owner: 10Bartosz Wójtowicz) [07:22:23] 06SRE, 10DNS, 06Traffic, 06Traffic-Icebox, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11232096 (10Krinkle) 05Open→03Resolved a:03Krinkle I think we can call this done. All wiki listed here, plus a dozen more that I found, have been fixed so that... [07:28:04] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update image version for articletopic model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192782 (https://phabricator.wikimedia.org/T371021) (owner: 10Bartosz Wójtowicz) [07:29:18] (03CR) 10Brouberol: [C:03+1] remove mention of druid10[07-08] in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1192147 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene) [07:29:37] (03PS1) 10Slyngshede: data.yaml: offboarding bvershbow [puppet] - 10https://gerrit.wikimedia.org/r/1192815 [07:29:50] (03Merged) 10jenkins-bot: ml-services: Update image version for articletopic model on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1192782 (https://phabricator.wikimedia.org/T371021) (owner: 10Bartosz Wójtowicz) [07:44:40] (03PS3) 10Krinkle: varnish: Enable unified mobile routing on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1192265 (https://phabricator.wikimedia.org/T403510) [07:44:52] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:45:32] (03PS3) 10Krinkle: varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510) [07:47:08] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: disable and remove partial backup [puppet] - 10https://gerrit.wikimedia.org/r/1192562 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [07:47:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1156.eqiad.wmnet with reason: Maintenance [07:47:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:47:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T401906)', diff saved to https://phabricator.wikimedia.org/P83520 and previous config saved to /var/cache/conftool/dbconfig/20251001-074736-fceratto.json [07:47:40] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [07:48:25] (03PS4) 10Krinkle: varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510) [07:48:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 
(T401906)', diff saved to https://phabricator.wikimedia.org/P83521 and previous config saved to /var/cache/conftool/dbconfig/20251001-074850-fceratto.json [07:52:17] (03PS3) 10Krinkle: varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510) [07:59:20] (03PS2) 10Jelto: gitlab: remove packages from daily full backups [puppet] - 10https://gerrit.wikimedia.org/r/1192535 (https://phabricator.wikimedia.org/T378922) [08:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T0800) [08:03:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11232192 (10jcrespo) I've finished the install following the manual migration to puppet7 instructions show at T349619, but I wonder why it wanted to setup puppet 5 in the first pla... [08:03:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P83522 and previous config saved to /var/cache/conftool/dbconfig/20251001-080357-fceratto.json [08:08:59] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [08:10:27] !log restart swift on ms-fe2012 T360913 [08:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:31] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913 [08:13:27] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [08:16:32] (03PS3) 10Jelto: gitlab: remove packages from daily full backups [puppet] - 10https://gerrit.wikimedia.org/r/1192535 (https://phabricator.wikimedia.org/T378922) [08:16:35] (03Abandoned) 10Slyngshede: Revert "P:puppetserver::volatile Include XCheeseScore private repo" [puppet] - 10https://gerrit.wikimedia.org/r/1191239 (owner: 10Slyngshede) [08:19:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P83523 and previous config saved to /var/cache/conftool/dbconfig/20251001-081905-fceratto.json [08:19:33] I have blocked MediaWiki train https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/3GE63T2LXDKYKM24QO26I3O7FWGWCANF/ [08:19:46] cause of some internal code needing adjustment for 1.44.0-wmf.21 [08:19:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [08:19:59] err 1.45.0-wmf.21 [08:20:13] (03CR) 10Jelto: [C:03+2] gitlab: remove packages from daily full backups [puppet] - 10https://gerrit.wikimedia.org/r/1192535 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:29:32] (03PS3) 10Arnaudb: gerrit: fix allowlist for mod_qos [puppet] - 10https://gerrit.wikimedia.org/r/1192831 (https://phabricator.wikimedia.org/T406017) [08:29:32] (03CR) 10Arnaudb: [C:03+2] "pre-shoting the safety revert after this is submitted" [puppet] - 10https://gerrit.wikimedia.org/r/1192831 (https://phabricator.wikimedia.org/T406017) (owner: 10Arnaudb) [08:30:02] (03PS1) 10Arnaudb: Revert "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192835 [08:30:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - 
https://phabricator.wikimedia.org/T400412#11232273 (10jcrespo) 05Open→03Resolved Regarding dbprov1007, work is completed. I also removed dbprov1007 from old puppet master (5). But feel free to coordinate with Infra... [08:31:09] 06SRE, 10Wikimedia-Mailing-lists: Set up a new mailing list for Wales - https://phabricator.wikimedia.org/T406101#11232275 (10Aklapper) @Gemma_Coleman: Hi! Per https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list, please see https://meta.wikimedia.org/wiki/Special:MyLanguage/Mailing_lists/Standardiz... [08:32:25] (03PS1) 10Btullis: Add the JupyterHub.template_paths value to the config file [puppet] - 10https://gerrit.wikimedia.org/r/1192836 (https://phabricator.wikimedia.org/T403863) [08:33:41] (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192835 (owner: 10Arnaudb) [08:33:54] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7166/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192836 (https://phabricator.wikimedia.org/T403863) (owner: 10Btullis) [08:34:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T401906)', diff saved to https://phabricator.wikimedia.org/P83524 and previous config saved to /var/cache/conftool/dbconfig/20251001-083412-fceratto.json [08:34:18] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [08:34:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance [08:34:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T401906)', diff saved to https://phabricator.wikimedia.org/P83525 and previous config saved to /var/cache/conftool/dbconfig/20251001-083435-fceratto.json [08:35:03] (03PS1) 10Arnaudb: 
Revert^2 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192839 [08:35:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T401906)', diff saved to https://phabricator.wikimedia.org/P83526 and previous config saved to /var/cache/conftool/dbconfig/20251001-083549-fceratto.json [08:39:44] (03PS5) 10Krinkle: varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510) [08:40:21] (03CR) 10CI reject: [V:04-1] varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [08:41:03] (03PS6) 10Krinkle: varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510) [08:41:20] (03PS4) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) [08:41:50] (03PS4) 10Krinkle: varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510) [08:41:57] (03CR) 10Btullis: [V:03+1 C:03+2] Add the JupyterHub.template_paths value to the config file [puppet] - 10https://gerrit.wikimedia.org/r/1192836 (https://phabricator.wikimedia.org/T403863) (owner: 10Btullis) [08:42:06] (03PS2) 10Jcrespo: Revert "dbbackups: Partial revert of dbprov1007 setup, back to dbprov1003" [puppet] - 10https://gerrit.wikimedia.org/r/1191734 [08:42:17] 06SRE, 10Wikimedia-Mailing-lists: Set up a new mailing list for Wales - https://phabricator.wikimedia.org/T406101#11232302 (10Gemma_Coleman) Hrm I read that and clearly did not understand it then! Is wikimedia-CymruWales@lists.wikimedia.org ok then? 
However we aren't a separate chapter which is why I didn't pr... [08:45:52] (03CR) 10Jcrespo: Revert "dbbackups: Partial revert of dbprov1007 setup, back to dbprov1003" [puppet] - 10https://gerrit.wikimedia.org/r/1191734 (owner: 10Jcrespo) [08:46:36] (03CR) 10Jcrespo: [C:03+2] Revert "dbbackups: Partial revert of dbprov1007 setup, back to dbprov1003" [puppet] - 10https://gerrit.wikimedia.org/r/1191734 (owner: 10Jcrespo) [08:48:14] (03CR) 10Brouberol: [C:03+1] Update the libyaml-cpp version installed on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1192560 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [08:48:29] (03PS1) 10Btullis: Add missing s to the jupyterhub template_paths configuration [puppet] - 10https://gerrit.wikimedia.org/r/1192843 (https://phabricator.wikimedia.org/T403863) [08:49:26] (03PS2) 10Btullis: Add missing s to the jupyterhub template_paths configuration [puppet] - 10https://gerrit.wikimedia.org/r/1192843 (https://phabricator.wikimedia.org/T403863) [08:50:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P83527 and previous config saved to /var/cache/conftool/dbconfig/20251001-085056-fceratto.json [08:51:33] (03CR) 10Btullis: [C:03+2] Add missing s to the jupyterhub template_paths configuration [puppet] - 10https://gerrit.wikimedia.org/r/1192843 (https://phabricator.wikimedia.org/T403863) (owner: 10Btullis) [08:51:57] (03PS3) 10Arnaudb: Revert^2 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192839 (https://phabricator.wikimedia.org/T406017) [08:51:58] (03CR) 10Arnaudb: [C:03+2] "same as the previous test, I'll issue a safety revert" [puppet] - 10https://gerrit.wikimedia.org/r/1192839 (https://phabricator.wikimedia.org/T406017) (owner: 10Arnaudb) [08:54:46] (03PS1) 10Arnaudb: Revert^3 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192844 [08:57:25] !log 
elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [08:57:26] (03CR) 10Btullis: [V:03+1 C:03+2] Update the libyaml-cpp version installed on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1192560 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [08:57:42] (03CR) 10Arnaudb: [C:03+2] "template rendering is still buggy" [puppet] - 10https://gerrit.wikimedia.org/r/1192844 (owner: 10Arnaudb) [09:00:07] (03PS1) 10Arnaudb: Revert^4 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192845 [09:00:16] (03PS1) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [09:01:26] (03CR) 10DCausse: [C:03+2] flink jobs: stop search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192590 (https://phabricator.wikimedia.org/T404605) (owner: 10DCausse) [09:01:35] jouncebot: nowandnext [09:01:35] For the next 0 hour(s) and 58 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T0800) [09:01:35] In 0 hour(s) and 58 minute(s): eqiad Wikikube kubernetes upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1000) [09:03:27] (03PS1) 10Jcrespo: installserver: Prevent dbprov1007 & dbprov2006 from full reimage [puppet] - 10https://gerrit.wikimedia.org/r/1192848 (https://phabricator.wikimedia.org/T403166) [09:03:45] (03Merged) 10jenkins-bot: flink jobs: stop search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192590 (https://phabricator.wikimedia.org/T404605) (owner: 10DCausse) [09:04:48] (03PS2) 10Jcrespo: installserver: Prevent dbprov1007 & dbprov2006 from full reimage [puppet] - 10https://gerrit.wikimedia.org/r/1192848 (https://phabricator.wikimedia.org/T403166) [09:06:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved 
to https://phabricator.wikimedia.org/P83528 and previous config saved to /var/cache/conftool/dbconfig/20251001-090604-fceratto.json [09:06:45] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:06:50] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:07:27] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:08:51] (03CR) 10Jcrespo: [C:03+2] installserver: Prevent dbprov1007 & dbprov2006 from full reimage [puppet] - 10https://gerrit.wikimedia.org/r/1192848 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [09:10:28] (03PS2) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [09:11:49] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:12:11] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:13:56] FIRING: CirrusConsumerCloudelasticFlinkJobNotRunning: ... 
[09:14:01] (03PS5) 10Krinkle: varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510) [09:14:02] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [09:14:02] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:14:29] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:15:52] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [09:16:38] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:17:11] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [09:17:36] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [09:18:38] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11232436 (10elukey) @TheDJ Hi! Quick status update so you are up to speed if anything is raised from the community (thanks a lot for what you do!). The Service Ops team is upgrading... 
[09:18:53] (03CR) 10Tiziano Fogli: [C:03+2] data-platform: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/1182848 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [09:19:35] (03PS2) 10Arnaudb: Revert^4 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192845 (https://phabricator.wikimedia.org/T406017) [09:19:35] (03CR) 10Arnaudb: [C:03+2] "template rendering is OK" [puppet] - 10https://gerrit.wikimedia.org/r/1192845 (https://phabricator.wikimedia.org/T406017) (owner: 10Arnaudb) [09:20:07] (03PS1) 10Arnaudb: Revert^5 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192849 [09:20:32] (03Merged) 10jenkins-bot: data-platform: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/1182848 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [09:21:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T401906)', diff saved to https://phabricator.wikimedia.org/P83529 and previous config saved to /var/cache/conftool/dbconfig/20251001-092112-fceratto.json [09:21:17] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [09:21:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance [09:21:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T401906)', diff saved to https://phabricator.wikimedia.org/P83530 and previous config saved to /var/cache/conftool/dbconfig/20251001-092136-fceratto.json [09:22:45] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [09:22:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T401906)', diff saved to https://phabricator.wikimedia.org/P83531 and previous config saved to /var/cache/conftool/dbconfig/20251001-092251-fceratto.json [09:23:56] FIRING: WcqsStreamingUpdaterFlinkJobNotRunning: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [09:28:39] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [09:28:48] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [09:29:10] FIRING: SLOMetricAbsent: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:30:56] hashar: I will deploy some wmf.20 / wmf.21 backports now, if you're not running the train now [09:31:04] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11232493 (10elukey) The not great news is that I see the following error in tegola's cronjobs: ` Error: error seeding tile ({Z:15 X:19137 Y:4191}): ERROR: permission denied for table... 
[09:31:11] (03PS1) 10Kosta Harlan: CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192852 (https://phabricator.wikimedia.org/T405239) [09:31:21] (03PS1) 10Kosta Harlan: CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192853 (https://phabricator.wikimedia.org/T405239) [09:32:19] kostajh: sure, please do! [09:32:37] thanks [09:32:40] (03CR) 10Hashar: [C:03+1] CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192852 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [09:32:45] (03CR) 10Hashar: [C:03+1] CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192853 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [09:33:08] hashar: can I backport both wmf.20 and wmf.21 together via spiderpig, or do they need to go out one at a time? 
[09:33:24] I am pretty sure you can do both at the same time [09:33:30] it should +2 both of them [09:33:33] ok, I'll try [09:33:35] update both branches on the deploy server [09:33:51] then `scap sync-world`, which grabs the whole source tree [09:33:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192853 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [09:33:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192852 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [09:34:04] :] [09:36:24] (03PS3) 10DCausse: flink jobs: resume search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192591 [09:36:37] 06SRE, 06Commons, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11232520 (10TheDJ) >>! In T405760#11231516, @Prototyperspective wrote: > * I also did an Internet speed test and it was as fast as it should be and again other sites like YouTube videos load... 
[09:36:44] (03CR) 10DCausse: [C:04-1] "needs to be merged after the k8s upgrade" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192591 (owner: 10DCausse) [09:36:56] FIRING: WdqsStreamingUpdaterFlinkJobNotRunning: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [09:37:13] claime: any objections to adding something like “(no other deployments)” to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1000 so it’ll show up in `jouncebot: now`? [09:37:16] hashar: kostajh, just a reminder that we'll start the k8s upgrade on wikikube eqiad in ~25 to 30 minutes so deployments will be suspended for the duration [09:37:39] Lucas_WMDE: yes please <3 [09:37:41] claime: ok, I should be done by then [09:37:44] I should have thought about that [09:37:46] * Lucas_WMDE edits [09:37:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P83532 and previous config saved to /var/cache/conftool/dbconfig/20251001-093758-fceratto.json [09:38:22] jouncebot: refresh [09:38:23] I refreshed my knowledge about deployments. 
[09:38:24] jouncebot: next [09:38:24] In 0 hour(s) and 21 minute(s): eqiad Wikikube kubernetes upgrade (no other deployments) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1000) [09:38:25] I'll let you know when I'm finished [09:38:26] whee [09:38:55] RECOVERY - snapshot of s3 in eqiad on backupmon1001 is OK: Last snapshot for s3 at eqiad (db1150) taken on 2025-10-01 08:10:03 (1160 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:39:19] kostajh: ty <3 [09:41:55] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:42:48] (03CR) 10Arnaudb: [C:03+2] Revert^5 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192849 (owner: 10Arnaudb) [09:42:54] (03Merged) 10jenkins-bot: CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192853 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [09:43:22] (03Merged) 10jenkins-bot: CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192852 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [09:43:28] (03PS1) 10Arnaudb: Revert^6 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192854 [09:43:36] (03PS1) 10Tiziano Fogli: zookeeper: remove check_prometheus, disable nrpe [puppet] - 10https://gerrit.wikimedia.org/r/1192855 (https://phabricator.wikimedia.org/T309012) [09:44:02] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192853|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]], 
[[gerrit:1192852|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]] [09:44:06] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [09:44:52] FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:12] (03PS8) 10Federico Ceratto: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 (https://phabricator.wikimedia.org/T304664) (owner: 10Jcrespo) [09:45:42] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11232563 (10elukey) So we are going to definitely show some stale tiles for the next hours @TheDJ, really sorry about it but we cannot do much at the moment. [09:47:39] FIRING: [10x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [09:48:01] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [09:50:38] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192853|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]], [[gerrit:1192852|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:50:41] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [09:52:39] FIRING: [12x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [09:53:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P83533 and previous config saved to /var/cache/conftool/dbconfig/20251001-095306-fceratto.json [09:53:54] (03CR) 10Tiziano Fogli: mediawiki-engineering: Add REST API alerts with thresholds (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: 10Andrea Denisse) [09:54:43] !log kharlan@deploy2002 kharlan: Continuing with sync [09:55:48] (03PS1) 10Elukey: Assign the ML K8s worker role to ml-serve1012 [puppet] - 10https://gerrit.wikimedia.org/r/1192856 (https://phabricator.wikimedia.org/T405891) [09:57:08] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7169/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192856 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [09:57:27] (03CR) 10JMeybohm: [C:03+1] Update eqiad to kubernetes 1.31, calico 3.29 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [09:57:39] FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [09:57:50] There is a train blocker with wmf.21 that is causing lots of errors and is from private code [09:57:53] (03CR) 10Elukey: Assign the ML K8s worker role to ml-serve1012 [puppet] - 10https://gerrit.wikimedia.org/r/1192856 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [09:58:02] claime: there's a production error tracked in T406094 that might be nice to resolve ahead of the Wikikube upgrade [09:58:26] kostajh: checking [09:58:57] (03CR) 10Klausman: [C:03+2] Assign the ML K8s worker role to ml-serve1012 [puppet] - 10https://gerrit.wikimedia.org/r/1192856 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [09:59:37] hashar: do you want to have that issue resolved before the Wikikube upgrade starts? [09:59:49] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192853|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]], [[gerrit:1192852|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]] (duration: 15m 47s) [09:59:53] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [10:00:04] kostajh: Do you have a patch ready to deploy for this? [10:00:05] claime, jelto, and jayme: Time to do the eqiad Wikikube kubernetes upgrade (no other deployments) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1000). [10:00:08] claime: I'm done with my backport [10:00:33] claime: yes, there's a fix for private code I've proposed, and there's also an option to revert the public code that caused the issue. Either is fine with me. 
[10:00:36] I don't know, I did not know about the upgrade [10:00:50] hashar: you're not on wikitech@ ? [10:01:08] regardless I don't think it matters, brennen can run the train later tonight [10:01:25] FIRING: [25x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:01:25] I think the issue is that we will continue to get bug reports until this problem is solved [10:01:46] I can roll back group 0 [10:01:47] so another option would be to roll back to wmf.20, but it seems the other options are better [10:01:55] or well, revert the faulty change [10:02:07] kostajh: how confident are you about the patch? [10:02:39] FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [10:03:10] claime: It's straightforward, but I'm prepared to be surprised. [10:03:14] well [10:03:25] can we rollback https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LoginNotify/+/1183102 [10:03:38] hashar: Both options are 15 minutes minimum anyways [10:03:44] then I guess upgrade WikiKube [10:03:48] and resume the train later tonight [10:04:01] If you want to wait a few minutes, Dreamy_Jazz is looking at the private settings patch now [10:04:12] Might as well fix forward I think, jelto jayme wdyt? 
[10:04:23] and the private settings patch can be deployed later tonight or tomorrow and checked independently [10:05:56] I think I don't fully understand the consequences of the issue [10:06:17] we should move to #mediawiki_security or the task to discuss it in more detail [10:06:29] +1 :) [10:07:39] FIRING: [16x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [10:07:56] (03PS1) 10Gergő Tisza: Enable JWT session cookies on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192857 (https://phabricator.wikimedia.org/T399631) [10:08:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T401906)', diff saved to https://phabricator.wikimedia.org/P83534 and previous config saved to /var/cache/conftool/dbconfig/20251001-100814-fceratto.json [10:08:19] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [10:08:30] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1188.eqiad.wmnet with reason: Maintenance [10:08:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T401906)', diff saved to https://phabricator.wikimedia.org/P83535 and previous config saved to /var/cache/conftool/dbconfig/20251001-100837-fceratto.json [10:08:54] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:09:15] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:09:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T401906)', diff saved to https://phabricator.wikimedia.org/P83536 and previous config saved to /var/cache/conftool/dbconfig/20251001-100951-fceratto.json [10:10:21] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:11:05] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:11:38] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:11:57] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:12:53] (03PS8) 10Daniel Kinzler: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189440 [10:13:13] (03PS20) 10Daniel Kinzler: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) [10:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [10:17:00] 06SRE, 06Commons, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11232673 (10Prototyperspective) Yes I know that's just the upper layer. Just mentioning this and the speed is much faster than what's needed to play Commons videos. Thanks for the elaboratio... 
[10:22:16] (03PS1) 10Dreamy Jazz: Revert "Replace LoginNotify::getInstance with service injection" [extensions/LoginNotify] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192860 (https://phabricator.wikimedia.org/T406094) [10:22:31] (03CR) 10Kosta Harlan: [C:03+1] Revert "Replace LoginNotify::getInstance with service injection" [extensions/LoginNotify] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192860 (https://phabricator.wikimedia.org/T406094) (owner: 10Dreamy Jazz) [10:23:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/LoginNotify] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192860 (https://phabricator.wikimedia.org/T406094) (owner: 10Dreamy Jazz) [10:23:29] we ended up deciding to revert the LoginNotify patch [10:23:38] which is the easiest/safest/fastest [10:23:50] (03CR) 10Elukey: "Left a couple of nits, plus I have a higher level question/doubt. I usually like to have a generic class that reads from hiera an array of" [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [10:24:20] other options were: rolling back the train, speedy deploy the private patches and both sounded a bit risky [10:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:25:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P83537 and previous config saved to /var/cache/conftool/dbconfig/20251001-102458-fceratto.json [10:29:12] (03CR) 10Elukey: "Hi! 
We add this fleet wide via profile::base" [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: 10Ahmon Dancy) [10:31:11] (03Merged) 10jenkins-bot: Revert "Replace LoginNotify::getInstance with service injection" [extensions/LoginNotify] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192860 (https://phabricator.wikimedia.org/T406094) (owner: 10Dreamy Jazz) [10:31:25] FIRING: [26x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:29] claime, kostajh: the revert has been merged, it is being deployed [10:31:35] ack [10:31:45] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1192860|Revert "Replace LoginNotify::getInstance with service injection" (T406094)]] [10:31:46] (03PS1) 10Samtar: EventStreamConfig and stream registration for watchlist click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575) [10:36:29] !log hashar@deploy2002 hashar, dreamyjazz: Backport for [[gerrit:1192860|Revert "Replace LoginNotify::getInstance with service injection" (T406094)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:39:17] I am checking whether I can still login [10:40:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P83538 and previous config saved to /var/cache/conftool/dbconfig/20251001-104006-fceratto.json [10:40:24] it works [10:40:28] !log hashar@deploy2002 hashar, dreamyjazz: Continuing with sync [10:42:41] hashar: nice [10:45:32] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192860|Revert "Replace LoginNotify::getInstance with service injection" (T406094)]] (duration: 13m 47s) [10:46:54] done [10:50:14] hashar: tyvm [10:50:24] I am checking logstash [10:51:14] claime: I think we are set! thanks [10:51:16] :) [10:52:04] (03CR) 10Mvolz: [C:03+2] Update zotero to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191311 (owner: 10Mvolz) [10:53:02] hashar: great thank you [10:53:44] (03Merged) 10jenkins-bot: Update zotero to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191311 (owner: 10Mvolz) [10:54:53] (03PS1) 10Btullis: Vendor the base.networkpolicy module into the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192866 (https://phabricator.wikimedia.org/T405490) [10:55:13] !log Starting eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 [10:55:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T401906)', diff saved to https://phabricator.wikimedia.org/P83539 and previous config saved to /var/cache/conftool/dbconfig/20251001-105514-fceratto.json [10:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:17] T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703 [10:55:21] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [10:55:31] !log fceratto@cumin1002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1197.eqiad.wmnet with reason: Maintenance [10:55:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T401906)', diff saved to https://phabricator.wikimedia.org/P83540 and previous config saved to /var/cache/conftool/dbconfig/20251001-105538-fceratto.json [10:56:33] (03PS1) 10Zabe: Stop setting CategoryLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192867 (https://phabricator.wikimedia.org/T299951) [10:56:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T401906)', diff saved to https://phabricator.wikimedia.org/P83541 and previous config saved to /var/cache/conftool/dbconfig/20251001-105652-fceratto.json [10:57:16] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:57:37] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:58:06] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:58:36] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [10:58:40] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [10:59:09] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [10:59:26] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [10:59:42] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [10:59:57] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [11:00:12] !log cgoubert@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-experimental: apply [11:01:10] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [11:01:26] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/ratelimit: apply [11:01:34] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [11:01:38] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [11:01:42] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/ratelimit: apply [11:02:06] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [11:02:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.542s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:02:39] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:02:48] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [11:03:03] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:03:05] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:03:08] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:03:33] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:03:53] !log cgoubert@deploy2002 Locking from deployment [ALL REPOSITORIES]: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 [11:03:58] T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703 [11:04:36] !log cgoubert@cumin1003 START - Cookbook sre.discovery.service-route depool toolhub in eqiad: maintenance [11:04:38] !log 
cgoubert@cumin1003 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool toolhub in eqiad: maintenance [11:05:46] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=toolhub.* [11:07:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.542s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:12:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P83542 and previous config saved to /var/cache/conftool/dbconfig/20251001-111159-fceratto.json [11:12:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:15:44] (03CR) 10Stevemunene: [C:03+2] admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [11:18:54] !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=thumbor.*,name=codfw [11:25:10] !log dropping two unused tables in phabricator db (T403542) [11:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:14] T403542: Drop unexpected/unneeded database tables in Phabricator - https://phabricator.wikimedia.org/T403542 [11:26:25] FIRING: [27x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P83544 and previous config saved to /var/cache/conftool/dbconfig/20251001-112707-fceratto.json [11:29:44] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=swift.*,name=eqiad [11:32:07] 06SRE, 10Wikimedia-Mailing-lists: Set up a new mailing list for Wales - https://phabricator.wikimedia.org/T406101#11232821 (10Ladsgroup) We don't really create mailing lists for a full language or a whole country. There is no germany@lists.wikimedia.org or swahili@lists.wikimedia.org. It should be either about... [11:35:37] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:35:43] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:37:05] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:37:11] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:39:05] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:39:10] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:39:12] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:39:52] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:40:05] !incidents [11:40:05] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [11:41:13] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=thumbor.*,name=eqiad [11:42:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T401906)', diff saved to https://phabricator.wikimedia.org/P83545 and previous config saved to /var/cache/conftool/dbconfig/20251001-114214-fceratto.json [11:42:19] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:42:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [11:42:36] !log manually bumped thumbor replicas in codfw to 140 [11:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1229.eqiad.wmnet with reason: Maintenance [11:43:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T401906)', diff saved to https://phabricator.wikimedia.org/P83546 and previous config saved to /var/cache/conftool/dbconfig/20251001-114259-fceratto.json [11:44:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T401906)', diff saved to https://phabricator.wikimedia.org/P83547 and previous config saved to /var/cache/conftool/dbconfig/20251001-114414-fceratto.json [11:44:52] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:48:38] !log cgoubert@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster wikikube-eqiad: eqiad 
Wikikube kubernetes cluster upgrade to 1.31 - T405703 [11:48:42] T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703 [11:49:10] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:49:17] sigh [11:49:19] adding more replicas [11:49:29] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:50:58] (03Abandoned) 10Btullis: Vendor the base.networkpolicy module into the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192866 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [11:51:38] slyngs: hnowlan: there will probably be some alerts that can't be silenced by the cookbooks once the cluster is wiped btw [11:51:57] ack [11:51:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:00] Noted, [11:52:19] trying to address the thumbor issue [11:52:28] hnowlan: you need help? 
[11:53:04] just trying to scale up but there are some scrapers we could also block [11:53:31] hitting quota sigh [11:54:10] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:54:43] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [11:54:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:55:21] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [11:55:22] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1255 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1192871 (https://phabricator.wikimedia.org/T406116) [11:55:23] these are all thumbor related I assume [11:55:35] Emperor: am I right? or a knock-on? 
[11:56:00] yeah, I see 500rps of 5xxs from swift on ATS [11:56:04] <_joe_> yes it's that quite a few requests get 5xx [11:56:04] (03CR) 10Stevemunene: Define airflow-wikidata PG cluster and airflow instance (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [11:56:18] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11232878 (10Waddie96) @elukey @Muehlenhoff Thanks for working on this! [11:56:57] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:03] cgoubert@cumin1003 wipe-cluster (PID 777396) is awaiting input [11:57:04] !incidents [11:57:04] 6810 (ACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [11:57:05] 6811 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [11:57:05] 6812 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [11:57:05] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [11:58:19] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [11:58:19] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [11:58:30] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [11:58:43] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:59:03] !log 
hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:59:05] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:59:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P83548 and previous config saved to /var/cache/conftool/dbconfig/20251001-115922-fceratto.json [11:59:27] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:59:29] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.167 second response time https://wikitech.wikimedia.org/wiki/Swift [11:59:29] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [11:59:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:59:58] !incidents [11:59:59] 6810 (ACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [11:59:59] 6811 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [11:59:59] 6812 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:00:00] 6813 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:00:00] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:00:07] !ack 6813 [12:00:08] 6813 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:00:13] trying to bump limits 
[12:00:21] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:00:26] can someone check capacity on the cluster to see how much headroom we have? [12:00:29] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 16 hosts with reason: Primary switchover x3 T406116 [12:00:33] hnowlan: on it [12:00:33] T406116: Switchover x3 master (db1258 -> db1255) - https://phabricator.wikimedia.org/T406116 [12:00:43] sorry, was eating, back now. [12:01:11] hnowlan: 1.7kCPU max [12:01:21] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:01:24] 18TiB ram [12:01:27] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.377 second response time https://wikitech.wikimedia.org/wiki/Swift [12:01:29] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [12:01:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set db1255 with weight 0 T406116', diff saved to https://phabricator.wikimedia.org/P83549 and previous config saved to /var/cache/conftool/dbconfig/20251001-120140-ladsgroup.json [12:01:43] (03PS1) 10Hnowlan: admin_ng: remove thumbor limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192873 [12:02:01] claime: thanks.
If you could ^ [12:02:19] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:19] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:21] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:21] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.755 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:21] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:29] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.167 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:29] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:34] (03CR) 10JMeybohm: [C:03+1] admin_ng: remove thumbor limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192873 (owner: 10Hnowlan) [12:02:39] hnowlan: done [12:02:46] thanks [12:03:00] (03CR) 10Hnowlan: [V:03+2 C:03+2] admin_ng: remove thumbor limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192873 (owner: 10Hnowlan) [12:03:19] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.740 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:21] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 
508 bytes in 1.330 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:23] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.037 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:25] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.013 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:27] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.315 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:29] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:30] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.198 second response time https://wikitech.wikimedia.org/wiki/Swift [12:04:10] <_joe_> ok sook I see the recoveries coming [12:04:16] (03CR) 10Ladsgroup: [C:03+2] mariadb: Promote db1255 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1192871 (https://phabricator.wikimedia.org/T406116) (owner: 10Gerrit maintenance bot) [12:04:21] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:04:22] I think the swift sadness is thumbor sadness being passed on [12:04:27] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [12:04:29] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:04:29] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[12:04:33] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.797 second response time https://wikitech.wikimedia.org/wiki/Swift [12:04:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:05:05] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:05] !incidents [12:05:05] <_joe_> uh still? [12:05:06] 6810 (ACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [12:05:06] 6811 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [12:05:06] 6812 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:05:06] 6813 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:05:06] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:05:19] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:27] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.633 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:27] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.228 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:29] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.725 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:29] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:05:30] RECOVERY - Swift 
https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.029 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:31] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:05:35] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.895 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:45] thumbor queues are still awful, error rate still high [12:05:51] <_joe_> hnowlan: is thumbor still down? yeah [12:05:58] !log Starting x3 eqiad failover from db1258 to db1255 - T406116 [12:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:02] T406116: Switchover x3 master (db1258 -> db1255) - https://phabricator.wikimedia.org/T406116 [12:06:07] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.258 second response time https://wikitech.wikimedia.org/wiki/Swift [12:06:10] we haven't bumped it a second time yet right? 
[12:06:12] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [12:06:15] doing now [12:06:19] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.524 second response time https://wikitech.wikimedia.org/wiki/Swift [12:06:19] ack [12:06:22] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [12:06:23] that should help [12:06:27] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.731 second response time https://wikitech.wikimedia.org/wiki/Swift [12:06:27] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [12:06:27] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [12:06:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1255 to x3 primary T406116', diff saved to https://phabricator.wikimedia.org/P83550 and previous config saved to /var/cache/conftool/dbconfig/20251001-120629-ladsgroup.json [12:07:21] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [12:07:57] bumping one more time [12:08:05] we might need to roll restart to dump queues :/ [12:08:09] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [12:08:17] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [12:08:19] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:08:29] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second 
response time https://wikitech.wikimedia.org/wiki/Swift [12:09:02] might need a statuspage update [12:09:04] hnowlan: you could probably do that with the next deployment, right? [12:09:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:10:00] !incidents [12:10:00] 6810 (ACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [12:10:01] 6811 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [12:10:01] 6812 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:10:01] 6813 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:10:01] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:10:06] jayme: how do you mean? [12:10:19] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [12:10:21] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [12:10:29] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [12:10:29] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:10:29] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [12:10:41] hnowlan: with --state-values-set roll_restart=1 
argument to helmfile [12:10:44] hnowlan: helmfile -e codfw --state-values-set roll_restart=1 sync [12:11:17] yeah [12:11:19] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:19] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:21] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:21] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:21] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:23] sorry I didn't understand phrasing [12:11:27] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.355 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:27] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:29] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:29] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:30] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync 
[12:11:50] might as well be my phrasing :) [12:11:59] (03PS21) 10Daniel Kinzler: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) [12:12:07] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:21] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:21] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:21] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.604 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:21] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:23] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.087 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:27] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.532 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:27] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:37] swift errors seem to be going down generally though [12:12:38] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [12:13:21] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:13:31] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.109 second response time https://wikitech.wikimedia.org/wiki/Swift [12:13:31] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.687 second response time https://wikitech.wikimedia.org/wiki/Swift [12:13:35] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.204 second response time https://wikitech.wikimedia.org/wiki/Swift [12:13:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1258 T406116', diff saved to https://phabricator.wikimedia.org/P83551 and previous config saved to /var/cache/conftool/dbconfig/20251001-121339-ladsgroup.json [12:13:44] T406116: Switchover x3 master (db1258 -> db1255) - https://phabricator.wikimedia.org/T406116 [12:14:07] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:10] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:21] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.911 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:21] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:25] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.642 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:27] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:14:29] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P83552 and previous config saved to /var/cache/conftool/dbconfig/20251001-121429-fceratto.json [12:15:13] (03PS3) 10Arnaudb: Revert^6 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192854 (https://phabricator.wikimedia.org/T406017) [12:15:13] (03CR) 10Arnaudb: [C:03+2] "with safety revert" [puppet] - 10https://gerrit.wikimedia.org/r/1192854 (https://phabricator.wikimedia.org/T406017) (owner: 10Arnaudb) [12:15:23] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [12:15:23] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Swift [12:15:25] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.945 second response time https://wikitech.wikimedia.org/wiki/Swift [12:15:26] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1258.eqiad.wmnet [12:15:27] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [12:15:29] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [12:15:31] (03PS1) 10Arnaudb: Revert^7 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192875 [12:15:35] 
!log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1258 - Upgrading db1258.eqiad.wmnet [12:15:39] the queue still looks quite large [12:15:42] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1258 - Upgrading db1258.eqiad.wmnet [12:15:45] yeah [12:16:23] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.432 second response time https://wikitech.wikimedia.org/wiki/Swift [12:16:25] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.028 second response time https://wikitech.wikimedia.org/wiki/Swift [12:16:27] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.154 second response time https://wikitech.wikimedia.org/wiki/Swift [12:16:29] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.255 second response time https://wikitech.wikimedia.org/wiki/Swift [12:16:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:17:04] at this point we have lots of capacity [12:17:07] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [12:17:23] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [12:17:29] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [12:17:40] Emperor: should we roll restart swift fes? 
[12:18:31] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [12:18:32] queue is high but coming down. I note that the envoy graphs for swift have a lot of Upstream request overflow [not exactly sure what that means], maybe envoy-on-swift-frontends still has a backlog of requests? [12:18:33] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.671 second response time https://wikitech.wikimedia.org/wiki/Swift [12:18:39] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:18:58] ...which would suggest a roll-restart might help. On it. [12:19:23] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.705 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:23] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:23] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [12:19:25] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:27] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:31] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:41] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies 
(exit_code=97) rolling restart_daemons on A:swift-fe-eqiad [12:19:43] oh, fiddlesticks, I wanted codfw not eqiad, sorry. [12:19:54] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [12:20:07] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.500 second response time https://wikitech.wikimedia.org/wiki/Swift [12:20:21] [looking at a couple of codfw frontends, envoy is oddly spiking in CPU usage] [12:20:23] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [12:20:29] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.656 second response time https://wikitech.wikimedia.org/wiki/Swift [12:20:33] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.093 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:11] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1258.eqiad.wmnet [12:21:23] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:29] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.572 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:41] !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1258* gradually with 4 steps - Work done [12:22:21] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:23] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:22:23] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:23] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:59] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [12:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:24:09] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [12:24:33] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:24:50] 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#11232959 (10AFBorchert) 05Resolved→03Open This problem reappears as of now repeatedly on Commons. 
[12:24:52] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:00] the thumbor queue is still very high [12:25:07] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [12:25:23] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Swift [12:25:25] [rolling-restart of swift frontends about half-way there now] [12:25:36] higher than before the roll-restart actually [12:26:01] I thought the expectation was that a roll-restart of thumbor would clear the queue? [12:26:11] it did [12:26:15] they just filled back up [12:27:09] hnowlan: that's not obviously visible in e.g. 
https://grafana.wikimedia.org/goto/YriZfbqNR?orgId=1 [12:27:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [12:27:39] Emperor: it is here https://grafana.wikimedia.org/goto/0_FGfb3Ng?orgId=1 [12:28:37] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:29:19] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [12:29:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T401906)', diff saved to https://phabricator.wikimedia.org/P83554 and previous config saved to /var/cache/conftool/dbconfig/20251001-122936-fceratto.json [12:29:41] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:29:41] !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=thumbor.*,name=eqiad [12:29:52] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:29:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1233.eqiad.wmnet with reason: Maintenance [12:30:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T401906)', diff saved to https://phabricator.wikimedia.org/P83555 and previous config saved to /var/cache/conftool/dbconfig/20251001-122959-fceratto.json [12:30:21] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.398 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:31:04] !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=swift.*,name=eqiad [12:31:13] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [12:31:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T401906)', diff saved to https://phabricator.wikimedia.org/P83556 and previous config saved to /var/cache/conftool/dbconfig/20251001-123115-fceratto.json [12:31:21] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [12:32:17] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.274 second response time https://wikitech.wikimedia.org/wiki/Swift [12:33:23] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.409 second response time https://wikitech.wikimedia.org/wiki/Swift [12:33:35] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.736 second response time https://wikitech.wikimedia.org/wiki/Swift [12:34:13] 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#11232974 (10AFBorchert) Associated dicussion at Commons: https://commons.wikimedia.org/wiki/Commons:Village_pump/Technical#Upload_problem [12:34:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:35:16] !incidents [12:35:17] 6810 (ACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [12:35:17] 6811 (ACKED) 
VarnishUnavailable global sre (varnish-upload thanos-rule) [12:35:17] 6812 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:35:17] 6813 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:35:17] I think we need an IC and maybe a statuspage update, this is user-visible [12:35:17] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:35:32] (03PS2) 10D3r1ck01: Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192632 [12:35:36] jelto already did a statuspage update [12:35:57] ah, cool [12:39:10] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:39:43] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [12:39:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:39:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:40:22] (03PS1) 10Elukey: profile::maps::osm_master: refactor postgres grants [puppet] - 10https://gerrit.wikimedia.org/r/1192877 (https://phabricator.wikimedia.org/T381565) [12:40:59] crawling through the logs on 1 proxy 
server, with a backtrace about write timeout; it served 206 requests in that second, of which one resulted in a 503, and that was a thumbnail write request. [12:41:09] (03CR) 10Elukey: "@mmuhlenhoff@wikimedia.org not sure if I am missing something, but I had to make these two workarounds to allow maps2011 to work properly." [puppet] - 10https://gerrit.wikimedia.org/r/1192877 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [12:41:22] sorry, request to _read_ a thumbnail [12:41:28] Emperor: could we potentially have overloaded swift with read traffic that never even hit thumbor? [12:41:36] (i.e. where I expect it would have called out to thumbor) [12:41:57] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:42:20] So the proxy-server backtrace for a failed write timeout corresponded as best as I can tell to an incoming GET for a thumb [12:42:53] thumbor has recovered, queues are at 0 and errors are reasonable [12:42:59] (03PS22) 10Daniel Kinzler: api-gateway: Add rate limiting for REST gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) [12:43:23] swift errors are declining too [12:44:10] (03PS1) 10Daniel Kinzler: api-gateway: support custom rate limit groups for rest gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192879 [12:44:51] RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:46:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P83558 and previous config saved to 
/var/cache/conftool/dbconfig/20251001-124622-fceratto.json [12:47:03] 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#11233038 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon This should have been a new issue, but in any case, we published a [[https://www.wikimediastatus.net/inciden... [12:48:13] (03PS1) 10Ladsgroup: db1172: Upgrade to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1192880 (https://phabricator.wikimedia.org/T406008) [12:48:38] !incidents [12:48:38] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:48:39] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [12:48:39] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:48:39] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [12:48:39] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [12:49:12] hnowlan: I think we're back to normal operation now; shall we close out the incident, or leave it a little first? 
[12:50:43] I think we're mostly good, we're following up on traffic patterns [12:50:46] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1172.eqiad.wmnet with reason: Upgrade to 10.11 [12:51:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1172 for upgrade T406008', diff saved to https://phabricator.wikimedia.org/P83559 and previous config saved to /var/cache/conftool/dbconfig/20251001-125120-ladsgroup.json [12:51:25] T406008: Migrate s8 to 10.11 - https://phabricator.wikimedia.org/T406008 [12:53:10] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=swift.*,name=eqiad [12:53:26] (03PS2) 10Ladsgroup: db1172: Upgrade to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1192880 (https://phabricator.wikimedia.org/T406008) [12:54:05] (03CR) 10Ladsgroup: [V:03+2 C:03+2] db1172: Upgrade to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1192880 (https://phabricator.wikimedia.org/T406008) (owner: 10Ladsgroup) [12:56:00] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=thumbor.*,name=eqiad [13:01:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P83561 and previous config saved to /var/cache/conftool/dbconfig/20251001-130131-fceratto.json [13:05:43] (03PS2) 10DDesouza: Update and deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192635 (https://phabricator.wikimedia.org/T405410) [13:07:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192635 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [13:07:11] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1258* gradually with 4 steps - Work done [13:07:57] 
10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11233092 (10Jclark-ctr) Replaced Failed Drive [13:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:10:07] (03CR) 10Clément Goubert: [C:03+2] admin_ng: Change eqiad pod ip range to 10.67.128.0/17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191647 (https://phabricator.wikimedia.org/T375845) (owner: 10Jelto) [13:10:13] (03CR) 10Clément Goubert: [C:03+2] Update eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [13:10:15] (03CR) 10Clément Goubert: [C:03+2] Update eqiad pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1191652 (https://phabricator.wikimedia.org/T375845) (owner: 10Jelto) [13:10:21] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-k8s-ssl_6543: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worker1108. 
[13:10:21] net, wikikube-worker1116.eqiad.wmnet, wikikube-worker1281.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker1168.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube [13:10:21] 037.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worker1119.eqiad.wmnet, wikikube-worker1162.eqiad.wmnet, wikikube-worker1130.eqiad.wmnet, wikikube-worker1143.eqiad.wmnet, wik https://wikitech.wikimedia.org/wiki/PyBal [13:10:23] (03CR) 10Clément Goubert: [C:03+2] Update eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [13:10:23] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-k8s-ssl_6543: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1094. 
[13:10:23] net, wikikube-worker1076.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker1168.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube [13:10:23] 278.eqiad.wmnet, wikikube-worker1119.eqiad.wmnet, wikikube-worker1135.eqiad.wmnet, wikikube-worker1098.eqiad.wmnet, wikikube-worker1309.eqiad.wmnet, wikikube-worker1143.eqiad.wmnet, wik https://wikitech.wikimedia.org/wiki/PyBal [13:10:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool db1172 after upgrade T406008', diff saved to https://phabricator.wikimedia.org/P83563 and previous config saved to /var/cache/conftool/dbconfig/20251001-131033-ladsgroup.json [13:10:40] T406008: Migrate s8 to 10.11 - https://phabricator.wikimedia.org/T406008 [13:10:53] !incidents [13:10:54] 6814 (UNACKED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [13:10:54] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:10:54] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [13:10:54] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [13:10:55] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [13:10:55] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:11:03] !ack 6814 [13:11:04] 6814 (ACKED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [13:11:07] tappof: ^ :) [13:11:57] cgoubert@cumin1003 wipe-cluster (PID 777396) is awaiting input [13:11:58] FIRING: [25x] ProbeDown: Service chart-renderer:30443 has failed probes (http_chart-renderer_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:12:36] !incidents [13:12:37] 6814 (ACKED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [13:12:37] 6815 (ACKED) [25x] ProbeDown sre (ip4 probes/service eqiad) [13:12:37] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:12:37] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [13:12:37] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [13:12:38] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [13:12:38] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:12:47] ProbeDown are expected [13:13:14] (03PS12) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 [13:13:41] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [13:13:43] (03PS1) 10Arnaudb: gerrit: toggle mod_qos log only [puppet] - 10https://gerrit.wikimedia.org/r/1192882 (https://phabricator.wikimedia.org/T406017) [13:14:51] FIRING: [12x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:15:03] hnowlan: /var/lib/o11y-metamonitoring/deadmanswitchamhook/prometheus_k8s_eqiad has a timestamp older than 600 [13:15:23] tappof: ah nice, good to know [13:15:26] (03PS1) 10D3r1ck01: session: Handle an edge-case in MultiBackendSessionStore::set() [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192884 (https://phabricator.wikimedia.org/T402808) [13:15:27] SwaggerProbeHasFailures expected [13:15:44] (03PS1) 10D3r1ck01: session: Handle an edge-case in 
MultiBackendSessionStore::set() [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192885 (https://phabricator.wikimedia.org/T402808) [13:15:52] thumbor looks fine [13:16:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T401906)', diff saved to https://phabricator.wikimedia.org/P83564 and previous config saved to /var/cache/conftool/dbconfig/20251001-131639-fceratto.json [13:16:45] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:16:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [13:17:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192884 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:17:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1254.eqiad.wmnet with reason: Maintenance [13:17:19] (03Merged) 10jenkins-bot: admin_ng: Change eqiad pod ip range to 10.67.128.0/17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191647 (https://phabricator.wikimedia.org/T375845) (owner: 10Jelto) [13:17:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1254 (T401906)', diff saved to https://phabricator.wikimedia.org/P83565 and previous config saved to /var/cache/conftool/dbconfig/20251001-131719-fceratto.json [13:17:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192885 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:18:35] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192632 (owner: 10D3r1ck01) [13:18:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T401906)', diff saved to https://phabricator.wikimedia.org/P83566 and previous config saved to /var/cache/conftool/dbconfig/20251001-131836-fceratto.json [13:19:10] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:20:04] (03Merged) 10jenkins-bot: Update eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [13:20:26] (03CR) 10CI reject: [V:04-1] WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 (owner: 10Elukey) [13:20:47] hnowlan: /buffer 4 [13:23:22] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:24:20] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:24:50] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:24:56] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:25:44] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11233155 (10Ladsgroup) Thanks! 
[13:26:25] FIRING: [28x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:28:32] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:28:39] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:29:54] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:30:00] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:30:11] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:30:16] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:30:30] cgoubert@cumin1003 wipe-cluster (PID 777396) is awaiting input [13:30:31] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [13:30:41] 10SRE-SLO, 10EditCheck, 06Editing-team (Kanban Board), 07Essential-Work, 05Goal: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11233174 (10ppelberg) [13:30:51] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:30:57] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [13:31:38] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [13:31:44] !log cgoubert@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:33:28] !log cgoubert@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:33:33] !log cgoubert@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[13:33:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P83568 and previous config saved to /var/cache/conftool/dbconfig/20251001-133344-fceratto.json [13:33:49] !log cgoubert@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:33:53] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [13:34:29] (03PS1) 10Btullis: Remove the airflow profile from the analytics_cluster::launcher role [puppet] - 10https://gerrit.wikimedia.org/r/1192889 (https://phabricator.wikimedia.org/T402943) [13:34:48] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:34:53] !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [13:34:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405973#11233204 (10phaultfinder) [13:35:10] (03PS1) 10Bking: wdqs-scholarly: Add wdqs2016 to load balancer pool [puppet] - 10https://gerrit.wikimedia.org/r/1192890 (https://phabricator.wikimedia.org/T405978) [13:35:12] !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[13:35:18] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.wipe-cluster (exit_code=99) Wipe the K8s cluster wikikube-eqiad: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 [13:35:22] T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703 [13:35:51] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7170/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192889 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [13:37:36] (03CR) 10Btullis: [C:03+1] wdqs-scholarly: Add wdqs2016 to load balancer pool [puppet] - 10https://gerrit.wikimedia.org/r/1192890 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [13:38:54] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1192890 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [13:39:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:40:30] here [13:40:43] thumbor is fine? [13:41:40] what [13:41:55] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:41:57] stale alert paging again? 
[13:42:21] !incidents [13:42:22] 6814 (ACKED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [13:42:22] 6815 (ACKED) [25x] ProbeDown sre (ip4 probes/service eqiad) [13:42:22] 6816 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:42:22] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:42:23] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [13:42:23] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [13:42:23] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [13:42:23] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:44:37] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl[1001-1004].eqiad.wmnet [13:44:37] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-ctrl[1001-1004].eqiad.wmnet [13:44:40] !log Deployed refinery-source using jenkins(weekly deployment train) [13:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:52] FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:34] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'. 
[13:47:25] (03Abandoned) 10Arnaudb: Revert^7 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192875 (owner: 10Arnaudb) [13:47:50] (03CR) 10Bking: [C:03+1] Remove the airflow profile from the analytics_cluster::launcher role [puppet] - 10https://gerrit.wikimedia.org/r/1192889 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [13:48:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P83569 and previous config saved to /var/cache/conftool/dbconfig/20251001-134852-fceratto.json [13:49:02] (03CR) 10Bking: [C:03+2] wdqs-scholarly: Add wdqs2016 to load balancer pool [puppet] - 10https://gerrit.wikimedia.org/r/1192890 (https://phabricator.wikimedia.org/T405978) (owner: 10Bking) [13:49:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:51:26] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 239 hosts with reason: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 [13:51:32] T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703 [13:51:57] (03CR) 10Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: 10Andrea Denisse) [13:52:07] (03PS1) 10Elukey: Set ml-serve1012 as GPU k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1192894 (https://phabricator.wikimedia.org/T405891) [13:52:08] 06SRE, 06collaboration-services, 10envoy, 06serviceops, 13Patch-For-Review: 
Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11233317 (10Eevans) The RESTBase cluster has been upgraded to v1.29.12 (sorry for the delay, I was out all last week and missed the message). [13:52:36] !incidents [13:52:36] 6815 (ACKED) [25x] ProbeDown sre (ip4 probes/service eqiad) [13:52:36] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [13:52:36] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:52:37] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:52:37] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [13:52:37] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [13:52:37] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [13:52:38] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:53:41] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[13:54:09] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on alert1002 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:54:21] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on alert1002 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:54:57] (03CR) 10Elukey: [C:03+2] Set ml-serve1012 as GPU k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1192894 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [13:56:41] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=wdqs2016\.codfw\.wmnet [13:58:38] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [13:59:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405973#11233358 (10phaultfinder) [14:00:19] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [14:01:40] FIRING: [5x] KubernetesRsyslogDown: rsyslog on wikikube-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:01:48] !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=toolhub.* [14:02:59] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/apertium: apply [14:03:31] (03PS3) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 
10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [14:03:50] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:04:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T401906)', diff saved to https://phabricator.wikimedia.org/P83570 and previous config saved to /var/cache/conftool/dbconfig/20251001-140400-fceratto.json [14:04:04] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:04:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1259.eqiad.wmnet with reason: Maintenance [14:04:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1259 (T401906)', diff saved to https://phabricator.wikimedia.org/P83571 and previous config saved to /var/cache/conftool/dbconfig/20251001-140422-fceratto.json [14:04:27] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [14:04:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:04:52] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a3-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406025#11233375 (10phaultfinder) [14:05:08] !log 
cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [14:05:15] !incidents [14:05:15] 6815 (ACKED) [25x] ProbeDown sre (ip4 probes/service eqiad) [14:05:15] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:05:15] 6817 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:05:16] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [14:05:16] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:05:16] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:05:16] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [14:05:16] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:05:17] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [14:05:17] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:05:24] !ack 6817 [14:05:24] 6817 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:05:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T401906)', diff saved to https://phabricator.wikimedia.org/P83572 and previous config saved to /var/cache/conftool/dbconfig/20251001-140538-fceratto.json [14:05:51] Emperor: is swift looking okay? 
seems like there's an elevated level of errors but just from esams so seems unlikely [14:06:00] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [14:06:12] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply [14:06:14] hnowlan: I'm deploying thumbor in eqiad rn, will be ready to repool soon tm [14:06:22] ack [14:06:26] thumbor itself looks fine [14:06:37] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:06:53] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [14:06:53] slyngs: could you look at the above alert please? i'm in a meeting [14:06:59] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:07:03] Sure [14:07:56] hnowlan: meeting right now, do I need to drop? [14:08:22] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [14:08:28] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:08:51] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:09:09] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [14:09:10] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:48] Emperor: not urgent I think [14:09:52] FIRING: [2x] ProbeDown: Service k8s-ingress-wikikube:30443 has failed probes (tcp_k8s-ingress-wikikube_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:55] Emperor: hnowlan 
I can repool thumbor in eqiad now [14:10:15] y/n? [14:11:24] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [14:11:30] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:11:35] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:11:40] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [14:11:46] Errors doesn't look elevated in Grafana [14:11:47] (03Abandoned) 10Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1189502 (owner: 10Elukey) [14:12:57] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [14:13:07] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply [14:13:30] (03PS1) 10Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 [14:14:10] RESOLVED: [4x] ProbeDown: Service chart-renderer:30443 has failed probes (http_chart-renderer_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:12] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [14:14:18] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply [14:14:18] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [14:14:21] (03CR) 10Stevemunene: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1192889 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [14:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [14:15:06] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [14:15:28] 06SRE, 10Cloud-VPS, 06DC-Ops, 06cloud-services-team (FY2025/26-Q1), 13Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11233469 (10taavi) p:05Triage→03Medium [14:15:47] Ever so slightly elevated compared to the alerting limit of 3 req/s [14:16:20] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [14:16:34] Ok I'm repooling thumbor and swift in eqiad hnowlan Emperor [14:16:45] !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=thumbor.*,name=eqiad [14:16:54] !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=swift.*,name=eqiad [14:16:59] !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=thumbor.*,name=codfw [14:17:03] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [14:18:19] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [14:18:25] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [14:18:42] !incidents [14:18:42] 6815 (ACKED) [25x] ProbeDown sre (ip4 probes/service eqiad) [14:18:42] 6817 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:18:43] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [14:18:43] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:18:43] 6813 
(RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:18:43] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [14:18:44] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:18:44] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [14:18:44] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:18:47] (03PS2) 10Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 [14:18:58] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [14:19:33] claime: go, sorry [14:19:38] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [14:19:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:19:44] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [14:19:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406015#11233490 (10phaultfinder) [14:20:48] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [14:20:57] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [14:21:06] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/echostore: apply [14:21:52] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [14:22:06] The ATSBackendErrorsHigh looks to be going down [14:22:06] !log cgoubert@deploy2002 helmfile 
[eqiad] START helmfile.d/services/edit-analytics: apply [14:22:37] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [14:23:06] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [14:23:17] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [14:24:07] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [14:24:22] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [14:24:30] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [14:24:36] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [14:24:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:24:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:24:51] (03CR) 10CI reject: [V:04-1] WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 (owner: 10Elukey) [14:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:24:57] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [14:24:59] !log cgoubert@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 (duration: 201m 05s) [14:25:02] T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703 [14:25:25] !log cgoubert@deploy2002 Started scap sync-world: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 [14:25:33] !incidents [14:25:33] 6815 (ACKED) [25x] ProbeDown sre (ip4 probes/service eqiad) [14:25:34] 6817 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:25:34] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [14:25:34] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:25:34] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:25:34] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [14:25:35] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:25:35] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [14:25:35] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:25:37] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [14:26:00] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [14:26:08] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [14:26:31] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [14:26:51] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [14:27:27] RECOVERY - MegaRAID on 
db1152 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:28:14] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [14:28:19] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [14:29:10] RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:13] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [14:29:19] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [14:29:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:29:56] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [14:30:05] claime, jelto, and jayme: I, the Bot under the Fountain, call upon thee, The Deployer, to do eqiad Wikikube kubernetes upgrade (no other deployments) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1000). 
[14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1430) [14:30:31] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [14:30:43] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [14:30:58] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [14:31:18] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [14:31:35] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [14:31:41] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [14:32:12] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [14:32:20] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [14:32:23] (03CR) 10Btullis: [V:03+1 C:03+2] Remove the airflow profile from the analytics_cluster::launcher role [puppet] - 10https://gerrit.wikimedia.org/r/1192889 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis) [14:32:50] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Remove no-op maxconn statement [puppet] - 10https://gerrit.wikimedia.org/r/1192899 (https://phabricator.wikimedia.org/T406010) [14:33:14] (03CR) 10Slyngshede: "I'm not sure if we necessarily want to dynamically load the Lua files, but it's an option." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [14:33:23] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [14:33:29] (03PS2) 10Majavah: P:toolforge::k8s::haproxy: Remove no-op maxconn statement [puppet] - 10https://gerrit.wikimedia.org/r/1192899 (https://phabricator.wikimedia.org/T406010) [14:33:31] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [14:33:59] * Emperor out of meeting; everything looks good now? [14:34:12] Yes, errors are back down to normal levels [14:34:32] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [14:34:51] FIRING: [14x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:35:33] 👍 [14:36:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:37:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11233546 (10elukey) I managed to have an idrac upgrade triggered by the cookbook, but it then failed when checking the state of the idrac (that was down because... 
[14:37:58] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [14:38:06] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [14:38:27] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [14:38:34] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [14:38:45] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [14:38:51] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:39:51] FIRING: [13x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:40:41] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [14:40:41] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:40:46] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:40:52] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [14:40:59] (03CR) 10FNegri: [C:03+1] P:toolforge::k8s::haproxy: Remove no-op maxconn statement [puppet] - 10https://gerrit.wikimedia.org/r/1192899 (https://phabricator.wikimedia.org/T406010) (owner: 10Majavah) [14:41:04] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:41:07] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:41:14] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [14:41:19] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:41:25] FIRING: [29x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:42:03] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Remove no-op maxconn statement [puppet] - 10https://gerrit.wikimedia.org/r/1192899 (https://phabricator.wikimedia.org/T406010) (owner: 10Majavah) [14:43:37] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:43:45] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:44:10] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:44:14] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:44:24] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:44:30] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [14:44:43] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [14:44:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:44:51] FIRING: [11x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:44:56] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [14:45:12] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [14:45:16] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [14:46:23] (03CR) 
10Vgutierrez: "correct me if I'm wrong but the current implementation won't reload main.lua ever" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [14:46:46] (03CR) 10Ahmon Dancy: osm_master: Create /etc/wikimedia directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: 10Ahmon Dancy) [14:48:35] (03PS1) 10Majavah: P:toolforge::prometheus: Add external label with project [puppet] - 10https://gerrit.wikimedia.org/r/1192904 (https://phabricator.wikimedia.org/T406010) [14:49:32] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [14:49:52] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060#11233583 (10Jhancock.wm) machine in warranty. requested drive replacement from dell. SR216590250 [14:49:53] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7171/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192904 (https://phabricator.wikimedia.org/T406010) (owner: 10Majavah) [14:50:01] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060#11233584 (10Jhancock.wm) a:03Jhancock.wm [14:51:06] (03CR) 10Majavah: osm_master: Create /etc/wikimedia directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: 10Ahmon Dancy) [14:51:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406015#11233585 (10Jhancock.wm) 05Open→03Resolved [14:52:02] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11233586 (10elukey) Maps back to be served only by the 
old stack, the k8s maintenance is completed. I am going warm up the tegola's cache in codfw properly, but we have a good indica... [14:53:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405973#11233587 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:55:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a3-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406025#11233590 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:56:56] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1192904 (https://phabricator.wikimedia.org/T406010) (owner: 10Majavah) [14:57:11] (03Abandoned) 10Federico Ceratto: es2049.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1184092 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [14:58:04] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::prometheus: Add external label with project [puppet] - 10https://gerrit.wikimedia.org/r/1192904 (https://phabricator.wikimedia.org/T406010) (owner: 10Majavah) [14:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:27] (03Abandoned) 10Federico Ceratto: Prepare new es2* nodes to replace old ones [puppet] - 10https://gerrit.wikimedia.org/r/1182507 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [14:58:48] FIRING: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:01:56] (03CR) 10Herron: [V:03+1] "This follows the multi-instance pattern from our prometheus puppetization with 
profiles for each instance. The instances would be main/pi" [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [15:04:25] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:04:34] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:04:41] (03PS1) 10Federico Ceratto: preseed.yaml: Remove es2051 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1192905 (https://phabricator.wikimedia.org/T402859) [15:05:21] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [15:07:05] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [15:09:10] FIRING: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:52] RESOLVED: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:11:58] FIRING: [15x] ProbeDown: Service mw-api-ext-next:4455 has failed probes (http_mw-api-ext-next_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:12:40] <_joe_> claime: is that because of your work I 
guess? [15:13:22] downtimes expired if I had to guess [15:15:05] (03PS1) 10Kgraessle: set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192907 (https://phabricator.wikimedia.org/T400727) [15:15:49] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [15:16:05] yep one downtime expired 4 minutes ago for this service, I can re-create it with 30m downtime [15:16:13] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [15:16:17] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:16:25] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [15:16:43] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [15:17:16] !incidents [15:17:17] 6815 (ACKED) [25x] ProbeDown sre (ip4 probes/service eqiad) [15:17:17] 6817 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [15:17:17] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [15:17:17] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [15:17:18] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [15:17:18] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [15:17:18] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [15:17:18] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [15:17:19] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [15:17:28] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [15:17:33] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [15:17:39] !log cgoubert@deploy2002 helmfile [eqiad] START 
helmfile.d/services/mw-web: apply [15:18:12] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:18:22] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [15:18:37] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [15:18:51] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [15:19:04] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [15:19:09] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [15:19:10] FIRING: [4x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:19:52] RESOLVED: [4x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:55] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [15:20:20] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [15:20:32] (03PS2) 10Ahmon Dancy: Add traindev-staging environment for mw-web and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) [15:20:47] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply 
[15:20:55] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [15:21:13] (03PS3) 10Ahmon Dancy: Add traindev-staging environment for mw-web and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) [15:21:20] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [15:21:27] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [15:21:31] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [15:21:36] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [15:21:53] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [15:22:05] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:22:14] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:23:04] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [15:23:17] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [15:23:34] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [15:24:09] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [15:24:10] FIRING: [5x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:24:25] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [15:24:46] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [15:24:51] FIRING: [4x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:24:52] RESOLVED: [6x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:49] !log cgoubert@deploy2002 Started scap sync-world: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 [15:25:53] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [15:25:53] T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703 [15:26:12] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [15:26:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool db1259 after maint T401906', diff saved to https://phabricator.wikimedia.org/P83573 and previous config saved to /var/cache/conftool/dbconfig/20251001-152620-ladsgroup.json [15:26:25] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:26:32] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:26:45] (03PS1) 10Sergio Gimeno: Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382) [15:26:47] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:26:53] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [15:27:00] (03CR) 10Ladsgroup: [C:03+1] preseed.yaml: Remove es2051 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1192905 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) 
[15:27:16] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [15:27:18] (03PS1) 10Cwhite: logstash: w3creportingapi drop canary events [puppet] - 10https://gerrit.wikimedia.org/r/1192914 (https://phabricator.wikimedia.org/T304373) [15:27:59] !log cgoubert@deploy2002 Finished scap sync-world: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 (duration: 03m 16s) [15:29:33] (03PS1) 10Kgraessle: set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192907 (https://phabricator.wikimedia.org/T400727) [15:29:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:30:05] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [15:30:42] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [15:31:03] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply [15:31:42] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [15:32:04] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:32:14] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:32:43] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [15:33:10] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [15:33:15] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [15:33:36] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [15:33:42] !log cgoubert@deploy2002 helmfile [eqiad] START 
helmfile.d/services/wikifunctions: apply [15:34:10] FIRING: [2x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:34:21] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:34:26] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [15:34:44] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [15:34:51] RESOLVED: [4x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:34:52] FIRING: [6x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:34:52] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:40] !log Finished eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 [15:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:44] T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703 [15:36:10] (03PS1) 10JHathaway: acme-chief: remove hiera purge guard [puppet] - 10https://gerrit.wikimedia.org/r/1192917 (https://phabricator.wikimedia.org/T401858) [15:36:12] (03CR) 10DCausse: [C:03+2] flink jobs: resume search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192591 (owner: 10DCausse) [15:36:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit 
httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:37:21] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:38:13] (03Merged) 10jenkins-bot: flink jobs: resume search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192591 (owner: 10DCausse) [15:38:23] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:39:15] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:39:20] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:21] hashar: we're done with the maintenance, so the train can continue running or whatever it is it does :) [15:41:25] FIRING: [29x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:41:39] claime: awesome!! 
And sorry for the delay earlier today [15:42:29] releng has its team meeting in ~ 25 minutes, I'll talk about the train and I guess it will be resumed at the usual late UTC evening window [15:42:39] FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:42:55] I sense a chance of posting train photos without being fully off-topic so: https://commons.wikimedia.org/wiki/File:ArcticRail_Dr16_2811_Tampere_2025-09-30.jpg [15:43:33] that is a nice one taavi! [15:44:00] (03PS1) 10Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) [15:46:25] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11233800 (10Jhancock.wm) [15:46:29] taavi: {done} https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys [15:46:30] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:46:36] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:47:32] (03CR) 10Cwhite: [C:03+2] logstash: w3creportingapi drop canary events [puppet] - 10https://gerrit.wikimedia.org/r/1192914 (https://phabricator.wikimedia.org/T304373) (owner: 10Cwhite) [15:47:39] FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:47:42] RESOLVED: [4x] ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:47:43] (03PS2) 10Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) [15:47:58] (03CR) 10Federico Ceratto: [C:03+2] preseed.yaml: Remove es2051 from preseeding [puppet] - 10https://gerrit.wikimedia.org/r/1192905 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [15:48:07] !incidents [15:48:08] 6815 (RESOLVED) [25x] ProbeDown sre (ip4 probes/service eqiad) [15:48:08] 6817 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [15:48:08] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [15:48:08] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [15:48:08] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [15:48:09] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [15:48:09] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [15:48:09] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [15:48:09] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [15:48:18] ah ok [15:49:13] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:49:18] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:51:20] !log dcausse@deploy2002 
helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:51:32] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:51:34] (03PS1) 10Kgraessle: Enable AutoModerator on Italian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) [15:51:45] (03PS1) 10Kosta Harlan: SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192922 [15:52:02] (03PS1) 10Kosta Harlan: SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter [extensions/ConfirmEdit] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192923 [15:52:39] RESOLVED: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:52:42] (03PS1) 10Kosta Harlan: CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192924 (https://phabricator.wikimedia.org/T405239) [15:52:55] (03PS1) 10Kosta Harlan: CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192925 (https://phabricator.wikimedia.org/T405239) [15:53:08] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060#11233835 (10Jhancock.wm) and rejected. cause the TSR report doesn't show a disk error. The report actually shows an indetereminate bus error. 
this could actually be the drive error but i can't tell... [15:53:19] jouncebot: nowandnext [15:53:19] No deployments scheduled for the next 1 hour(s) and 6 minute(s) [15:53:20] In 1 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1700) [15:53:29] hashar: are you doing the train now, or can I run some more backports? [15:54:01] go ahead with backports! [15:54:22] we will do the train in a couple hours via https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1800 [15:54:26] ok [15:54:32] I'll mention it in the releng team meeting [15:54:53] (03PS3) 10Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) [15:55:29] (03Abandoned) 10Btullis: spark: authorize communication between executors on blockManager port [deployment-charts] - 10https://gerrit.wikimedia.org/r/902409 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [15:56:05] (03Abandoned) 10Btullis: spark: add hadoop conf configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/902402 (https://phabricator.wikimedia.org/T332909) (owner: 10Nicolas Fraison) [15:56:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192922 (owner: 10Kosta Harlan) [15:56:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192924 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [15:56:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192925 (https://phabricator.wikimedia.org/T405239) (owner: 
10Kosta Harlan) [15:56:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192923 (owner: 10Kosta Harlan) [15:56:55] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [15:57:07] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [16:04:27] (03PS1) 10Federico Ceratto: site.pp: Configure es2051 mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/1192927 (https://phabricator.wikimedia.org/T402859) [16:04:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs1001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:06:45] (03Merged) 10jenkins-bot: SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192922 (owner: 10Kosta Harlan) [16:07:44] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [16:07:52] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [16:09:45] (03Merged) 10jenkins-bot: SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter [extensions/ConfirmEdit] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192923 (owner: 10Kosta Harlan) [16:09:46] (03Merged) 10jenkins-bot: CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192925 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [16:09:47] (03Merged) 10jenkins-bot: 
CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192924 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [16:10:26] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192922|SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter]], [[gerrit:1192924|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192925|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192923|Simp [16:10:26] leCaptcha::canSkipCaptcha: Remove unneeded Config parameter]] [16:10:31] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [16:11:09] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on alert1002 is OK: (C)0 le (W)100 le 134.8 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [16:11:21] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on alert1002 is OK: (C)0 le (W)100 le 130.3 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [16:14:46] (03PS3) 10Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 [16:15:12] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [16:15:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: 
wdqs1025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:16:03] FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:17:07] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192922|SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter]], [[gerrit:1192924|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192925|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192923|SimpleCaptcha::canSk [16:17:07] ipCaptcha: Remove unneeded Config parameter]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:17:11] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [16:19:11] !log kharlan@deploy2002 kharlan: Continuing with sync [16:19:38] elukey@cumin2002 upgrade-firmware (PID 503788) is awaiting input [16:20:58] FIRING: [12x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:21:02] (03CR) 10CI reject: [V:04-1] WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 (owner: 10Elukey) [16:21:45] (03CR) 10BCornwall: [C:03+2] varnish: Enable unified mobile routing on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1192265 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [16:22:45] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1792406616 and 101 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:23:35] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192922|SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter]], [[gerrit:1192924|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192925|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192923|Sim [16:23:35] pleCaptcha::canSkipCaptcha: Remove unneeded Config parameter]] (duration: 13m 08s) [16:23:39] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [16:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:24:58] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs1001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:25:58] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:26:03] FIRING: [12x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:27:45] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 15000 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:30:58] RESOLVED: [12x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:31:08] (03PS1) 10JHathaway: acme_chief: delete unused files 
on passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/1192934 (https://phabricator.wikimedia.org/T401858) [16:31:56] (03CR) 10BCornwall: [V:03+1 C:03+2] "Tests are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1192265 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [16:33:02] !log swfrench@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [16:34:42] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [16:35:55] (03PS4) 10Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 [16:39:06] !log swfrench@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [16:42:34] (03CR) 10CI reject: [V:04-1] WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 (owner: 10Elukey) [16:45:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11234044 (10Dzahn) 05In progress→03Stalled stalled on manager approval. [16:46:15] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11234086 (10Dzahn) 05In progress→03Stalled stalled on manager approval [16:49:46] (03CR) 10Hnowlan: [C:03+1] vo-escalate: absent timer [puppet] - 10https://gerrit.wikimedia.org/r/1192610 (owner: 10Herron) [16:51:53] 06SRE, 10SRE-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11234167 (10Dzahn) @Maria_Lechner_WMDE Please send an email to Katie Francis of Legal (https://meta.wikimedia.org/wiki/User:KFrancis_(WMF)) to get the NDA signing process started. 
Once... [16:52:33] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11234168 (10Dzahn) [16:53:20] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11234182 (10Dzahn) (this should be like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191507/4/modules/admin/data/data.yaml) [16:55:15] !incidents [16:55:15] 6818 (ACKED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Test page [16:55:15] 6815 (RESOLVED) [25x] ProbeDown sre (ip4 probes/service eqiad) [16:55:16] 6817 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [16:55:16] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN [16:55:16] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [16:55:16] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [16:55:16] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [16:55:17] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [16:55:17] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [16:55:18] 6807 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [16:55:29] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11234186 (10Dzahn) a:03thcipriani Hi Tyler, there is a request for the "restricted" group here. They want to run maintenance scripts on the deployment server. 
Details at T405796#11221398 [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1700) [17:05:31] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11234223 (10thcipriani) >>! In T405796#11234186, @Dzahn wrote: > Hi Tyler, there is a request for the "restricted" group here. They want to run maintenance scripts on the deployment server... [17:07:10] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11234225 (10Dzahn) @RobH What's your preferred way to schedule this? Want to let me know which slots work for you? Or should we just suggest something vi... [17:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:10:20] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11234242 (10Dzahn) Thank you, Tyler. @FCeratto-WMF This can continue with the "verify SSH key out of band" check box. I am not 100% sure if we also need approval from Nasma Ahmed in th... [17:10:30] 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11234243 (10Dzahn) a:05thcipriani→03None [17:11:19] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11234244 (10RobH) So we won't be ready to move forward on this until after Oct 15th (our deadline for installing the new switches) but afterwards. If you... 
[17:19:36] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11234267 (10Krinkle) [17:24:51] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11234302 (10Dzahn) @RobH We can indeed do the gitlab-runners first and separate them. Let's do that. I am suggesting October 16th, 15:00 - 15:30 UTC, con... [17:29:04] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [17:31:56] (03PS1) 10Ssingh: team-sre: cdn: add wdqs-main.discovery.wmnet to ignored backends [alerts] - 10https://gerrit.wikimedia.org/r/1192940 (https://phabricator.wikimedia.org/T406141) [17:35:00] (03CR) 10Dzahn: [C:03+2] zuul: let zuul-scheduler also reach zookeeper outside container [puppet] - 10https://gerrit.wikimedia.org/r/1192615 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn) [17:39:18] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11234421 (10RobH) I am unable to add folks to the gcal entry, but can you add both John and Valerie to the gcal event so they are aware of the window? Jo... 
[17:40:12] (03PS2) 10Ssingh: team-sre: cdn: ignore wdqs-main.discovery.wmnet in ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1192940 (https://phabricator.wikimedia.org/T406141) [17:44:52] FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:45:41] 06SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166 (10ssingh) 03NEW [17:45:58] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166#11234464 (10ssingh) [17:46:21] 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406167 (10ssingh) 03NEW [17:47:33] 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406167#11234481 (10ssingh) [17:48:07] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166#11234487 (10ssingh) [17:48:25] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: codfw: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406167#11234491 (10ssingh) [18:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1800) [18:00:09] o/ [18:01:55] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:06:40] (03PS1) 
10TrainBranchBot: group1 to 1.45.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192947 (https://phabricator.wikimedia.org/T405677) [18:06:43] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192947 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot) [18:07:31] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192947 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot) [18:07:36] (03CR) 10Ottomata: "Thank you! <3" [puppet] - 10https://gerrit.wikimedia.org/r/1192914 (https://phabricator.wikimedia.org/T304373) (owner: 10Cwhite) [18:09:01] brennen: can you please let me know when you're done, as I'd like to deploy a config patch before the UTC late window, if possible [18:09:39] kostajh: will do. [18:10:15] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11234595 (10DSantamaria) Approved! [18:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [18:18:22] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.21 refs T405677 [18:18:29] T405677: 1.45.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T405677 [18:21:16] (03PS1) 10Ottomata: EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192950 (https://phabricator.wikimedia.org/T304373) [18:21:48] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11234616 (10WMDE-leszek) > But they are saying no ssh needed.. 
so we can safely assume they mean the lowest of the 3 levels. This is indeed what we want here.... [18:22:42] kostajh: let's give it a couple of minutes to settle and then i'd say go ahead w/your config deploy. [18:22:48] (03PS2) 10Ottomata: EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192950 (https://phabricator.wikimedia.org/T304373) [18:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:31:14] kostajh: things looking ok from my end, all yours. [18:36:27] 06SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11234672 (10Novem_Linguae) I added "level 1", "level 2", "level 3" to the doc page at https://wikitech.wikimedia.org/w/index.php?title=Data_Platform/Data_access&diff=prev&oldid=2347634. Let's see if... [18:42:08] brennen: thanks! 
[18:45:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [18:46:11] (03Merged) 10jenkins-bot: hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [18:46:45] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1190992|hCaptcha: Enable A/B test for frwiki (T405239)]] [18:46:52] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [18:53:03] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1190992|hCaptcha: Enable A/B test for frwiki (T405239)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:53:10] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [18:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:58:35] (03CR) 10MusikAnimal: "One thing I forgot to mention at T402967 is we need to make bureaucrats and Community Wishlist managers themselves capable of assigning `c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling) [18:59:03] FIRING: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:01:29] (03CR) 10MusikAnimal: "Also, maybe we should *remove* `manage-wishlist` from the `sysop` group (via configuration)? 
That was put there as it's a sensible thing t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling) [19:06:48] still testing [19:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:08:48] !log kharlan@deploy2002 kharlan: Continuing with sync [19:09:15] (03PS1) 10Scott French: mw-*: Tune 8.3 releases to prevent deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192954 (https://phabricator.wikimedia.org/T405955) [19:09:49] (03PS6) 10Krinkle: varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510) [19:10:31] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:10:37] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:10:45] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:10:53] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:10:58] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:11:04] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:11:09] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:13:09] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1190992|hCaptcha: Enable A/B test for frwiki (T405239)]] (duration: 26m 24s) [19:13:17] T405239: hCaptcha: Enable A/B test for frwiki - 
https://phabricator.wikimedia.org/T405239 [19:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 20.35% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:17:43] (03CR) 10Jon Harald Søby: [C:04-1] "According to the bug, they want to change the portal talk namespace from "Portal vaten" to "Werênayışê portali". That change still needs t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [19:19:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:21:22] (03PS1) 10Scott French: P:conftool::hiddenparma: enable known_client_expression_validation [puppet] - 10https://gerrit.wikimedia.org/r/1192620 (https://phabricator.wikimedia.org/T403220) [19:25:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11234810 (10Dzahn) 05Stalled→03In progress [19:26:00] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11234811 (10Dzahn) a:05DSantamaria→03None Thanks. 
Checking the approval box and setting to "in progress":) [19:26:12] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11234815 (10Dzahn) [19:28:07] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11234823 (10Dzahn) Done! Added both in gcal just now. [19:28:55] (03PS1) 10Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727) [19:30:27] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [19:30:37] 10ops-codfw, 06SRE, 06DC-Ops: Check list of PXE miss-configs for codfw - https://phabricator.wikimedia.org/T401442#11234832 (10Jhancock.wm) 05Open→03Resolved [19:32:18] (03PS1) 10Herron: vopsbot: switch rotation for 247 oncall [puppet] - 10https://gerrit.wikimedia.org/r/1192957 [19:38:19] (03PS5) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510) [19:38:25] (03PS4) 10Krinkle: varnish: Enable unified mobile routing on en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192272 (https://phabricator.wikimedia.org/T403510) [19:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:41:23] (03CR) 10Scott French: [C:03+1] Add traindev-staging environment for mw-web and mw-debug (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) (owner: 10Ahmon Dancy) [19:41:40] FIRING: [28x] 
SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:43:48] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11234873 (10Jclark-ctr) @Ladsgroup Drive sdb will have to added to md0 https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions ` jclark@dbproxy1024:~$ cat /proc/mdstat... [19:44:54] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11234877 (10Jclark-ctr) @Ladsgroup Drive sdb will have to added to md0 https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions ` jclark@dbproxy1024:~$ cat /proc/mdstat... [19:49:56] !log cloud [19:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:09] I did not mean to do that :) [19:58:17] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11234914 (10Jclark-ctr) drive listed as online in idrac and part of raid 10 [19:58:35] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11234915 (10Jclark-ctr) 05Open→03Resolved [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T2000). [20:00:05] danisztls, xSavitar, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:22] o/ [20:01:17] I'll self-deploy my patches today [20:02:27] Does danisztls happen to be around? 
:) [20:04:20] I'll go ahead and when they come online, they can deploy right after me. [20:05:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192884 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [20:05:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192885 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [20:12:21] 06SRE, 10DNS, 06Traffic, 10wikimediafoundation.org, 07IPv6: wikimediafoundation.org does not support IPv6 - https://phabricator.wikimedia.org/T403269#11234952 (10BCornwall) 05Open→03In progress p:05Triage→03Low a:03BCornwall I've contacted @Varnent to try and get the right IPv6 address. [20:13:45] (03PS3) 10Cappybaraa: diqwiki: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) [20:16:47] (03PS4) 10Cappybaraa: diqwiki: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) [20:18:49] (03Merged) 10jenkins-bot: session: Handle an edge-case in MultiBackendSessionStore::set() [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192884 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [20:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:20:32] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11234974 (10Reedy) Yeah... 
https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ is stil... [20:20:53] (03Merged) 10jenkins-bot: session: Handle an edge-case in MultiBackendSessionStore::set() [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192885 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [20:21:31] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1192884|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]], [[gerrit:1192885|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]] [20:21:37] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [20:27:48] !log derick@deploy2002 derick, d3r1ck01: Backport for [[gerrit:1192884|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]], [[gerrit:1192885|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:28:07] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [20:28:29] * xSavitar testing... [20:29:56] All looks good. [20:30:07] !log derick@deploy2002 derick, d3r1ck01: Continuing with sync [20:31:15] I have an UBN I'd like to deploy at the end of the window, fyi [20:31:39] sorry, I'm late [20:32:00] I will deploy mine after yall [20:32:56] danisztls, sure! Will signal you once I'm done. [20:33:05] arlolra_ Ack! 
[20:33:37] danisztls, almost done syncing backports then I'll deploy a config patch next (which should be faster I hope) [20:33:50] (03PS1) 10Arlolra: Revert "Add parsoid support in ProofreadPage extension" [extensions/ProofreadPage] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192971 [20:34:28] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192884|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]], [[gerrit:1192885|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]] (duration: 12m 57s) [20:34:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ProofreadPage] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192971 (owner: 10Arlolra) [20:34:34] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [20:34:56] xSavitar: thanks! I just found out that the team want to postpone the deployment though. [20:35:13] danisztls: Okay [20:35:37] (03CR) 10C. 
Scott Ananian: [C:03+1] Revert "Add parsoid support in ProofreadPage extension" [extensions/ProofreadPage] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192971 (owner: 10Arlolra) [20:36:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192632 (owner: 10D3r1ck01) [20:37:54] (03Merged) 10jenkins-bot: Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192632 (owner: 10D3r1ck01) [20:38:27] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1192632|Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis"]] [20:43:31] (03CR) 10Btullis: [C:03+2] Replace old ingestion wiki list file with new autoupdated file [puppet] - 10https://gerrit.wikimedia.org/r/1191750 (owner: 10Snwachukwu) [20:44:10] (03CR) 10LWatson: "Are you referring specifically to the "MediaWiki train"? I ask because there's a Web team deployment included in the schedule https://wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [20:44:51] !log derick@deploy2002 d3r1ck01, derick: Backport for [[gerrit:1192632|Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:44:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:45:22] * xSavitar testing... 
[20:46:25] FIRING: [29x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:46:45] All seems fine. Syncing [20:46:53] !log derick@deploy2002 d3r1ck01, derick: Continuing with sync [20:46:54] (03CR) 10LWatson: "Is there another way to verify how many trains have passed that the extension was included in like a phab tag?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [20:47:18] We will monitor Grafana and logstash shortly after just in case. Cc tgr_ [20:47:30] (03PS1) 10MusikAnimal: metawiki: Configure permissions for CommunityRequests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192972 (https://phabricator.wikimedia.org/T402967) [20:48:24] (03CR) 10MusikAnimal: "I've submitted I2d282523aab1 for the above." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling) [20:48:33] (03CR) 10LWatson: "Disregard, I see a note about this in the task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [20:49:09] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-09-24-083919 to 2025-09-30-194529 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192973 (https://phabricator.wikimedia.org/T378558) [20:49:25] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-09-24-180530 to 2025-09-25-181720 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192974 (https://phabricator.wikimedia.org/T378558) [20:51:13] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192632|Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis"]] (duration: 12m 46s) [20:52:32] arlolra_, do you want to take over? [20:52:36] I'm done deploying [20:52:40] sure, thanks [20:52:45] yw! [20:53:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [extensions/ProofreadPage] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192971 (owner: 10Arlolra) [20:54:43] (03Merged) 10jenkins-bot: Revert "Add parsoid support in ProofreadPage extension" [extensions/ProofreadPage] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192971 (owner: 10Arlolra) [20:55:17] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1192971|Revert "Add parsoid support in ProofreadPage extension"]] [20:57:04] anyone else in this deployment queue after arlolra_ ? [20:58:41] TimStarling, no afaik. [20:59:04] danisztls mentioned they've postponed their deply. 
[20:59:19] s/deply/deployment [20:59:35] !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1192971|Revert "Add parsoid support in ProofreadPage extension"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T2100) [21:00:47] !log arlolra@deploy2002 arlolra: Continuing with sync [21:01:08] TimStarling: We're deploying in our window, but at first just services. [21:01:27] (03CR) 10Tim Starling: [C:03+2] Configure CommunityRequests virtual domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192669 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling) [21:02:06] I just need that config change, I'm merging it to be on the safe side, pretty sure to go out if I merge it [21:02:30] Oh, sure, no rush at our end, as long as we can deploy in ~ 30 mins. :-) [21:02:53] (03Merged) 10jenkins-bot: Configure CommunityRequests virtual domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192669 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling) [21:03:28] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2025-09-24-083919 to 2025-09-30-194529 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192973 (https://phabricator.wikimedia.org/T378558) (owner: 10Jforrester) [21:04:39] OK so there was a backport window just ending? 
I have the "WMF Deployments" google calendar but I guess it's out of date [21:05:04] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192971|Revert "Add parsoid support in ProofreadPage extension"]] (duration: 09m 47s) [21:05:13] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-09-24-083919 to 2025-09-30-194529 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192973 (https://phabricator.wikimedia.org/T378558) (owner: 10Jforrester) [21:05:13] TimStarling: Yeah, the https://wikitech.wikimedia.org/wiki/Deployments page is the source of truth. I didn't know there was a GCal form of it. [21:05:20] It sounds like it's out of date. [21:06:01] (03PS1) 10Bking: dse-k8s-eqiad: explicitly set quotas for opensearch-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192978 (https://phabricator.wikimedia.org/T397246) [21:06:08] the Google Calendar is linked from that page, in the grey box [21:06:30] Huh, you're right. Banner blindness is amazing. 
[21:06:31] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:07:29] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:07:52] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1192669|Configure CommunityRequests virtual domain (T402967)]] [21:07:59] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967 [21:09:02] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:09:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:10:06] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:10:35] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1192669|Configure CommunityRequests virtual domain (T402967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:10:44] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:11:01] !log tstarling@deploy2002 tstarling: Continuing with sync [21:11:40] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:12:28] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-09-24-180530 to 2025-09-25-181720 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192974 (https://phabricator.wikimedia.org/T378558) (owner: 10Jforrester) [21:14:19] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-09-24-180530 to 2025-09-25-181720 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192974 (https://phabricator.wikimedia.org/T378558) (owner: 10Jforrester) [21:15:18] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:15:28] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192669|Configure CommunityRequests virtual domain (T402967)]] (duration: 07m 36s) [21:15:34] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967 [21:15:46] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:16:42] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:17:18] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:17:42] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:18:17] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:20:36] (03PS7) 10Jforrester: Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172048 (https://phabricator.wikimedia.org/T397401) [21:20:49] TimStarling: All done? I can wait. [21:24:01] yes all done for now [21:24:05] Thanks! 
[21:24:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172048 (https://phabricator.wikimedia.org/T397401) (owner: 10Jforrester) [21:25:09] (03Merged) 10jenkins-bot: Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172048 (https://phabricator.wikimedia.org/T397401) (owner: 10Jforrester) [21:25:41] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1172048|Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator (T397401 T401682)]] [21:25:49] T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. - https://phabricator.wikimedia.org/T397401 [21:25:50] T401682: Wikimania in-person request: Enable Wikifunctions client mode on the Wikimedia Incubator - https://phabricator.wikimedia.org/T401682 [21:30:00] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1172048|Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator (T397401 T401682)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:30:49] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: explicitly set quotas for opensearch-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192978 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:30:49] (03CR) 10LWatson: [C:03+1] "I reviewed the code and verified that two deployment trains have passed: `wmf/1.45.0-wmf.20` and `mw1.45.0-wmf.21`." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [21:31:02] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192978 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:31:03] !log jforrester@deploy2002 jforrester: Continuing with sync [21:32:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:34:54] (03CR) 10LWatson: Deploy ReaderExperiments to Beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [21:35:20] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1172048|Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator (T397401 T401682)]] (duration: 09m 39s) [21:35:28] T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. - https://phabricator.wikimedia.org/T397401 [21:35:29] T401682: Wikimania in-person request: Enable Wikifunctions client mode on the Wikimedia Incubator - https://phabricator.wikimedia.org/T401682 [21:36:58] !log ryankemper@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[21:37:34] (03CR) 10LWatson: [C:03+1] Enable ReaderExperiments on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189293 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [21:37:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:38:06] (03CR) 10Jforrester: [C:03+1] "Confirmed that the ReaderExperiments repo is cloned and live in wmf.20 and wmf.21 in production, so this is safe to merge as-is now and wo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [21:38:22] (03Merged) 10jenkins-bot: dse-k8s-eqiad: explicitly set quotas for opensearch-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192978 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:39:28] !log ryankemper@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [21:40:06] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [21:40:13] (03CR) 10LWatson: [C:03+1] "Looks good based on the example given https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment#Deploy_to_Beta_Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [21:40:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[21:40:41] (03CR) 10LWatson: [C:03+1] Load ReaderExperiments extension in CommonSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189294 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [21:44:52] FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:44:55] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:46:22] (03PS3) 10MusikAnimal: Enable CommunityRequests on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling) [21:46:25] FIRING: [29x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:48:13] (03CR) 10Ladsgroup: [C:03+2] site.pp: Configure es2051 mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/1192927 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [21:48:18] (03CR) 10Ladsgroup: [C:03+1] site.pp: Configure es2051 mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/1192927 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [21:49:12] (03CR) 10Bvibber: Deploy ReaderExperiments to Beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [21:53:38] (03PS2) 10MusikAnimal: metawiki: Configure permissions for CommunityRequests [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1192972 (https://phabricator.wikimedia.org/T402967) [21:55:42] Everything look clear for a small config deploy? Got setup ReaderExperiments on Beta ready to roll :D [21:55:49] I can spiderpig it up :D [21:56:05] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1018.eqiad.wmnet with OS bullseye [21:56:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling) [21:56:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192972 (https://phabricator.wikimedia.org/T402967) (owner: 10MusikAnimal) [21:56:23] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2017.codfw.wmnet with OS bullseye [21:56:57] (03Merged) 10jenkins-bot: Enable CommunityRequests on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling) [21:57:02] (03Merged) 10jenkins-bot: metawiki: Configure permissions for CommunityRequests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192972 (https://phabricator.wikimedia.org/T402967) (owner: 10MusikAnimal) [21:57:38] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1192663|Enable CommunityRequests on metawiki (T402967)]], [[gerrit:1192972|metawiki: Configure permissions for CommunityRequests (T402967)]] [21:57:44] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967 [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T2200) [22:01:55] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:02:50] !log tstarling@deploy2002 musikanimal, tstarling: Backport for [[gerrit:1192663|Enable CommunityRequests on metawiki (T402967)]], [[gerrit:1192972|metawiki: Configure permissions for CommunityRequests (T402967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:02:57] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967 [22:04:00] !log tstarling@deploy2002 musikanimal, tstarling: Continuing with sync [22:08:21] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192663|Enable CommunityRequests on metawiki (T402967)]], [[gerrit:1192972|metawiki: Configure permissions for CommunityRequests (T402967)]] (duration: 10m 42s) [22:08:27] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967 [22:10:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [22:10:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [22:10:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189293 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [22:10:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189294 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [22:11:17] (03Merged) 10jenkins-bot: Add ReaderExperiments extension 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [22:11:22] (03Merged) 10jenkins-bot: Deploy ReaderExperiments to Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [22:11:25] (03Merged) 10jenkins-bot: Enable ReaderExperiments on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189293 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [22:11:28] (03Merged) 10jenkins-bot: Load ReaderExperiments extension in CommonSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189294 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [22:12:02] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1189281|Add ReaderExperiments extension (T404398)]], [[gerrit:1189288|Deploy ReaderExperiments to Beta cluster (T404398)]], [[gerrit:1189293|Enable ReaderExperiments on Beta (T404398)]], [[gerrit:1189294|Load ReaderExperiments extension in CommonSettings-labs.php (T404398)]] [22:12:08] T404398: Image Browsing: Deploy the prototype to Beta - https://phabricator.wikimedia.org/T404398 [22:13:06] !log migrating wishes to CommunityRequests with migrateFromGadget.php [22:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [22:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:38:07] PROBLEM - Ubuntu mirror in sync with upstream on 
mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [22:39:42] !log bvibber@deploy2002 egardner, bvibber: Backport for [[gerrit:1189281|Add ReaderExperiments extension (T404398)]], [[gerrit:1189288|Deploy ReaderExperiments to Beta cluster (T404398)]], [[gerrit:1189293|Enable ReaderExperiments on Beta (T404398)]], [[gerrit:1189294|Load ReaderExperiments extension in CommonSettings-labs.php (T404398)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes [22:39:42] can now be verified there. [22:39:48] T404398: Image Browsing: Deploy the prototype to Beta - https://phabricator.wikimedia.org/T404398 [22:40:03] !log bvibber@deploy2002 egardner, bvibber: Continuing with sync [22:43:33] (03CR) 10LWatson: [C:03+1] Deploy ReaderExperiments to Beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner) [22:47:48] ryankemper@cumin2002 reimage (PID 671971) is awaiting input [22:48:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [22:52:35] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1189281|Add ReaderExperiments extension (T404398)]], [[gerrit:1189288|Deploy ReaderExperiments to Beta cluster (T404398)]], [[gerrit:1189293|Enable ReaderExperiments on Beta (T404398)]], [[gerrit:1189294|Load ReaderExperiments extension in CommonSettings-labs.php (T404398)]] (duration: 40m 32s) [22:52:42] T404398: Image Browsing: Deploy the prototype to Beta - https://phabricator.wikimedia.org/T404398 [22:53:36] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. 
[22:54:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [22:54:42] whew that took a long time. localization cache update :D [22:57:20] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Maria Lechner WMDE - https://phabricator.wikimedia.org/T406106#11235324 (10KFrancis) Hi all, may I move forward with processing the NDA? [22:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:03] FIRING: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:06:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:09:19] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Maria Lechner WMDE - https://phabricator.wikimedia.org/T406106#11235332 (10Dzahn) This ticket is mostly a duplicate of T405917 now. (but don't worry about it too much, not a big deal, it is being handled either way) What is actually needed here:... 
[23:11:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:14:11] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1018.eqiad.wmnet with OS bullseye [23:15:25] (03PS1) 10Dzahn: wikistats: move backup dir out of git repo path [puppet] - 10https://gerrit.wikimedia.org/r/1192985 (https://phabricator.wikimedia.org/T401859) [23:16:20] (03CR) 10Dzahn: [C:03+2] wikistats: move backup dir out of git repo path [puppet] - 10https://gerrit.wikimedia.org/r/1192985 (https://phabricator.wikimedia.org/T401859) (owner: 10Dzahn) [23:16:39] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2017.codfw.wmnet with OS bullseye [23:21:44] (03PS1) 10Dzahn: wikistats: use wmflib::debian_php_version to pick PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1192986 [23:22:26] (03CR) 10Dzahn: [C:03+2] wikistats: use wmflib::debian_php_version to pick PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1192986 (owner: 10Dzahn) [23:24:08] (03PS1) 10Bking: dse-k8s-eqiad: bump up minimum pod resources for opensearch-test ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192987 (https://phabricator.wikimedia.org/T397246) [23:25:13] (03PS2) 10Bking: dse-k8s-eqiad: bump up minimum pod resources for opensearch-test ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192987 (https://phabricator.wikimedia.org/T397246) [23:26:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:31:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:33:41] (03PS1) 10Dzahn: wikistats: do not ensure dir that is already used with git::clone [puppet] - 10https://gerrit.wikimedia.org/r/1192989 (https://phabricator.wikimedia.org/T401859) [23:34:35] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Maria Lechner WMDE - https://phabricator.wikimedia.org/T406106#11235387 (10KFrancis) Thank you, @Aklapper. The agreement has been sent out for signatures. I'll confirm when it's complete. [23:36:30] (03PS2) 10Dzahn: wikistats: do not ensure dir that is already used with git::clone [puppet] - 10https://gerrit.wikimedia.org/r/1192989 (https://phabricator.wikimedia.org/T401859) [23:36:40] (03CR) 10Dzahn: [C:03+2] wikistats: do not ensure dir that is already used with git::clone [puppet] - 10https://gerrit.wikimedia.org/r/1192989 (https://phabricator.wikimedia.org/T401859) (owner: 10Dzahn) [23:38:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1192990 [23:38:04] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1192990 (owner: 10TrainBranchBot) [23:38:33] (03CR) 10Pppery: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [23:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:53:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1192990 (owner: 10TrainBranchBot) [23:59:50] (03PS1) 10MusikAnimal: Increase timeout for MessageIndex lock [extensions/Translate] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192992 
(https://phabricator.wikimedia.org/T402967)