[00:00:15] <logmsgbot>	 !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192278|Disable wmgUseMdotRouting on Wikidata (T403510)]] (duration: 13m 23s)
[00:00:22] <stashbot>	 T403510: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510
[00:06:25] <jinxer-wm>	 FIRING: [20x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:31] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1192646
[00:08:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1192646 (owner: TrainBranchBot)
[00:09:22] <wikibugs>	 (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1192647
[00:09:26] <wikibugs>	 (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1192648
[00:09:30] <wikibugs>	 (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1192649
[00:29:48] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1192646 (owner: TrainBranchBot)
[00:33:18] <wikibugs>	 (PS2) Krinkle: Disable wmgUseMdotRouting on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1192279 (https://phabricator.wikimedia.org/T403510)
[00:37:55] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1007.eqiad.wmnet with OS bookworm
[00:38:08] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11231724 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm executed with errors: - dbprov1007...
[00:48:53] <wikibugs>	 (PS1) MusikAnimal: migrateFromGadget: add a few more missing transformations [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192652 (https://phabricator.wikimedia.org/T405826)
[00:49:29] <wikibugs>	 (PS1) Krinkle: Disable wmgUseMdotRouting on frwiki and dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1192653 (https://phabricator.wikimedia.org/T403510)
[00:56:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:44] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:59:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192652 (https://phabricator.wikimedia.org/T405826) (owner: MusikAnimal)
[01:00:16] <wikibugs>	 (Merged) jenkins-bot: migrateFromGadget: add a few more missing transformations [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192652 (https://phabricator.wikimedia.org/T405826) (owner: MusikAnimal)
[01:00:46] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:01:34] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54973 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:01:34] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9309 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:08:20] <musikanimal>	 so nice that SpiderPig caught the wmf/next image being built… but why was that not on the Deployments calendar? or am I blind
[01:12:05] <musikanimal>	 I guess I need to try again later. The patch was merged to wmf.21 but isn't deployed yet
[01:14:20] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 33s)
[01:17:23] <logmsgbot>	 !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1192652|migrateFromGadget: add a few more missing transformations (T405826 T404138 T404234)]]
[01:17:32] <stashbot>	 T405826: Migration script issues - https://phabricator.wikimedia.org/T405826
[01:17:32] <stashbot>	 T404138: Update migration script to map projects to tags - https://phabricator.wikimedia.org/T404138
[01:17:33] <stashbot>	 T404234: <languages /> repeated twice if it was already part of the wikitext - https://phabricator.wikimedia.org/T404234
[01:22:35] <logmsgbot>	 !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1192652|migrateFromGadget: add a few more missing transformations (T405826 T404138 T404234)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:22:43] <stashbot>	 T405826: Migration script issues - https://phabricator.wikimedia.org/T405826
[01:22:44] <stashbot>	 T404138: Update migration script to map projects to tags - https://phabricator.wikimedia.org/T404138
[01:22:45] <stashbot>	 T404234: <languages /> repeated twice if it was already part of the wikitext - https://phabricator.wikimedia.org/T404234
[01:23:21] <logmsgbot>	 !log musikanimal@deploy2002 musikanimal: Continuing with sync
[01:25:31] <musikanimal>	 I realize now I should also check the SAL to see if something is in progress
[01:28:16] <logmsgbot>	 !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192652|migrateFromGadget: add a few more missing transformations (T405826 T404138 T404234)]] (duration: 10m 53s)
[01:28:27] <stashbot>	 T405826: Migration script issues - https://phabricator.wikimedia.org/T405826
[01:28:28] <stashbot>	 T404138: Update migration script to map projects to tags - https://phabricator.wikimedia.org/T404138
[01:28:29] <stashbot>	 T404234: <languages /> repeated twice if it was already part of the wikitext - https://phabricator.wikimedia.org/T404234
[01:29:12] <wikibugs>	 (PS1) MusikAnimal: Call WikiPage::doPurge to try and clear cache after language is set [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192655 (https://phabricator.wikimedia.org/T404748)
[01:39:20] <wikibugs>	 (CR) Pppery: "I have no context for what is happening here hence no useful feedback to give." [puppet] - https://gerrit.wikimedia.org/r/1192648 (owner: Ncmonitor)
[01:39:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192655 (https://phabricator.wikimedia.org/T404748) (owner: MusikAnimal)
[01:40:59] <wikibugs>	 (Merged) jenkins-bot: Call WikiPage::doPurge to try and clear cache after language is set [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192655 (https://phabricator.wikimedia.org/T404748) (owner: MusikAnimal)
[01:41:35] <logmsgbot>	 !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1192655|Call WikiPage::doPurge to try and clear cache after language is set (T404748)]]
[01:41:39] <stashbot>	 T404748: Newly created wishes in non-English languages do not immediately render with correct RTL and localized labels until cache is purged - https://phabricator.wikimedia.org/T404748
[01:41:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:44:52] <jinxer-wm>	 FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:46:25] <jinxer-wm>	 FIRING: [21x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:46:43] <logmsgbot>	 !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1192655|Call WikiPage::doPurge to try and clear cache after language is set (T404748)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:46:44] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:46:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:46:46] <stashbot>	 T404748: Newly created wishes in non-English languages do not immediately render with correct RTL and localized labels until cache is purged - https://phabricator.wikimedia.org/T404748
[01:47:08] <logmsgbot>	 !log musikanimal@deploy2002 musikanimal: Continuing with sync
[01:47:58] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:52:21] <logmsgbot>	 !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192655|Call WikiPage::doPurge to try and clear cache after language is set (T404748)]] (duration: 10m 47s)
[01:52:26] <stashbot>	 T404748: Newly created wishes in non-English languages do not immediately render with correct RTL and localized labels until cache is purged - https://phabricator.wikimedia.org/T404748
[01:56:38] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54975 bytes in 3.396 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:56:38] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9310 bytes in 3.558 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:01:25] <jinxer-wm>	 FIRING: [22x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:05:57] <wikibugs>	 (PS1) MusikAnimal: AbstractRenderer: fix extistence dependency on Votes subpage [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192657
[02:14:52] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[02:17:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192657 (owner: MusikAnimal)
[02:18:41] <wikibugs>	 (Merged) jenkins-bot: AbstractRenderer: fix extistence dependency on Votes subpage [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192657 (owner: MusikAnimal)
[02:19:14] <logmsgbot>	 !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1192657|AbstractRenderer: fix extistence dependency on Votes subpage]]
[02:24:52] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:26:06] <logmsgbot>	 !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1192657|AbstractRenderer: fix extistence dependency on Votes subpage]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[02:26:30] <logmsgbot>	 !log musikanimal@deploy2002 musikanimal: Continuing with sync
[02:31:33] <logmsgbot>	 !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192657|AbstractRenderer: fix extistence dependency on Votes subpage]] (duration: 12m 19s)
[02:32:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:36:40] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:39:40] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:43:40] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:43:40] <wikibugs>	 (PS13) Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151)
[02:43:40] <wikibugs>	 (CR) Andrea Denisse: "Hi folks, I used the envoy_cluster_upstream_rq metric instead of envoy_cluster_upstream_rq_total mostly because the envoy_cluster_upstream" [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: Andrea Denisse)
[02:46:25] <jinxer-wm>	 FIRING: [22x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:47:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:50:38] <wikibugs>	 (PS14) Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151)
[02:51:33] <wikibugs>	 (CR) Andrea Denisse: "Unresolving for awareness." [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: Andrea Denisse)
[02:51:40] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:55:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.364s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:57:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:00:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.333s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:07:40] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:07:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:08:40] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:12:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:22:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:41:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for ebomani. - https://phabricator.wikimedia.org/T405124#11231902 (Dzahn) @EBomani  You can start by taking a look at [[ https://wikitech.wikimedia.org/wiki/Bastion  | the list of bastion hosts ]].  You can pick any of the bastion hosts listed...
[03:44:52] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:45:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:35:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:41:25] <jinxer-wm>	 FIRING: [22x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:54:27] <TimStarling>	 !log on x1 metawiki creating tables for CommunityRequests
[04:54:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:09:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:26:50] <wikibugs>	 (PS1) Tim Starling: Enable CommunityRequests on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967)
[05:31:25] <jinxer-wm>	 FIRING: [23x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:31:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:39:11] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:41:55] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:44:52] <jinxer-wm>	 FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T0600)
[06:05:52] <wikibugs>	 (PS1) Kosta Harlan: CreateAccount: Track interactions with the captchaWord field [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192666 (https://phabricator.wikimedia.org/T394744)
[06:06:05] <wikibugs>	 (PS1) Kosta Harlan: CreateAccount: Track interactions with the captchaWord field [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192667 (https://phabricator.wikimedia.org/T394744)
[06:06:13] <wikibugs>	 (PS1) Kosta Harlan: CreateAccount: Record the CAPTCHA class used in account creation funnel [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192668 (https://phabricator.wikimedia.org/T405239)
[06:06:22] <wikibugs>	 (PS2) Tim Starling: Enable CommunityRequests on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967)
[06:06:22] <wikibugs>	 (PS1) Tim Starling: Configure CommunityRequests virtual domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1192669 (https://phabricator.wikimedia.org/T402967)
[06:06:25] <wikibugs>	 (PS1) Kosta Harlan: CreateAccount: Record the CAPTCHA class used in account creation funnel [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192670 (https://phabricator.wikimedia.org/T405239)
[06:07:10] <kostajh>	 I'm going to backport some patches to wmf.20 and wmf.21
[06:08:42] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192670 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[06:08:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192667 (https://phabricator.wikimedia.org/T394744) (owner: Kosta Harlan)
[06:14:52] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[06:16:25] <jinxer-wm>	 FIRING: [24x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:16:47] <wikibugs>	 (Merged) jenkins-bot: CreateAccount: Record the CAPTCHA class used in account creation funnel [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192670 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[06:16:55] <wikibugs>	 (Merged) jenkins-bot: CreateAccount: Track interactions with the captchaWord field [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192667 (https://phabricator.wikimedia.org/T394744) (owner: Kosta Harlan)
[06:17:35] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192670|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]], [[gerrit:1192667|CreateAccount: Track interactions with the captchaWord field (T394744)]]
[06:17:43] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[06:17:44] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[06:21:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:21:44] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:22:36] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192670|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]], [[gerrit:1192667|CreateAccount: Track interactions with the captchaWord field (T394744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[06:22:43] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[06:22:44] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[06:23:03] <wikibugs>	 (CR) Samwilson: [C:+1] Configure CommunityRequests virtual domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1192669 (https://phabricator.wikimedia.org/T402967) (owner: Tim Starling)
[06:23:38] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[06:24:28] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.346 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:24:52] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:24:55] <wikibugs>	 (CR) Samwilson: [C:+1] Enable CommunityRequests on metawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: Tim Starling)
[06:26:44] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54975 bytes in 9.334 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:26:44] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9310 bytes in 9.494 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:28:38] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[06:28:40] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:29:28] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.346 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:31:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:35:04] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[06:37:40] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:40:09] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192670|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]], [[gerrit:1192667|CreateAccount: Track interactions with the captchaWord field (T394744)]] (duration: 22m 34s)
[06:40:15] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[06:40:16] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[06:41:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:43:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192666 (https://phabricator.wikimedia.org/T394744) (owner: Kosta Harlan)
[06:43:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192668 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[06:45:19] <wikibugs>	 (Merged) jenkins-bot: CreateAccount: Track interactions with the captchaWord field [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192666 (https://phabricator.wikimedia.org/T394744) (owner: Kosta Harlan)
[06:48:09] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified mobile routing on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1192265 (https://phabricator.wikimedia.org/T403510)
[06:48:10] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510)
[06:48:10] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510)
[06:48:10] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510)
[06:48:11] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified mobile routing on en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192272 (https://phabricator.wikimedia.org/T403510)
[06:48:18] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified mobile routing on de.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192267
[06:48:33] <wikibugs>	 (03Abandoned) 10Krinkle: varnish: Enable unified mobile routing on de.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192267 (owner: 10Krinkle)
[06:48:54] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified mobile routing on ru.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192269
[06:49:03] <wikibugs>	 (03Abandoned) 10Krinkle: varnish: Enable unified mobile routing on ru.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192269 (owner: 10Krinkle)
[06:49:13] <wikibugs>	 (03PS2) 10Krinkle: varnish: Enable unified mobile routing on ja.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192270
[06:49:15] <wikibugs>	 (03Abandoned) 10Krinkle: varnish: Enable unified mobile routing on ja.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192270 (owner: 10Krinkle)
[06:51:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:54:25] <wikibugs>	 (03PS3) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510)
[06:54:25] <wikibugs>	 (03PS3) 10Krinkle: varnish: Enable unified mobile routing on en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192272 (https://phabricator.wikimedia.org/T403510)
[06:54:36] <wikibugs>	 (03PS1) 10Bartosz Wójtowicz: ml-services: Update image version for articletopic model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192782 (https://phabricator.wikimedia.org/T371021)
[06:54:40] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:54:59] <wikibugs>	 (03Merged) 10jenkins-bot: CreateAccount: Record the CAPTCHA class used in account creation funnel [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192668 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[06:55:40] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:55:51] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192666|CreateAccount: Track interactions with the captchaWord field (T394744)]], [[gerrit:1192668|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]]
[06:55:59] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[06:56:00] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[06:56:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:26] <wikibugs>	 (03CR) 10Elukey: Wikifunctions SLO: Adjust upper bucket to 10.1s to cover slow reporting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192609 (https://phabricator.wikimedia.org/T394057) (owner: 10Jforrester)
[07:01:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:02:07] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192666|CreateAccount: Track interactions with the captchaWord field (T394744)]], [[gerrit:1192668|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:02:14] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[07:02:14] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[07:05:00] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[07:07:13] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Update eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto)
[07:10:00] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192666|CreateAccount: Track interactions with the captchaWord field (T394744)]], [[gerrit:1192668|CreateAccount: Record the CAPTCHA class used in account creation funnel (T405239)]] (duration: 14m 09s)
[07:10:06] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[07:10:08] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[07:11:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:12:25] <wikibugs>	 06SRE, 10DNS, 06Traffic, 06Traffic-Icebox, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11232084 (10Krinkle)
[07:12:25] <wikibugs>	 (03CR) 10Elukey: "@jforrester@wikimedia.org I see thanks for the explanation! So the le="10.1" bucket doesnt' exists for that metric, the only one that I se" [puppet] - 10https://gerrit.wikimedia.org/r/1192609 (https://phabricator.wikimedia.org/T394057) (owner: 10Jforrester)
[07:16:12] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: Update image version for articletopic model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192782 (https://phabricator.wikimedia.org/T371021) (owner: 10Bartosz Wójtowicz)
[07:22:23] <wikibugs>	 06SRE, 10DNS, 06Traffic, 06Traffic-Icebox, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11232096 (10Krinkle) 05Open→03Resolved a:03Krinkle I think we can call this done. All wiki listed here, plus a dozen more that I found, have been fixed so that...
[07:28:04] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update image version for articletopic model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192782 (https://phabricator.wikimedia.org/T371021) (owner: 10Bartosz Wójtowicz)
[07:29:18] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] remove mention of druid10[07-08] in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1192147 (https://phabricator.wikimedia.org/T403801) (owner: 10Stevemunene)
[07:29:37] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: offboarding bvershbow [puppet] - 10https://gerrit.wikimedia.org/r/1192815
[07:29:50] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Update image version for articletopic model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192782 (https://phabricator.wikimedia.org/T371021) (owner: 10Bartosz Wójtowicz)
[07:44:40] <wikibugs>	 (03PS3) 10Krinkle: varnish: Enable unified mobile routing on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1192265 (https://phabricator.wikimedia.org/T403510)
[07:44:52] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:45:32] <wikibugs>	 (03PS3) 10Krinkle: varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510)
[07:47:08] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: disable and remove partial backup [puppet] - 10https://gerrit.wikimedia.org/r/1192562 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[07:47:21] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[07:47:29] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:47:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T401906)', diff saved to https://phabricator.wikimedia.org/P83520 and previous config saved to /var/cache/conftool/dbconfig/20251001-074736-fceratto.json
[07:47:40] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[07:48:25] <wikibugs>	 (03PS4) 10Krinkle: varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510)
[07:48:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T401906)', diff saved to https://phabricator.wikimedia.org/P83521 and previous config saved to /var/cache/conftool/dbconfig/20251001-074850-fceratto.json
[07:52:17] <wikibugs>	 (03PS3) 10Krinkle: varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510)
[07:59:20] <wikibugs>	 (03PS2) 10Jelto: gitlab: remove packages from daily full backups [puppet] - 10https://gerrit.wikimedia.org/r/1192535 (https://phabricator.wikimedia.org/T378922)
[08:00:05] <jouncebot>	 hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T0800)
[08:03:22] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11232192 (10jcrespo) I've finished the install following the manual migration to puppet7 instructions show at T349619, but I wonder why it wanted to setup puppet 5 in the first pla...
[08:03:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P83522 and previous config saved to /var/cache/conftool/dbconfig/20251001-080357-fceratto.json
[08:08:59] <logmsgbot>	 !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[08:10:27] <Emperor>	 !log restart swift on ms-fe2012 T360913
[08:10:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:31] <stashbot>	 T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913
[08:13:27] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[08:16:32] <wikibugs>	 (03PS3) 10Jelto: gitlab: remove packages from daily full backups [puppet] - 10https://gerrit.wikimedia.org/r/1192535 (https://phabricator.wikimedia.org/T378922)
[08:16:35] <wikibugs>	 (03Abandoned) 10Slyngshede: Revert "P:puppetserver::volatile Include XCheeseScore private repo" [puppet] - 10https://gerrit.wikimedia.org/r/1191239 (owner: 10Slyngshede)
[08:19:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P83523 and previous config saved to /var/cache/conftool/dbconfig/20251001-081905-fceratto.json
[08:19:33] <hashar>	 I have blocked MediaWiki train https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/3GE63T2LXDKYKM24QO26I3O7FWGWCANF/
[08:19:46] <hashar>	 cause of some internal code needing adjustment for 1.44.0-wmf.21
[08:19:57] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[08:19:59] <hashar>	 err 1.45.0-wmf.21
[08:20:13] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab: remove packages from daily full backups [puppet] - 10https://gerrit.wikimedia.org/r/1192535 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[08:29:32] <wikibugs>	 (03PS3) 10Arnaudb: gerrit: fix allowlist for mod_qos [puppet] - 10https://gerrit.wikimedia.org/r/1192831 (https://phabricator.wikimedia.org/T406017)
[08:29:32] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] "pre-shoting the safety revert after this is submitted" [puppet] - 10https://gerrit.wikimedia.org/r/1192831 (https://phabricator.wikimedia.org/T406017) (owner: 10Arnaudb)
[08:30:02] <wikibugs>	 (03PS1) 10Arnaudb: Revert "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192835
[08:30:13] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11232273 (10jcrespo) 05Open→03Resolved Regarding dbprov1007, work is completed. I also removed dbprov1007 from old puppet master (5). But feel free to coordinate with Infra...
[08:31:09] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Set up a new mailing list for Wales - https://phabricator.wikimedia.org/T406101#11232275 (10Aklapper) @Gemma_Coleman: Hi! Per https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list, please see https://meta.wikimedia.org/wiki/Special:MyLanguage/Mailing_lists/Standardiz...
[08:32:25] <wikibugs>	 (03PS1) 10Btullis: Add the JupyterHub.template_paths value to the config file [puppet] - 10https://gerrit.wikimedia.org/r/1192836 (https://phabricator.wikimedia.org/T403863)
[08:33:41] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192835 (owner: 10Arnaudb)
[08:33:54] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7166/co" [puppet] - 10https://gerrit.wikimedia.org/r/1192836 (https://phabricator.wikimedia.org/T403863) (owner: 10Btullis)
[08:34:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T401906)', diff saved to https://phabricator.wikimedia.org/P83524 and previous config saved to /var/cache/conftool/dbconfig/20251001-083412-fceratto.json
[08:34:18] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[08:34:29] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[08:34:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T401906)', diff saved to https://phabricator.wikimedia.org/P83525 and previous config saved to /var/cache/conftool/dbconfig/20251001-083435-fceratto.json
[08:35:03] <wikibugs>	 (03PS1) 10Arnaudb: Revert^2 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192839
[08:35:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T401906)', diff saved to https://phabricator.wikimedia.org/P83526 and previous config saved to /var/cache/conftool/dbconfig/20251001-083549-fceratto.json
[08:39:44] <wikibugs>	 (03PS5) 10Krinkle: varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510)
[08:40:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[08:41:03] <wikibugs>	 (03PS6) 10Krinkle: varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510)
[08:41:20] <wikibugs>	 (03PS4) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510)
[08:41:50] <wikibugs>	 (03PS4) 10Krinkle: varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510)
[08:41:57] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Add the JupyterHub.template_paths value to the config file [puppet] - 10https://gerrit.wikimedia.org/r/1192836 (https://phabricator.wikimedia.org/T403863) (owner: 10Btullis)
[08:42:06] <wikibugs>	 (03PS2) 10Jcrespo: Revert "dbbackups: Partial revert of dbprov1007 setup, back to dbprov1003" [puppet] - 10https://gerrit.wikimedia.org/r/1191734
[08:42:17] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Set up a new mailing list for Wales - https://phabricator.wikimedia.org/T406101#11232302 (10Gemma_Coleman) Hrm I read that and clearly did not understand it then! Is wikimedia-CymruWales@lists.wikimedia.org ok then? However we aren't a separate chapter which is why I didn't pr...
[08:45:52] <wikibugs>	 (03CR) 10Jcrespo: Revert "dbbackups: Partial revert of dbprov1007 setup, back to dbprov1003" [puppet] - 10https://gerrit.wikimedia.org/r/1191734 (owner: 10Jcrespo)
[08:46:36] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] Revert "dbbackups: Partial revert of dbprov1007 setup, back to dbprov1003" [puppet] - 10https://gerrit.wikimedia.org/r/1191734 (owner: 10Jcrespo)
[08:48:14] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Update the libyaml-cpp version installed on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1192560 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis)
[08:48:29] <wikibugs>	 (03PS1) 10Btullis: Add missing s to the jupyterhub template_paths configuration [puppet] - 10https://gerrit.wikimedia.org/r/1192843 (https://phabricator.wikimedia.org/T403863)
[08:49:26] <wikibugs>	 (03PS2) 10Btullis: Add missing s to the jupyterhub template_paths configuration [puppet] - 10https://gerrit.wikimedia.org/r/1192843 (https://phabricator.wikimedia.org/T403863)
[08:50:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P83527 and previous config saved to /var/cache/conftool/dbconfig/20251001-085056-fceratto.json
[08:51:33] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add missing s to the jupyterhub template_paths configuration [puppet] - 10https://gerrit.wikimedia.org/r/1192843 (https://phabricator.wikimedia.org/T403863) (owner: 10Btullis)
[08:51:57] <wikibugs>	 (03PS3) 10Arnaudb: Revert^2 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192839 (https://phabricator.wikimedia.org/T406017)
[08:51:58] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] "same as the previous test, I'll issue a safety revert" [puppet] - 10https://gerrit.wikimedia.org/r/1192839 (https://phabricator.wikimedia.org/T406017) (owner: 10Arnaudb)
[08:54:46] <wikibugs>	 (03PS1) 10Arnaudb: Revert^3 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192844
[08:57:25] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
[08:57:26] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Update the libyaml-cpp version installed on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1192560 (https://phabricator.wikimedia.org/T402943) (owner: 10Btullis)
[08:57:42] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] "template rendering is still buggy" [puppet] - 10https://gerrit.wikimedia.org/r/1192844 (owner: 10Arnaudb)
[09:00:07] <wikibugs>	 (03PS1) 10Arnaudb: Revert^4 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192845
[09:00:16] <wikibugs>	 (03PS1) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[09:01:26] <wikibugs>	 (03CR) 10DCausse: [C:03+2] flink jobs: stop search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192590 (https://phabricator.wikimedia.org/T404605) (owner: 10DCausse)
[09:01:35] <dcausse>	 jouncebot: nowandnext
[09:01:35] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T0800)
[09:01:35] <jouncebot>	 In 0 hour(s) and 58 minute(s): eqiad Wikikube kubernetes upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1000)
[09:03:27] <wikibugs>	 (03PS1) 10Jcrespo: installserver: Prevent dbprov1007 & dbprov2006 from full reimage [puppet] - 10https://gerrit.wikimedia.org/r/1192848 (https://phabricator.wikimedia.org/T403166)
[09:03:45] <wikibugs>	 (03Merged) 10jenkins-bot: flink jobs: stop search & wdqs jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192590 (https://phabricator.wikimedia.org/T404605) (owner: 10DCausse)
[09:04:48] <wikibugs>	 (03PS2) 10Jcrespo: installserver: Prevent dbprov1007 & dbprov2006 from full reimage [puppet] - 10https://gerrit.wikimedia.org/r/1192848 (https://phabricator.wikimedia.org/T403166)
[09:06:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P83528 and previous config saved to /var/cache/conftool/dbconfig/20251001-090604-fceratto.json
[09:06:45] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:06:50] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:07:27] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[09:08:51] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] installserver: Prevent dbprov1007 & dbprov2006 from full reimage [puppet] - 10https://gerrit.wikimedia.org/r/1192848 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo)
[09:10:28] <wikibugs>	 (03PS2) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[09:11:49] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:12:11] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:13:56] <jinxer-wm>	 FIRING: CirrusConsumerCloudelasticFlinkJobNotRunning: ...
[09:14:01] <wikibugs>	 (03PS5) 10Krinkle: varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510)
[09:14:02] <jinxer-wm>	 cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning
[09:14:02] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:14:29] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:15:52] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto)
[09:16:38] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede)
[09:17:11] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[09:17:36] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:18:38] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11232436 (10elukey) @TheDJ Hi! Quick status update so you are up to speed if anything is raised from the community (thanks a lot for what you do!).  The Service Ops team is upgrading...
[09:18:53] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] data-platform: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/1182848 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli)
[09:19:35] <wikibugs>	 (03PS2) 10Arnaudb: Revert^4 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192845 (https://phabricator.wikimedia.org/T406017)
[09:19:35] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] "template rendering is OK" [puppet] - 10https://gerrit.wikimedia.org/r/1192845 (https://phabricator.wikimedia.org/T406017) (owner: 10Arnaudb)
[09:20:07] <wikibugs>	 (03PS1) 10Arnaudb: Revert^5 "gerrit: fix allowlist for mod_qos" [puppet] - 10https://gerrit.wikimedia.org/r/1192849
[09:20:32] <wikibugs>	 (03Merged) 10jenkins-bot: data-platform: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/1182848 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli)
[09:21:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T401906)', diff saved to https://phabricator.wikimedia.org/P83529 and previous config saved to /var/cache/conftool/dbconfig/20251001-092112-fceratto.json
[09:21:17] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[09:21:29] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[09:21:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T401906)', diff saved to https://phabricator.wikimedia.org/P83530 and previous config saved to /var/cache/conftool/dbconfig/20251001-092136-fceratto.json
[09:22:45] <jinxer-wm>	 FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning
[09:22:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T401906)', diff saved to https://phabricator.wikimedia.org/P83531 and previous config saved to /var/cache/conftool/dbconfig/20251001-092251-fceratto.json
[09:23:56] <jinxer-wm>	 FIRING: WcqsStreamingUpdaterFlinkJobNotRunning: WCQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[09:28:39] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[09:28:48] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:29:10] <jinxer-wm>	 FIRING: SLOMetricAbsent: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[09:30:56] <kostajh>	 hashar: I will deploy some wmf.20 / wmf.21 backports now, if you're not running the train now 
[09:31:04] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11232493 (10elukey) The not great news is that I see the following error in tegola's cronjobs:  ` Error: error seeding tile ({Z:15 X:19137 Y:4191}): ERROR: permission denied for table...
[09:31:11] <wikibugs>	 (03PS1) 10Kosta Harlan: CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192852 (https://phabricator.wikimedia.org/T405239)
[09:31:21] <wikibugs>	 (03PS1) 10Kosta Harlan: CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192853 (https://phabricator.wikimedia.org/T405239)
[09:32:19] <hashar>	 kostajh: sure, please do!
[09:32:37] <kostajh>	 thanks
[09:32:40] <wikibugs>	 (CR) Hashar: [C:+1] CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192852 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[09:32:45] <wikibugs>	 (CR) Hashar: [C:+1] CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192853 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[09:33:08] <kostajh>	 hashar: can I backport both wmf.20 and wmf.21 together via spiderpig, or do they need to go out one at a time? 
[09:33:24] <hashar>	 I am pretty sure you can do both at the same time
[09:33:30] <hashar>	 it should +2 both of them
[09:33:33] <kostajh>	 ok, I'll try 
[09:33:35] <hashar>	 update both branches on the deploy server
[09:33:51] <hashar>	 then `scap sync-world` which grabs the whole source tree
[09:33:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192853 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[09:33:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192852 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[09:34:04] <hashar>	 :]
[09:36:24] <wikibugs>	 (PS3) DCausse: flink jobs: resume search & wdqs jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1192591
[09:36:37] <wikibugs>	 SRE, Commons, TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11232520 (TheDJ) >>! In T405760#11231516, @Prototyperspective wrote: > * I also did an Internet speed test and it was as fast as it should be and again other sites like YouTube videos load...
[09:36:44] <wikibugs>	 (CR) DCausse: [C:-1] "needs to be merged after the k8s upgrade" [deployment-charts] - https://gerrit.wikimedia.org/r/1192591 (owner: DCausse)
[09:36:56] <jinxer-wm>	 FIRING: WdqsStreamingUpdaterFlinkJobNotRunning: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[09:37:13] <Lucas_WMDE>	 claime: any objections to adding something like “(no other deployments)” to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1000 so it’ll show up in `jouncebot: now`?
[09:37:16] <claime>	 hashar: kostajh, just a reminder that we'll start the k8s upgrade on wikikube eqiad in ~25 to 30 minutes so deployments will be suspended for the duration
[09:37:39] <claime>	 Lucas_WMDE: yes please <3
[09:37:41] <kostajh>	 claime: ok, I should be done by then
[09:37:44] <claime>	 I should have thought about that
[09:37:46] * Lucas_WMDE edits
[09:37:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P83532 and previous config saved to /var/cache/conftool/dbconfig/20251001-093758-fceratto.json
[09:38:22] <Lucas_WMDE>	 jouncebot: refresh
[09:38:23] <jouncebot>	 I refreshed my knowledge about deployments.
[09:38:24] <Lucas_WMDE>	 jouncebot: next
[09:38:24] <jouncebot>	 In 0 hour(s) and 21 minute(s): eqiad Wikikube kubernetes upgrade (no other deployments) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1000)
[09:38:25] <kostajh>	 I'll let you know when I'm finished
[09:38:26] <Lucas_WMDE>	 whee
[09:38:55] <icinga-wm>	 RECOVERY - snapshot of s3 in eqiad on backupmon1001 is OK: Last snapshot for s3 at eqiad (db1150) taken on 2025-10-01 08:10:03 (1160 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[09:39:19] <claime>	 kostajh: ty <3
[09:41:55] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:42:48] <wikibugs>	 (CR) Arnaudb: [C:+2] Revert^5 "gerrit: fix allowlist for mod_qos" [puppet] - https://gerrit.wikimedia.org/r/1192849 (owner: Arnaudb)
[09:42:54] <wikibugs>	 (Merged) jenkins-bot: CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192853 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[09:43:22] <wikibugs>	 (Merged) jenkins-bot: CreateAccount: Fix server side logging of CAPTCHA class [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192852 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan)
[09:43:28] <wikibugs>	 (PS1) Arnaudb: Revert^6 "gerrit: fix allowlist for mod_qos" [puppet] - https://gerrit.wikimedia.org/r/1192854
[09:43:36] <wikibugs>	 (PS1) Tiziano Fogli: zookeeper: remove check_prometheus, disable nrpe [puppet] - https://gerrit.wikimedia.org/r/1192855 (https://phabricator.wikimedia.org/T309012)
[09:44:02] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192853|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]], [[gerrit:1192852|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]]
[09:44:06] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[09:44:52] <jinxer-wm>	 FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:45:12] <wikibugs>	 (PS8) Federico Ceratto: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - https://gerrit.wikimedia.org/r/868392 (https://phabricator.wikimedia.org/T304664) (owner: Jcrespo)
[09:45:42] <wikibugs>	 SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11232563 (elukey) So we are going to definitely show some stale tiles for the next hours @TheDJ, really sorry about it but we cannot do much at the moment.
[09:47:39] <jinxer-wm>	 FIRING: [10x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[09:48:01] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad
[09:50:38] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192853|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]], [[gerrit:1192852|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:50:41] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[09:52:39] <jinxer-wm>	 FIRING: [12x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[09:53:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P83533 and previous config saved to /var/cache/conftool/dbconfig/20251001-095306-fceratto.json
[09:53:54] <wikibugs>	 (CR) Tiziano Fogli: mediawiki-engineering: Add REST API alerts with thresholds (4 comments) [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: Andrea Denisse)
[09:54:43] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[09:55:48] <wikibugs>	 (PS1) Elukey: Assign the ML K8s worker role to ml-serve1012 [puppet] - https://gerrit.wikimedia.org/r/1192856 (https://phabricator.wikimedia.org/T405891)
[09:57:08] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7169/co" [puppet] - https://gerrit.wikimedia.org/r/1192856 (https://phabricator.wikimedia.org/T405891) (owner: Elukey)
[09:57:27] <wikibugs>	 (CR) JMeybohm: [C:+1] Update eqiad to kubernetes 1.31, calico 3.29 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: Jelto)
[09:57:39] <jinxer-wm>	 FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[09:57:50] <Dreamy_Jazz>	 There is a train blocker with wmf.21 that is causing lots of errors and is from private code
[09:57:53] <wikibugs>	 (CR) Elukey: Assign the ML K8s worker role to ml-serve1012 [puppet] - https://gerrit.wikimedia.org/r/1192856 (https://phabricator.wikimedia.org/T405891) (owner: Elukey)
[09:58:02] <kostajh>	 claime: there's a production error tracked in T406094 that might be nice to resolve ahead of the Wikikube upgrade
[09:58:26] <claime>	 kostajh: checking
[09:58:57] <wikibugs>	 (CR) Klausman: [C:+2] Assign the ML K8s worker role to ml-serve1012 [puppet] - https://gerrit.wikimedia.org/r/1192856 (https://phabricator.wikimedia.org/T405891) (owner: Elukey)
[09:59:37] <kostajh>	 hashar: do you want to have that issue resolved before the Wikikube upgrade starts? 
[09:59:49] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192853|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]], [[gerrit:1192852|CreateAccount: Fix server side logging of CAPTCHA class (T405239)]] (duration: 15m 47s)
[09:59:53] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[10:00:04] <claime>	 kostajh: Do you have a patch ready to deploy for this?
[10:00:05] <jouncebot>	 claime, jelto, and jayme: Time to do the eqiad Wikikube kubernetes upgrade (no other deployments) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1000).
[10:00:08] <kostajh>	 claime: I'm done with my backport
[10:00:33] <kostajh>	 claime: yes, there's a fix for private code I've proposed, and there's also an option to revert the public code that caused the issue. Either is fine with me. 
[10:00:36] <hashar>	 I don't know, I did not know about the upgrade
[10:00:50] <claime>	 hashar: you're not on wikitech@ ?
[10:01:08] <hashar>	 regardless I don't think it matters, brennen can run the train later tonight
[10:01:25] <jinxer-wm>	 FIRING: [25x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:01:25] <kostajh>	 I think the issue is that we will continue to get bug reports until this problem is solved 
[10:01:46] <hashar>	 I can roll back group 0 
[10:01:47] <kostajh>	 so another option would be to roll back to wmf.20, but it seems the other options are better 
[10:01:55] <hashar>	 or well, revert the faulty change
[10:02:07] <claime>	 kostajh: how confident are you about the patch?
[10:02:39] <jinxer-wm>	 FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[10:03:10] <kostajh>	 claime: It's straightforward, but I'm prepared to be surprised.
[10:03:14] <hashar>	 well
[10:03:25] <hashar>	 can we rollback https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LoginNotify/+/1183102
[10:03:38] <claime>	 hashar: Both options are 15 minutes minimum anyways
[10:03:44] <hashar>	 then I guess upgrade WikiKube
[10:03:48] <hashar>	 and resume the train later tonight
[10:04:01] <kostajh>	 If you want to wait a few minutes, Dreamy_Jazz is looking at the private settings patch now
[10:04:12] <claime>	 Might as well fix forward I think, jelto jayme wdyt?
[10:04:23] <hashar>	 and the private settings patch can be deployed later tonight or tomorrow and checked independently
[10:05:56] <jayme>	 I think I don't fully understand the consequences of the issue
[10:06:17] <kostajh>	 we should move to #mediawiki_security or the task to discuss it in more detail
[10:06:29] <hashar>	 +1 :)
[10:07:39] <jinxer-wm>	 FIRING: [16x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[10:07:56] <wikibugs>	 (PS1) Gergő Tisza: Enable JWT session cookies on group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1192857 (https://phabricator.wikimedia.org/T399631)
[10:08:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T401906)', diff saved to https://phabricator.wikimedia.org/P83534 and previous config saved to /var/cache/conftool/dbconfig/20251001-100814-fceratto.json
[10:08:19] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[10:08:30] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[10:08:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T401906)', diff saved to https://phabricator.wikimedia.org/P83535 and previous config saved to /var/cache/conftool/dbconfig/20251001-100837-fceratto.json
[10:08:54] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[10:09:15] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[10:09:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T401906)', diff saved to https://phabricator.wikimedia.org/P83536 and previous config saved to /var/cache/conftool/dbconfig/20251001-100951-fceratto.json
[10:10:21] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[10:11:05] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[10:11:38] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[10:11:57] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[10:12:53] <wikibugs>	 (PS8) Daniel Kinzler: api-gateway: Remove .tpl extension from yaml files [deployment-charts] - https://gerrit.wikimedia.org/r/1189440
[10:13:13] <wikibugs>	 (PS20) Daniel Kinzler: Add rate limiting for REST gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574)
[10:14:52] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[10:17:00] <wikibugs>	 SRE, Commons, TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11232673 (Prototyperspective) Yes I know that's just the upper layer. Just mentioning this and the speed is much faster than what's needed to play Commons videos. Thanks for the elaboratio...
[10:22:16] <wikibugs>	 (PS1) Dreamy Jazz: Revert "Replace LoginNotify::getInstance with service injection" [extensions/LoginNotify] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192860 (https://phabricator.wikimedia.org/T406094)
[10:22:31] <wikibugs>	 (CR) Kosta Harlan: [C:+1] Revert "Replace LoginNotify::getInstance with service injection" [extensions/LoginNotify] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192860 (https://phabricator.wikimedia.org/T406094) (owner: Dreamy Jazz)
[10:23:05] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [extensions/LoginNotify] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192860 (https://phabricator.wikimedia.org/T406094) (owner: Dreamy Jazz)
[10:23:29] <hashar>	 we ended up deciding to revert the LoginNotify patch
[10:23:38] <hashar>	 which is the easiest/safest/fastest
[10:23:50] <wikibugs>	 (CR) Elukey: "Left a couple of nits, plus I have a higher level question/doubt. I usually like to have a generic class that reads from hiera an array of" [puppet] - https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: Herron)
[10:24:20] <hashar>	 other options were: rolling back the train, or speedily deploying the private patches, and both sounded a bit risky
[10:24:52] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[10:25:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P83537 and previous config saved to /var/cache/conftool/dbconfig/20251001-102458-fceratto.json
[10:29:12] <wikibugs>	 (CR) Elukey: "Hi! We add this fleet wide via profile::base" [puppet] - https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: Ahmon Dancy)
[10:31:11] <wikibugs>	 (Merged) jenkins-bot: Revert "Replace LoginNotify::getInstance with service injection" [extensions/LoginNotify] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192860 (https://phabricator.wikimedia.org/T406094) (owner: Dreamy Jazz)
[10:31:25] <jinxer-wm>	 FIRING: [26x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:31:29] <hashar>	 claime, kostajh: the revert has been merged, it is being deployed
[10:31:35] <claime>	 ack
[10:31:45] <logmsgbot>	 !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1192860|Revert "Replace LoginNotify::getInstance with service injection" (T406094)]]
[10:31:46] <wikibugs>	 (PS1) Samtar: EventStreamConfig and stream registration for watchlist click tracking [mediawiki-config] - https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575)
[10:36:29] <logmsgbot>	 !log hashar@deploy2002 hashar, dreamyjazz: Backport for [[gerrit:1192860|Revert "Replace LoginNotify::getInstance with service injection" (T406094)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[10:39:17] <hashar>	 I am checking whether I can still log in
[10:40:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P83538 and previous config saved to /var/cache/conftool/dbconfig/20251001-104006-fceratto.json
[10:40:24] <hashar>	 it works
[10:40:28] <logmsgbot>	 !log hashar@deploy2002 hashar, dreamyjazz: Continuing with sync
[10:42:41] <kostajh>	 hashar: nice 
[10:45:32] <logmsgbot>	 !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192860|Revert "Replace LoginNotify::getInstance with service injection" (T406094)]] (duration: 13m 47s)
[10:46:54] <hashar>	 done
[10:50:14] <claime>	 hashar: tyvm
[10:50:24] <hashar>	 I am checking logstash
[10:51:14] <hashar>	 claime: I think we are set! thanks
[10:51:16] <hashar>	 :)
[10:52:04] <wikibugs>	 (CR) Mvolz: [C:+2] Update zotero to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1191311 (owner: Mvolz)
[10:53:02] <jelto>	 hashar: great thank you
[10:53:44] <wikibugs>	 (Merged) jenkins-bot: Update zotero to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/1191311 (owner: Mvolz)
[10:54:53] <wikibugs>	 (PS1) Btullis: Vendor the base.networkpolicy module into the spark-operator chart [deployment-charts] - https://gerrit.wikimedia.org/r/1192866 (https://phabricator.wikimedia.org/T405490)
[10:55:13] <claime>	 !log Starting eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703
[10:55:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T401906)', diff saved to https://phabricator.wikimedia.org/P83539 and previous config saved to /var/cache/conftool/dbconfig/20251001-105514-fceratto.json
[10:55:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:17] <stashbot>	 T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703
[10:55:21] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[10:55:31] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[10:55:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T401906)', diff saved to https://phabricator.wikimedia.org/P83540 and previous config saved to /var/cache/conftool/dbconfig/20251001-105538-fceratto.json
[10:56:33] <wikibugs>	 (PS1) Zabe: Stop setting CategoryLinksSchemaMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/1192867 (https://phabricator.wikimedia.org/T299951)
[10:56:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T401906)', diff saved to https://phabricator.wikimedia.org/P83541 and previous config saved to /var/cache/conftool/dbconfig/20251001-105652-fceratto.json
[10:57:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[10:57:37] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[10:58:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[10:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:58:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[10:58:40] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[10:59:09] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[10:59:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[10:59:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
[10:59:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[11:00:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
[11:01:10] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/ratelimit: apply
[11:01:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/ratelimit: apply
[11:01:34] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply
[11:01:38] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply
[11:01:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/ratelimit: apply
[11:02:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply
[11:02:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.542s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:02:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[11:02:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply
[11:03:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[11:03:05] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[11:03:08] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply
[11:03:33] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[11:03:53] <logmsgbot>	 !log cgoubert@deploy2002 Locking from deployment [ALL REPOSITORIES]: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703
[11:03:58] <stashbot>	 T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703
[11:04:36] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.discovery.service-route depool toolhub in eqiad: maintenance
[11:04:38] <logmsgbot>	 !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool toolhub in eqiad: maintenance
[11:05:46] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=toolhub.*
[11:07:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 1.542s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:12:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P83542 and previous config saved to /var/cache/conftool/dbconfig/20251001-111159-fceratto.json
[11:12:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:15:44] <wikibugs>	 (CR) Stevemunene: [C:+2] admin/data: add the analytics-wikidata system user and user groups [puppet] - https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: Stevemunene)
[11:18:54] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=thumbor.*,name=codfw
[11:25:10] <Amir1>	 !log dropping two unused tables in phabricator db (T403542)
[11:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:14] <stashbot>	 T403542: Drop unexpected/unneeded database tables in Phabricator - https://phabricator.wikimedia.org/T403542
[11:26:25] <jinxer-wm>	 FIRING: [27x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:27:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P83544 and previous config saved to /var/cache/conftool/dbconfig/20251001-112707-fceratto.json
[11:29:44] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=swift.*,name=eqiad
[11:32:07] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Set up a new mailing list for Wales - https://phabricator.wikimedia.org/T406101#11232821 (Ladsgroup) We don't really create mailing lists for a full language or a whole country. There is no germany@lists.wikimedia.org or swahili@lists.wikimedia.org. It should be either about...
[11:35:37] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:35:43] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:37:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:37:11] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:39:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:39:10] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:39:12] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:39:52] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:40:05] <hnowlan>	 !incidents
[11:40:05] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[11:41:13] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=thumbor.*,name=eqiad
[11:42:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T401906)', diff saved to https://phabricator.wikimedia.org/P83545 and previous config saved to /var/cache/conftool/dbconfig/20251001-114214-fceratto.json
[11:42:19] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[11:42:31] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[11:42:36] <hnowlan>	 !log manually bumped thumbor replicas in codfw to 140 
[11:42:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:53] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[11:43:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T401906)', diff saved to https://phabricator.wikimedia.org/P83546 and previous config saved to /var/cache/conftool/dbconfig/20251001-114259-fceratto.json
[11:44:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T401906)', diff saved to https://phabricator.wikimedia.org/P83547 and previous config saved to /var/cache/conftool/dbconfig/20251001-114414-fceratto.json
[11:44:52] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:48:38] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster wikikube-eqiad: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703
[11:48:42] <stashbot>	 T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703
[11:49:10] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:49:17] <hnowlan>	 sigh
[11:49:19] <hnowlan>	 adding more replicas
[11:49:29] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:50:58] <wikibugs>	 (Abandoned) Btullis: Vendor the base.networkpolicy module into the spark-operator chart [deployment-charts] - https://gerrit.wikimedia.org/r/1192866 (https://phabricator.wikimedia.org/T405490) (owner: Btullis)
[11:51:38] <claime>	 slyngs: hnowlan: there will probably be some alerts that can't be silenced by the cookbooks once the cluster is wiped btw
[11:51:57] <hnowlan>	 ack
[11:51:57] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:00] <slyngs>	 Noted,
[11:52:19] <hnowlan>	 trying to address the thumbor issue
[11:52:28] <jayme>	 hnowlan: you need help?
[11:53:04] <hnowlan>	 just trying to scale up but there are some scrapers we could also block 
[11:53:31] <hnowlan>	 hitting quota sigh
[11:54:10] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:54:43] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[11:54:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[11:55:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:55:22] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1255 to x3 master [puppet] - https://gerrit.wikimedia.org/r/1192871 (https://phabricator.wikimedia.org/T406116)
[11:55:23] <hnowlan>	 these are all thumbor related I assume 
[11:55:35] <hnowlan>	 Emperor: am I right? or a knock-on? 
[11:56:00] <claime>	 yeah, I see 500rps of 5xxs from swift on ATS
[11:56:04] <_joe_>	 yes it's that quite a few requests get 5xx
[11:56:04] <wikibugs>	 (CR) Stevemunene: Define airflow-wikidata PG cluster and airflow instance (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: Stevemunene)
[11:56:18] <wikibugs>	 SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11232878 (Waddie96) @elukey @Muehlenhoff Thanks for working on this!
[11:56:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:57:03] <logmsgbot>	 cgoubert@cumin1003 wipe-cluster (PID 777396) is awaiting input
[11:57:04] <claime>	 !incidents
[11:57:04] <sirenbot>	 6810 (ACKED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[11:57:05] <sirenbot>	 6811 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[11:57:05] <sirenbot>	 6812 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[11:57:05] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[11:58:19] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:58:19] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:58:30] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:58:43] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:59:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:59:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:59:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P83548 and previous config saved to /var/cache/conftool/dbconfig/20251001-115922-fceratto.json
[11:59:27] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[11:59:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.167 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:59:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:59:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[11:59:58] <claime>	 !incidents
[11:59:59] <sirenbot>	 6810 (ACKED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[11:59:59] <sirenbot>	 6811 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[11:59:59] <sirenbot>	 6812 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[12:00:00] <sirenbot>	 6813 (UNACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:00:00] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:00:07] <claime>	 !ack 6813
[12:00:08] <sirenbot>	 6813 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:00:13] <hnowlan>	 trying to bump limits
[12:00:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:00:26] <hnowlan>	 can someone check capacity on the cluster to see how much headroom we have?
[12:00:29] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 16 hosts with reason: Primary switchover x3 T406116
[12:00:33] <claime>	 hnowlan: on it
[12:00:33] <stashbot>	 T406116: Switchover x3 master (db1258 -> db1255) - https://phabricator.wikimedia.org/T406116
[12:00:43] <Emperor>	 sorry, was eating, back now.
[12:01:11] <claime>	 hnowlan: 1.7kCPU max
[12:01:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:01:24] <claime>	 18TiB ram
[12:01:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.377 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:01:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:01:41] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set db1255 with weight 0 T406116', diff saved to https://phabricator.wikimedia.org/P83549 and previous config saved to /var/cache/conftool/dbconfig/20251001-120140-ladsgroup.json
[12:01:43] <wikibugs>	 (PS1) Hnowlan: admin_ng: remove thumbor limits [deployment-charts] - https://gerrit.wikimedia.org/r/1192873
[12:02:01] <hnowlan>	 claime: thanks. If you could ^
[12:02:19] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:02:19] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:02:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:02:21] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.755 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:02:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:02:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.167 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:02:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:02:34] <wikibugs>	 (CR) JMeybohm: [C:+1] admin_ng: remove thumbor limits [deployment-charts] - https://gerrit.wikimedia.org/r/1192873 (owner: Hnowlan)
[12:02:39] <jayme>	 hnowlan: done
[12:02:46] <hnowlan>	 thanks
[12:03:00] <wikibugs>	 (CR) Hnowlan: [V:+2 C:+2] admin_ng: remove thumbor limits [deployment-charts] - https://gerrit.wikimedia.org/r/1192873 (owner: Hnowlan)
[12:03:19] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.740 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:03:21] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.330 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:03:23] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.037 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:03:25] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.013 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:03:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.315 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:03:29] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:03:30] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:04:10] <_joe_>	 ok so I see the recoveries coming
[12:04:16] <wikibugs>	 (CR) Ladsgroup: [C:+2] mariadb: Promote db1255 to x3 master [puppet] - https://gerrit.wikimedia.org/r/1192871 (https://phabricator.wikimedia.org/T406116) (owner: Gerrit maintenance bot)
[12:04:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:04:22] <Emperor>	 I think the swift sadness is thumbor sadness being passed on
[12:04:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:04:29] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:04:29] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:04:33] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.797 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:04:51] <jinxer-wm>	 FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:05:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:05:05] <hnowlan>	 !incidents
[12:05:05] <_joe_>	 uh still?
[12:05:06] <sirenbot>	 6810 (ACKED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[12:05:06] <sirenbot>	 6811 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[12:05:06] <sirenbot>	 6812 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[12:05:06] <sirenbot>	 6813 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:05:06] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:05:19] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:05:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.633 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:05:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.228 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:05:29] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.725 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:05:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:05:30] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.029 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:05:31] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:05:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.895 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:05:45] <hnowlan>	 thumbor queues are still awful, error rate still high 
[12:05:51] <_joe_>	 hnowlan: is thumbor still down? yeah
[12:05:58] <Amir1>	 !log Starting x3 eqiad failover from db1258 to db1255 - T406116
[12:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:02] <stashbot>	 T406116: Switchover x3 master (db1258 -> db1255) - https://phabricator.wikimedia.org/T406116
[12:06:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.258 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:06:10] <claime>	 we haven't bumped it a second time yet right?
[12:06:12] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[12:06:15] <hnowlan>	 doing now
[12:06:19] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.524 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:06:19] <claime>	 ack
[12:06:22] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[12:06:23] <claime>	 that should help
[12:06:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.731 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:06:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:06:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:06:30] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote db1255 to x3 primary T406116', diff saved to https://phabricator.wikimedia.org/P83550 and previous config saved to /var/cache/conftool/dbconfig/20251001-120629-ladsgroup.json
[12:07:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:07:57] <hnowlan>	 bumping one more time
[12:08:05] <hnowlan>	 we might need to roll restart to dump queues :/
[12:08:09] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[12:08:17] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[12:08:19] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:08:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:09:02] <hnowlan>	 might need a statuspage update
[12:09:04] <jayme>	 hnowlan: you could probably do that with the next deployment, right?
[12:09:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:10:00] <jayme>	 !incidents
[12:10:00] <sirenbot>	 6810 (ACKED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[12:10:01] <sirenbot>	 6811 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[12:10:01] <sirenbot>	 6812 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[12:10:01] <sirenbot>	 6813 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:10:01] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:10:06] <hnowlan>	 jayme: how do you mean? 
[12:10:19] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:10:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:10:29] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:10:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:10:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:10:41] <jayme>	 hnowlan: with --state-values-set roll_restart=1 argument to helmfile
[12:10:44] <claime>	 hnowlan: helmfile -e codfw --state-values-set roll_restart=1 sync
[12:11:17] <hnowlan>	 yeah 
[12:11:19] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:11:19] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:11:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:11:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:11:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:11:23] <hnowlan>	 sorry I didn't understand the phrasing
[12:11:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.355 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:11:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:11:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:11:29] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:11:30] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[12:11:50] <jayme>	 might as well be my phrasing :)
[12:11:59] <wikibugs>	 (PS21) Daniel Kinzler: Add rate limiting for REST gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574)
[12:12:07] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:12:21] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:12:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:12:21] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.604 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:12:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:12:23] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.087 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:12:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.532 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:12:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:12:37] <claime>	 swift errors seem to be going down generally though
[12:12:38] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[12:13:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:13:31] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.109 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:13:31] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.687 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:13:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.204 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:13:41] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1258 T406116', diff saved to https://phabricator.wikimedia.org/P83551 and previous config saved to /var/cache/conftool/dbconfig/20251001-121339-ladsgroup.json
[12:13:44] <stashbot>	 T406116: Switchover x3 master (db1258 -> db1255) - https://phabricator.wikimedia.org/T406116
[12:14:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:14:10] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:14:21] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.911 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:14:21] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:14:25] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.642 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:14:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:14:29] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:14:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P83552 and previous config saved to /var/cache/conftool/dbconfig/20251001-121429-fceratto.json
[12:15:13] <wikibugs>	 (PS3) Arnaudb: Revert^6 "gerrit: fix allowlist for mod_qos" [puppet] - https://gerrit.wikimedia.org/r/1192854 (https://phabricator.wikimedia.org/T406017)
[12:15:13] <wikibugs>	 (CR) Arnaudb: [C:+2] "with safety revert" [puppet] - https://gerrit.wikimedia.org/r/1192854 (https://phabricator.wikimedia.org/T406017) (owner: Arnaudb)
[12:15:23] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:15:23] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:15:25] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.945 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:15:26] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.upgrade for db1258.eqiad.wmnet
[12:15:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:15:29] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:15:31] <wikibugs>	 (PS1) Arnaudb: Revert^7 "gerrit: fix allowlist for mod_qos" [puppet] - https://gerrit.wikimedia.org/r/1192875
[12:15:35] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.depool db1258 - Upgrading db1258.eqiad.wmnet
[12:15:39] <Emperor>	 the queue still looks quite large
[12:15:42] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1258 - Upgrading db1258.eqiad.wmnet
[12:15:45] <claime>	 yeah
[12:16:23] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.432 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:16:25] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.028 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:16:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.154 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:16:29] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.255 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:16:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:17:04] <hnowlan>	 at this point we have lots of capacity 
[12:17:07] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:17:23] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:17:29] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:17:40] <hnowlan>	 Emperor: should we roll restart swift fes? 
[12:18:31] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:18:32] <Emperor>	 queue is high but coming down. I note that the envoy graphs for swift have a lot of Upstream request overflow [not exactly sure what that means], maybe envoy-on-swift-frontends still has a backlog of requests?
[12:18:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.671 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:18:39] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:18:58] <Emperor>	 ...which would suggest a roll-restart might help. On it.
[12:19:23] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.705 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:23] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:23] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[12:19:25] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:27] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:31] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:19:41] <logmsgbot>	 !log mvernon@cumin2002 END (ERROR) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=97) rolling restart_daemons on A:swift-fe-eqiad
[12:19:43] <Emperor>	 oh, fiddlesticks, I wanted codfw not eqiad, sorry.
[12:19:54] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[12:20:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.500 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:20:21] <Emperor>	 [looking at a couple of codfw frontends, envoy is oddly spiking in CPU usage]
[12:20:23] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:20:29] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.656 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:20:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.093 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:21:11] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1258.eqiad.wmnet
[12:21:23] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:21:29] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.572 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:21:41] <logmsgbot>	 !log ladsgroup@cumin1003 START - Cookbook sre.mysql.pool db1258* gradually with 4 steps - Work done
[12:22:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:22:23] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:22:23] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:22:23] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:22:59] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:23:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:24:09] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:24:33] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:24:50] <wikibugs>	 SRE-swift-storage, Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#11232959 (AFBorchert) Resolved→Open This problem reappears as of now repeatedly on Commons.
[12:24:52] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:25:00] <Emperor>	 the thumbor queue is still very high
[12:25:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:25:23] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:25:25] <Emperor>	 [rolling-restart of swift frontends about half-way there now]
[12:25:36] <jayme>	 higher than before the roll-restart actually
[12:26:01] <Emperor>	 I thought the expectation was that a roll-restart of thumbor would clear the queue?
[12:26:11] <hnowlan>	 it did 
[12:26:15] <hnowlan>	 they just filled back up 
[12:27:09] <Emperor>	 hnowlan: that's not obviously visible in e.g. https://grafana.wikimedia.org/goto/YriZfbqNR?orgId=1
[12:27:34] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[12:27:39] <hnowlan>	 Emperor: it is here https://grafana.wikimedia.org/goto/0_FGfb3Ng?orgId=1 
[12:28:37] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:29:19] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:29:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T401906)', diff saved to https://phabricator.wikimedia.org/P83554 and previous config saved to /var/cache/conftool/dbconfig/20251001-122936-fceratto.json
[12:29:41] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[12:29:41] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=thumbor.*,name=eqiad
[12:29:52] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:29:53] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[12:30:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T401906)', diff saved to https://phabricator.wikimedia.org/P83555 and previous config saved to /var/cache/conftool/dbconfig/20251001-122959-fceratto.json
[12:30:21] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.398 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:31:04] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=swift.*,name=eqiad
[12:31:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:31:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T401906)', diff saved to https://phabricator.wikimedia.org/P83556 and previous config saved to /var/cache/conftool/dbconfig/20251001-123115-fceratto.json
[12:31:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:32:17] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.274 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:33:23] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.409 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:33:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.736 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:34:13] <wikibugs>	 SRE-swift-storage, Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#11232974 (AFBorchert) Associated discussion at Commons: https://commons.wikimedia.org/wiki/Commons:Village_pump/Technical#Upload_problem
[12:34:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:35:16] <hnowlan>	 !incidents
[12:35:17] <sirenbot>	 6810 (ACKED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[12:35:17] <sirenbot>	 6811 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[12:35:17] <sirenbot>	 6812 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[12:35:17] <sirenbot>	 6813 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:35:17] <Emperor>	 I think we need an IC and maybe a statuspage update, this is user-visible
[12:35:17] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:35:32] <wikibugs>	 (PS2) D3r1ck01: Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192632
[12:35:36] <hnowlan>	 jelto already did a statuspage update
[12:35:57] <Emperor>	 ah, cool
[12:39:10] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:39:43] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[12:39:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:39:51] <jinxer-wm>	 FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:40:22] <wikibugs>	 (PS1) Elukey: profile::maps::osm_master: refactor postgres grants [puppet] - https://gerrit.wikimedia.org/r/1192877 (https://phabricator.wikimedia.org/T381565)
[12:40:59] <Emperor>	 crawling through the logs on 1 proxy server, with a backtrace about write timeout; it served 206 requests in that second, of which one resulted in a 503, and that was a thumbnail write request.
[12:41:09] <wikibugs>	 (CR) Elukey: "@mmuhlenhoff@wikimedia.org not sure if I am missing something, but I had to make these two workarounds to allow maps2011 to work properly." [puppet] - https://gerrit.wikimedia.org/r/1192877 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[12:41:22] <Emperor>	 sorry, request to _read_ a thumbnail
[12:41:28] <hnowlan>	 Emperor: could we potentially have overloaded swift with read traffic that never even hit thumbor? 
[12:41:36] <Emperor>	 (i.e. where I expect it would have called out to thumbor)
[12:41:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:42:20] <Emperor>	 So the proxy-server backtrace for a failed write timeout corresponded as best as I can tell to an incoming GET for a thumb
[12:42:53] <hnowlan>	 thumbor has recovered, queues are at 0 and errors are reasonable
[12:42:59] <wikibugs>	 (PS22) Daniel Kinzler: api-gateway: Add rate limiting for REST gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574)
[12:43:23] <Emperor>	 swift errors are declining too
[12:44:10] <wikibugs>	 (PS1) Daniel Kinzler: api-gateway: support custom rate limit groups for rest gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1192879
[12:44:51] <jinxer-wm>	 RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:46:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P83558 and previous config saved to /var/cache/conftool/dbconfig/20251001-124622-fceratto.json
[12:47:03] <wikibugs>	 SRE-swift-storage, Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#11233038 (MatthewVernon) Open→Resolved a:MatthewVernon This should have been a new issue, but in any case, we published a [[https://www.wikimediastatus.net/inciden...
[12:48:13] <wikibugs>	 (PS1) Ladsgroup: db1172: Upgrade to 10.11 [puppet] - https://gerrit.wikimedia.org/r/1192880 (https://phabricator.wikimedia.org/T406008)
[12:48:38] <Emperor>	 !incidents
[12:48:38] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:48:39] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[12:48:39] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[12:48:39] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[12:48:39] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[12:49:12] <Emperor>	 hnowlan: I think we're back to normal operation now; shall we close out the incident, or leave it a little first?
[12:50:43] <hnowlan>	 I think we're mostly good, we're following up on traffic patterns
[12:50:46] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1172.eqiad.wmnet with reason: Upgrade to 10.11
[12:51:21] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool db1172 for upgrade T406008', diff saved to https://phabricator.wikimedia.org/P83559 and previous config saved to /var/cache/conftool/dbconfig/20251001-125120-ladsgroup.json
[12:51:25] <stashbot>	 T406008: Migrate s8 to 10.11 - https://phabricator.wikimedia.org/T406008
[12:53:10] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=swift.*,name=eqiad
[12:53:26] <wikibugs>	 (PS2) Ladsgroup: db1172: Upgrade to 10.11 [puppet] - https://gerrit.wikimedia.org/r/1192880 (https://phabricator.wikimedia.org/T406008)
[12:54:05] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] db1172: Upgrade to 10.11 [puppet] - https://gerrit.wikimedia.org/r/1192880 (https://phabricator.wikimedia.org/T406008) (owner: Ladsgroup)
[12:56:00] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=thumbor.*,name=eqiad
[13:01:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P83561 and previous config saved to /var/cache/conftool/dbconfig/20251001-130131-fceratto.json
[13:05:43] <wikibugs>	 (PS2) DDesouza: Update and deploy reader foundational survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1192635 (https://phabricator.wikimedia.org/T405410)
[13:07:00] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192635 (https://phabricator.wikimedia.org/T405410) (owner: DDesouza)
[13:07:11] <logmsgbot>	 !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1258* gradually with 4 steps - Work done
[13:07:57] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11233092 (Jclark-ctr) Replaced Failed Drive
[13:08:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:10:07] <wikibugs>	 (CR) Clément Goubert: [C:+2] admin_ng: Change eqiad pod ip range to 10.67.128.0/17 [deployment-charts] - https://gerrit.wikimedia.org/r/1191647 (https://phabricator.wikimedia.org/T375845) (owner: Jelto)
[13:10:13] <wikibugs>	 (CR) Clément Goubert: [C:+2] Update eqiad to k8s 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: Jelto)
[13:10:15] <wikibugs>	 (CR) Clément Goubert: [C:+2] Update eqiad pod ip range [puppet] - https://gerrit.wikimedia.org/r/1191652 (https://phabricator.wikimedia.org/T375845) (owner: Jelto)
[13:10:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-k8s-ssl_6543: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worker1108.
[13:10:21] <icinga-wm>	 net, wikikube-worker1116.eqiad.wmnet, wikikube-worker1281.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1072.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker1168.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube
[13:10:21] <icinga-wm>	 037.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worker1119.eqiad.wmnet, wikikube-worker1162.eqiad.wmnet, wikikube-worker1130.eqiad.wmnet, wikikube-worker1143.eqiad.wmnet, wik https://wikitech.wikimedia.org/wiki/PyBal
[13:10:23] <wikibugs>	 (CR) Clément Goubert: [C:+2] Update eqiad to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: Jelto)
[13:10:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-k8s-ssl_6543: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1144.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1094.
[13:10:23] <icinga-wm>	 net, wikikube-worker1076.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1157.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, wikikube-worker1168.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube
[13:10:23] <icinga-wm>	 278.eqiad.wmnet, wikikube-worker1119.eqiad.wmnet, wikikube-worker1135.eqiad.wmnet, wikikube-worker1098.eqiad.wmnet, wikikube-worker1309.eqiad.wmnet, wikikube-worker1143.eqiad.wmnet, wik https://wikitech.wikimedia.org/wiki/PyBal
[13:10:36] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool db1172 after upgrade T406008', diff saved to https://phabricator.wikimedia.org/P83563 and previous config saved to /var/cache/conftool/dbconfig/20251001-131033-ladsgroup.json
[13:10:40] <stashbot>	 T406008: Migrate s8 to 10.11 - https://phabricator.wikimedia.org/T406008
[13:10:53] <hnowlan>	 !incidents
[13:10:54] <sirenbot>	 6814 (UNACKED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[13:10:54] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[13:10:54] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[13:10:54] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[13:10:55] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[13:10:55] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[13:11:03] <hnowlan>	 !ack 6814
[13:11:04] <sirenbot>	 6814 (ACKED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[13:11:07] <hnowlan>	 tappof: ^ :) 
[13:11:57] <logmsgbot>	 cgoubert@cumin1003 wipe-cluster (PID 777396) is awaiting input
[13:11:58] <jinxer-wm>	 FIRING: [25x] ProbeDown: Service chart-renderer:30443 has failed probes (http_chart-renderer_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:12:36] <hnowlan>	 !incidents
[13:12:37] <sirenbot>	 6814 (ACKED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[13:12:37] <sirenbot>	 6815 (ACKED)  [25x] ProbeDown sre (ip4 probes/service eqiad)
[13:12:37] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[13:12:37] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[13:12:37] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[13:12:38] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[13:12:38] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[13:12:47] <claime>	 ProbeDown are expected
[13:13:14] <wikibugs>	 (PS12) Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - https://gerrit.wikimedia.org/r/1189502
[13:13:41] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet']
[13:13:43] <wikibugs>	 (PS1) Arnaudb: gerrit: toggle mod_qos log only [puppet] - https://gerrit.wikimedia.org/r/1192882 (https://phabricator.wikimedia.org/T406017)
[13:14:51] <jinxer-wm>	 FIRING: [12x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:15:03] <tappof>	 hnowlan: /var/lib/o11y-metamonitoring/deadmanswitchamhook/prometheus_k8s_eqiad has a timestamp older than 600
[13:15:23] <hnowlan>	 tappof: ah nice, good to know
[13:15:26] <wikibugs>	 (PS1) D3r1ck01: session: Handle an edge-case in MultiBackendSessionStore::set() [core] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192884 (https://phabricator.wikimedia.org/T402808)
[13:15:27] <claime>	 SwaggerProbeHasFailures expected
[13:15:44] <wikibugs>	 (PS1) D3r1ck01: session: Handle an edge-case in MultiBackendSessionStore::set() [core] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192885 (https://phabricator.wikimedia.org/T402808)
[13:15:52] <hnowlan>	 thumbor looks fine
[13:16:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T401906)', diff saved to https://phabricator.wikimedia.org/P83564 and previous config saved to /var/cache/conftool/dbconfig/20251001-131639-fceratto.json
[13:16:45] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[13:16:55] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[13:17:12] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1192884 (https://phabricator.wikimedia.org/T402808) (owner: D3r1ck01)
[13:17:13] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1254.eqiad.wmnet with reason: Maintenance
[13:17:19] <wikibugs>	 (Merged) jenkins-bot: admin_ng: Change eqiad pod ip range to 10.67.128.0/17 [deployment-charts] - https://gerrit.wikimedia.org/r/1191647 (https://phabricator.wikimedia.org/T375845) (owner: Jelto)
[13:17:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1254 (T401906)', diff saved to https://phabricator.wikimedia.org/P83565 and previous config saved to /var/cache/conftool/dbconfig/20251001-131719-fceratto.json
[13:17:35] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192885 (https://phabricator.wikimedia.org/T402808) (owner: D3r1ck01)
[13:18:35] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192632 (owner: D3r1ck01)
[13:18:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T401906)', diff saved to https://phabricator.wikimedia.org/P83566 and previous config saved to /var/cache/conftool/dbconfig/20251001-131836-fceratto.json
[13:19:10] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:20:04] <wikibugs>	 (Merged) jenkins-bot: Update eqiad to k8s 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: Jelto)
[13:20:26] <wikibugs>	 (CR) CI reject: [V:-1] WIP: test upgrade-firmware for idrac 10 [cookbooks] - https://gerrit.wikimedia.org/r/1189502 (owner: Elukey)
[13:20:47] <tappof>	 hnowlan: /buffer 4
[13:23:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[13:24:20] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[13:24:50] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[13:24:56] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[13:25:44] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11233155 (Ladsgroup) Thanks!
[13:26:25] <jinxer-wm>	 FIRING: [28x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:28:32] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[13:28:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[13:29:54] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[13:30:00] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[13:30:11] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[13:30:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[13:30:30] <logmsgbot>	 cgoubert@cumin1003 wipe-cluster (PID 777396) is awaiting input
[13:30:31] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet']
[13:30:41] <wikibugs>	 SRE-SLO, EditCheck, Editing-team (Kanban Board), Essential-Work, Goal: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11233174 (ppelberg)
[13:30:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[13:30:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[13:31:38] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[13:31:44] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:33:28] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:33:33] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[13:33:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P83568 and previous config saved to /var/cache/conftool/dbconfig/20251001-133344-fceratto.json
[13:33:49] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[13:33:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:34:29] <wikibugs>	 (PS1) Btullis: Remove the airflow profile from the analytics_cluster::launcher role [puppet] - https://gerrit.wikimedia.org/r/1192889 (https://phabricator.wikimedia.org/T402943)
[13:34:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:34:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[13:34:54] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405973#11233204 (phaultfinder)
[13:35:10] <wikibugs>	 (PS1) Bking: wdqs-scholarly: Add wdqs2016 to load balancer pool [puppet] - https://gerrit.wikimedia.org/r/1192890 (https://phabricator.wikimedia.org/T405978)
[13:35:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[13:35:18] <logmsgbot>	 !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.wipe-cluster (exit_code=99) Wipe the K8s cluster wikikube-eqiad: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703
[13:35:22] <stashbot>	 T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703
[13:35:51] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7170/co" [puppet] - https://gerrit.wikimedia.org/r/1192889 (https://phabricator.wikimedia.org/T402943) (owner: Btullis)
[13:37:36] <wikibugs>	 (CR) Btullis: [C:+1] wdqs-scholarly: Add wdqs2016 to load balancer pool [puppet] - https://gerrit.wikimedia.org/r/1192890 (https://phabricator.wikimedia.org/T405978) (owner: Bking)
[13:38:54] <wikibugs>	 (CR) Stevemunene: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1192890 (https://phabricator.wikimedia.org/T405978) (owner: Bking)
[13:39:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:40:30] <hnowlan>	 here
[13:40:43] <hnowlan>	 thumbor is fine? 
[13:41:40] <hnowlan>	 what 
[13:41:55] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:41:57] <hnowlan>	 stale alert paging again? 
[13:42:21] <hnowlan>	 !incidents
[13:42:22] <sirenbot>	 6814 (ACKED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[13:42:22] <sirenbot>	 6815 (ACKED)  [25x] ProbeDown sre (ip4 probes/service eqiad)
[13:42:22] <sirenbot>	 6816 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[13:42:22] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[13:42:23] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[13:42:23] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[13:42:23] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[13:42:23] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[13:44:37] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl[1001-1004].eqiad.wmnet
[13:44:37] <logmsgbot>	 !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-ctrl[1001-1004].eqiad.wmnet
[13:44:40] <SandraEbele_>	 !log Deployed refinery-source using jenkins(weekly deployment train)
[13:44:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:52] <jinxer-wm>	 FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:46:34] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[13:47:25] <wikibugs>	 (Abandoned) Arnaudb: Revert^7 "gerrit: fix allowlist for mod_qos" [puppet] - https://gerrit.wikimedia.org/r/1192875 (owner: Arnaudb)
[13:47:50] <wikibugs>	 (CR) Bking: [C:+1] Remove the airflow profile from the analytics_cluster::launcher role [puppet] - https://gerrit.wikimedia.org/r/1192889 (https://phabricator.wikimedia.org/T402943) (owner: Btullis)
[13:48:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P83569 and previous config saved to /var/cache/conftool/dbconfig/20251001-134852-fceratto.json
[13:49:02] <wikibugs>	 (CR) Bking: [C:+2] wdqs-scholarly: Add wdqs2016 to load balancer pool [puppet] - https://gerrit.wikimedia.org/r/1192890 (https://phabricator.wikimedia.org/T405978) (owner: Bking)
[13:49:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:51:26] <logmsgbot>	 !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 239 hosts with reason: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703
[13:51:32] <stashbot>	 T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703
[13:51:57] <wikibugs>	 (CR) Andrea Denisse: mediawiki-engineering: Add REST API alerts with thresholds (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1192183 (https://phabricator.wikimedia.org/T405151) (owner: Andrea Denisse)
[13:52:07] <wikibugs>	 (PS1) Elukey: Set ml-serve1012 as GPU k8s worker [puppet] - https://gerrit.wikimedia.org/r/1192894 (https://phabricator.wikimedia.org/T405891)
[13:52:08] <wikibugs>	 SRE, collaboration-services, envoy, serviceops, Patch-For-Review: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11233317 (Eevans) The RESTBase cluster has been upgraded to v1.29.12 (sorry for the delay, I was out all last week and missed the message).
[13:52:36] <tappof>	 !incidents
[13:52:36] <sirenbot>	 6815 (ACKED)  [25x] ProbeDown sre (ip4 probes/service eqiad)
[13:52:36] <sirenbot>	 6814 (RESOLVED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[13:52:36] <sirenbot>	 6816 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[13:52:37] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[13:52:37] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[13:52:37] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[13:52:37] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[13:52:38] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[13:53:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[13:54:09] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on alert1002 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[13:54:21] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on alert1002 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[13:54:57] <wikibugs>	 (CR) Elukey: [C:+2] Set ml-serve1012 as GPU k8s worker [puppet] - https://gerrit.wikimedia.org/r/1192894 (https://phabricator.wikimedia.org/T405891) (owner: Elukey)
[13:56:41] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=wdqs2016\.codfw\.wmnet
[13:58:38] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[13:59:56] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405973#11233358 (phaultfinder)
[14:00:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[14:01:40] <jinxer-wm>	 FIRING: [5x] KubernetesRsyslogDown: rsyslog on wikikube-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:01:48] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=toolhub.*
[14:02:59] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/apertium: apply
[14:03:31] <wikibugs>	 (PS3) Slyngshede: P:cache::haproxy copy private repo data [puppet] - https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161)
[14:03:50] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[14:04:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T401906)', diff saved to https://phabricator.wikimedia.org/P83570 and previous config saved to /var/cache/conftool/dbconfig/20251001-140400-fceratto.json
[14:04:04] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[14:04:16] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1259.eqiad.wmnet with reason: Maintenance
[14:04:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1259 (T401906)', diff saved to https://phabricator.wikimedia.org/P83571 and previous config saved to /var/cache/conftool/dbconfig/20251001-140422-fceratto.json
[14:04:27] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply
[14:04:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[14:04:52] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:04:53] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a3-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406025#11233375 (phaultfinder)
[14:05:08] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply
[14:05:15] <hnowlan>	 !incidents 
[14:05:15] <sirenbot>	 6815 (ACKED)  [25x] ProbeDown sre (ip4 probes/service eqiad)
[14:05:15] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[14:05:15] <sirenbot>	 6817 (UNACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:05:16] <sirenbot>	 6814 (RESOLVED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[14:05:16] <sirenbot>	 6816 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:05:16] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:05:16] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[14:05:16] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[14:05:17] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[14:05:17] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:05:24] <hnowlan>	 !ack 6817
[14:05:24] <sirenbot>	 6817 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:05:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T401906)', diff saved to https://phabricator.wikimedia.org/P83572 and previous config saved to /var/cache/conftool/dbconfig/20251001-140538-fceratto.json
[14:05:51] <hnowlan>	 Emperor: is swift looking okay? seems like there's an elevated level of errors but just from esams so seems unlikely 
[14:06:00] <wikibugs>	 (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[14:06:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply
[14:06:14] <claime>	 hnowlan: I'm deploying thumbor in eqiad rn, will be ready to repool soon tm
[14:06:22] <hnowlan>	 ack
[14:06:26] <hnowlan>	 thumbor itself looks fine
[14:06:37] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[14:06:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[14:06:53] <hnowlan>	 slyngs: could you look at the above alert please? i'm in a meeting
[14:06:59] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[14:07:03] <slyngs>	 Sure
[14:07:56] <Emperor>	 hnowlan: meeting right now, do I need to drop?
[14:08:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[14:08:28] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[14:08:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:09:09] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
[14:09:10] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:09:48] <hnowlan>	 Emperor: not urgent I think
[14:09:52] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service k8s-ingress-wikikube:30443 has failed probes (tcp_k8s-ingress-wikikube_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:09:55] <claime>	 Emperor: hnowlan I can repool thumbor in eqiad now
[14:10:15] <claime>	 y/n?
[14:11:24] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
[14:11:30] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[14:11:35] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:11:40] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[14:11:46] <slyngs>	 Errors don't look elevated in Grafana
[14:11:47] <wikibugs>	 (Abandoned) Elukey: WIP: test upgrade-firmware for idrac 10 [cookbooks] - https://gerrit.wikimedia.org/r/1189502 (owner: Elukey)
[14:12:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[14:13:07] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply
[14:13:30] <wikibugs>	 (PS1) Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - https://gerrit.wikimedia.org/r/1192898
[14:14:10] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service chart-renderer:30443 has failed probes (http_chart-renderer_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:14:12] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet']
[14:14:18] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply
[14:14:18] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2048.codfw.wmnet']
[14:14:21] <wikibugs>	 (CR) Stevemunene: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1192889 (https://phabricator.wikimedia.org/T402943) (owner: Btullis)
[14:14:52] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[14:15:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[14:15:28] <wikibugs>	 SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2025/26-Q1), Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11233469 (taavi) p:Triage→Medium
[14:15:47] <slyngs>	 Ever so slightly elevated compared to the alerting limit of 3 req/s
[14:16:20] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[14:16:34] <claime>	 Ok I'm repooling thumbor and swift in eqiad hnowlan Emperor 
[14:16:45] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=thumbor.*,name=eqiad
[14:16:54] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=true; selector: dnsdisc=swift.*,name=eqiad
[14:16:59] <logmsgbot>	 !log cgoubert@cumin1003 conftool action : set/pooled=false; selector: dnsdisc=thumbor.*,name=codfw
[14:17:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply
[14:18:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply
[14:18:25] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[14:18:42] <jayme>	 !incidents
[14:18:42] <sirenbot>	 6815 (ACKED)  [25x] ProbeDown sre (ip4 probes/service eqiad)
[14:18:42] <sirenbot>	 6817 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:18:43] <sirenbot>	 6814 (RESOLVED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[14:18:43] <sirenbot>	 6816 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:18:43] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:18:43] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[14:18:44] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[14:18:44] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[14:18:44] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:18:47] <wikibugs>	 (PS2) Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - https://gerrit.wikimedia.org/r/1192898
[14:18:58] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet']
[14:19:33] <hnowlan>	 claime: go, sorry
[14:19:38] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[14:19:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:19:44] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply
[14:19:59] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406015#11233490 (phaultfinder)
[14:20:48] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
[14:20:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply
[14:21:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/echostore: apply
[14:21:52] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply
[14:22:06] <slyngs>	 The ATSBackendErrorsHigh looks to be going down
[14:22:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[14:22:37] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad
[14:23:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[14:23:17] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[14:24:07] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[14:24:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[14:24:30] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[14:24:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[14:24:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:24:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[14:24:51] <wikibugs>	 (CR) CI reject: [V:-1] WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - https://gerrit.wikimedia.org/r/1192898 (owner: Elukey)
[14:24:52] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[14:24:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[14:24:59] <logmsgbot>	 !log cgoubert@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 (duration: 201m 05s)
[14:25:02] <stashbot>	 T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703
[14:25:25] <logmsgbot>	 !log cgoubert@deploy2002 Started scap sync-world: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703
[14:25:33] <slyngs>	 !incidents
[14:25:33] <sirenbot>	 6815 (ACKED)  [25x] ProbeDown sre (ip4 probes/service eqiad)
[14:25:34] <sirenbot>	 6817 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:25:34] <sirenbot>	 6814 (RESOLVED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[14:25:34] <sirenbot>	 6816 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:25:34] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:25:34] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[14:25:35] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[14:25:35] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[14:25:35] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[14:25:37] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[14:26:00] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:26:08] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[14:26:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[14:26:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[14:27:27] <icinga-wm>	 RECOVERY - MegaRAID on db1152 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:28:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[14:28:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[14:29:10] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job probes/swagger in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:29:13] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[14:29:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[14:29:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:29:56] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[14:30:05] <jouncebot>	 claime, jelto, and jayme: I, the Bot under the Fountain, call upon thee, The Deployer, to do eqiad Wikikube kubernetes upgrade (no other deployments) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1000).
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1430)
[14:30:31] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet']
[14:30:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply
[14:30:58] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply
[14:31:18] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply
[14:31:35] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply
[14:31:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[14:32:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[14:32:20] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
[14:32:23] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Remove the airflow profile from the analytics_cluster::launcher role [puppet] - https://gerrit.wikimedia.org/r/1192889 (https://phabricator.wikimedia.org/T402943) (owner: Btullis)
[14:32:50] <wikibugs>	 (PS1) Majavah: P:toolforge::k8s::haproxy: Remove no-op maxconn statement [puppet] - https://gerrit.wikimedia.org/r/1192899 (https://phabricator.wikimedia.org/T406010)
[14:33:14] <wikibugs>	 (CR) Slyngshede: "I'm not sure if we necessarily want to dynamically load the Lua files, but it's an option." [puppet] - https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[14:33:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
[14:33:29] <wikibugs>	 (PS2) Majavah: P:toolforge::k8s::haproxy: Remove no-op maxconn statement [puppet] - https://gerrit.wikimedia.org/r/1192899 (https://phabricator.wikimedia.org/T406010)
[14:33:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[14:33:59] * Emperor out of meeting; everything looks good now?
[14:34:12] <slyngs>	 Yes, errors are back down to normal levels
[14:34:32] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
[14:34:51] <jinxer-wm>	 FIRING: [14x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:35:33] <Emperor>	 👍
[14:36:57] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:37:41] <wikibugs>	 ops-codfw, SRE, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11233546 (elukey) I managed to have an idrac upgrade triggered by the cookbook, but it then failed when checking the state of the idrac (that was down because...
[14:37:58] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[14:38:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[14:38:27] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[14:38:34] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply
[14:38:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply
[14:38:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[14:39:51] <jinxer-wm>	 FIRING: [13x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:40:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[14:40:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[14:40:46] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:40:52] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[14:40:59] <wikibugs>	 (CR) FNegri: [C:+1] P:toolforge::k8s::haproxy: Remove no-op maxconn statement [puppet] - https://gerrit.wikimedia.org/r/1192899 (https://phabricator.wikimedia.org/T406010) (owner: Majavah)
[14:41:04] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:41:07] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:41:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[14:41:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:41:25] <jinxer-wm>	 FIRING: [29x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:42:03] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge::k8s::haproxy: Remove no-op maxconn statement [puppet] - https://gerrit.wikimedia.org/r/1192899 (https://phabricator.wikimedia.org/T406010) (owner: Majavah)
[14:43:37] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[14:43:45] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[14:44:10] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[14:44:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[14:44:24] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[14:44:30] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[14:44:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[14:44:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:44:51] <jinxer-wm>	 FIRING: [11x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:44:56] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[14:45:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[14:45:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[14:46:23] <wikibugs>	 (CR) Vgutierrez: "correct me if I'm wrong but the current implementation won't reload main.lua ever" [puppet] - https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[14:46:46] <wikibugs>	 (CR) Ahmon Dancy: osm_master: Create /etc/wikimedia directory (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: Ahmon Dancy)
[14:48:35] <wikibugs>	 (PS1) Majavah: P:toolforge::prometheus: Add external label with project [puppet] - https://gerrit.wikimedia.org/r/1192904 (https://phabricator.wikimedia.org/T406010)
[14:49:32] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
[14:49:52] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060#11233583 (Jhancock.wm) machine in warranty. requested drive replacement from dell. SR216590250
[14:49:53] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7171/co" [puppet] - https://gerrit.wikimedia.org/r/1192904 (https://phabricator.wikimedia.org/T406010) (owner: Majavah)
[14:50:01] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060#11233584 (Jhancock.wm) a:Jhancock.wm
[14:51:06] <wikibugs>	 (CR) Majavah: osm_master: Create /etc/wikimedia directory (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: Ahmon Dancy)
[14:51:56] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-c6-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406015#11233585 (Jhancock.wm) Open→Resolved
[14:52:02] <wikibugs>	 SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11233586 (elukey) Maps back to be served only by the old stack, the k8s maintenance is completed.  I am going to warm up the tegola's cache in codfw properly, but we have a good indica...
[14:53:57] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-d2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405973#11233587 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:55:01] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a3-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406025#11233590 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:56:56] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1192904 (https://phabricator.wikimedia.org/T406010) (owner: Majavah)
[14:57:11] <wikibugs>	 (Abandoned) Federico Ceratto: es2049.yaml: enable notifications [puppet] - https://gerrit.wikimedia.org/r/1184092 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[14:58:04] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:toolforge::prometheus: Add external label with project [puppet] - https://gerrit.wikimedia.org/r/1192904 (https://phabricator.wikimedia.org/T406010) (owner: Majavah)
[14:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:27] <wikibugs>	 (Abandoned) Federico Ceratto: Prepare new es2* nodes to replace old ones [puppet] - https://gerrit.wikimedia.org/r/1182507 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[14:58:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:01:56] <wikibugs>	 (CR) Herron: [V:+1] "This follows the multi-instance pattern from our prometheus puppetization with profiles for each instance.  The instances would be main/pi" [puppet] - https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: Herron)
[15:04:25] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:04:34] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[15:04:41] <wikibugs>	 (PS1) Federico Ceratto: preseed.yaml: Remove es2051 from preseeding [puppet] - https://gerrit.wikimedia.org/r/1192905 (https://phabricator.wikimedia.org/T402859)
[15:05:21] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[15:07:05] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[15:09:10] <jinxer-wm>	 FIRING: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:09:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:52] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:11:58] <jinxer-wm>	 FIRING: [15x] ProbeDown: Service mw-api-ext-next:4455 has failed probes (http_mw-api-ext-next_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:12:40] <_joe_>	 claime: is that because of your work I guess?
[15:13:22] <claime>	 downtimes expired if I had to guess
[15:15:05] <wikibugs>	 (PS1) Kgraessle: set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1192907 (https://phabricator.wikimedia.org/T400727)
[15:15:49] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[15:16:05] <jelto>	 yep one downtime expired 4 minutes ago for this service, I can re-create it with 30m downtime
[15:16:13] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[15:16:17] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[15:16:25] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[15:16:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[15:17:16] <jelto>	 !incidents
[15:17:17] <sirenbot>	 6815 (ACKED)  [25x] ProbeDown sre (ip4 probes/service eqiad)
[15:17:17] <sirenbot>	 6817 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[15:17:17] <sirenbot>	 6814 (RESOLVED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[15:17:17] <sirenbot>	 6816 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[15:17:18] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[15:17:18] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[15:17:18] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[15:17:18] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[15:17:19] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[15:17:28] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[15:17:33] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[15:17:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[15:18:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[15:18:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[15:18:37] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[15:18:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
[15:19:04] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
[15:19:09] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply
[15:19:10] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:19:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:19:52] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:19:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[15:20:20] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply
[15:20:32] <wikibugs>	 (PS2) Ahmon Dancy: Add traindev-staging environment for mw-web and mw-debug [deployment-charts] - https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350)
[15:20:47] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply
[15:20:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/ratelimit: apply
[15:21:13] <wikibugs>	 (PS3) Ahmon Dancy: Add traindev-staging environment for mw-web and mw-debug [deployment-charts] - https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350)
[15:21:20] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply
[15:21:27] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[15:21:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[15:21:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply
[15:21:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply
[15:22:05] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[15:22:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[15:23:04] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply
[15:23:17] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply
[15:23:34] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[15:24:09] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[15:24:10] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:24:25] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[15:24:46] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[15:24:51] <jinxer-wm>	 FIRING: [4x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:24:52] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:25:49] <logmsgbot>	 !log cgoubert@deploy2002 Started scap sync-world: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703
[15:25:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[15:25:53] <stashbot>	 T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703
[15:26:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[15:26:21] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repool db1259 after maint T401906', diff saved to https://phabricator.wikimedia.org/P83573 and previous config saved to /var/cache/conftool/dbconfig/20251001-152620-ladsgroup.json
[15:26:25] <stashbot>	 T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[15:26:32] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[15:26:45] <wikibugs>	 (PS1) Sergio Gimeno: Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382)
[15:26:47] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[15:26:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[15:27:00] <wikibugs>	 (CR) Ladsgroup: [C:+1] preseed.yaml: Remove es2051 from preseeding [puppet] - https://gerrit.wikimedia.org/r/1192905 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[15:27:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[15:27:18] <wikibugs>	 (PS1) Cwhite: logstash: w3creportingapi drop canary events [puppet] - https://gerrit.wikimedia.org/r/1192914 (https://phabricator.wikimedia.org/T304373)
[15:27:59] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap sync-world: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 (duration: 03m 16s)
[15:29:33] <wikibugs>	 (PS1) Kgraessle: set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1192907 (https://phabricator.wikimedia.org/T400727)
[15:29:44] <jinxer-wm>	 RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:30:05] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[15:30:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[15:31:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply
[15:31:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply
[15:32:04] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:32:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:32:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[15:33:10] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[15:33:15] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[15:33:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[15:33:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:34:10] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:34:21] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:34:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[15:34:44] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[15:34:51] <jinxer-wm>	 RESOLVED: [4x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:34:52] <jinxer-wm>	 FIRING: [6x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:34:52] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:40] <claime>	 !log Finished eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703
[15:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:44] <stashbot>	 T405703: Update wikikube eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T405703
[15:36:10] <wikibugs>	 (PS1) JHathaway: acme-chief: remove hiera purge guard [puppet] - https://gerrit.wikimedia.org/r/1192917 (https://phabricator.wikimedia.org/T401858)
[15:36:12] <wikibugs>	 (CR) DCausse: [C:+2] flink jobs: resume search & wdqs jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1192591 (owner: DCausse)
[15:36:57] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:37:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:38:13] <wikibugs>	 (Merged) jenkins-bot: flink jobs: resume search & wdqs jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1192591 (owner: DCausse)
[15:38:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:39:15] <jinxer-wm>	 FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:39:20] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:41:21] <claime>	 hashar: we're done with the maintenance, so the train can continue running or whatever it is it does :)
[15:41:25] <jinxer-wm>	 FIRING: [29x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:41:39] <hashar>	 claime: awesome!!  And sorry for the delay earlier today
[15:42:29] <hashar>	 releng has its team meeting in ~ 25 minutes, I'll talk about the train and I guess it will be resumed at the usual late UTC evening window
[15:42:39] <jinxer-wm>	 FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:42:55] <taavi>	 I sense a chance of posting train photos without being fully off-topic so: https://commons.wikimedia.org/wiki/File:ArcticRail_Dr16_2811_Tampere_2025-09-30.jpg
[15:43:33] <hashar>	 that is a nice one taavi !
[15:44:00] <wikibugs>	 (PS1) Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490)
[15:46:25] <wikibugs>	 ops-codfw, SRE, DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11233800 (Jhancock.wm)
[15:46:29] <hashar>	 taavi: {done} https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys
[15:46:30] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:46:36] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:47:32] <wikibugs>	 (CR) Cwhite: [C:+2] logstash: w3creportingapi drop canary events [puppet] - https://gerrit.wikimedia.org/r/1192914 (https://phabricator.wikimedia.org/T304373) (owner: Cwhite)
[15:47:39] <jinxer-wm>	 FIRING: [14x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:47:42] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:47:43] <wikibugs>	 (PS2) Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490)
[15:47:58] <wikibugs>	 (CR) Federico Ceratto: [C:+2] preseed.yaml: Remove es2051 from preseeding [puppet] - https://gerrit.wikimedia.org/r/1192905 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[15:48:07] <hnowlan>	 !incidents
[15:48:08] <sirenbot>	 6815 (RESOLVED)  [25x] ProbeDown sre (ip4 probes/service eqiad)
[15:48:08] <sirenbot>	 6817 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[15:48:08] <sirenbot>	 6814 (RESOLVED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[15:48:08] <sirenbot>	 6816 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[15:48:08] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[15:48:09] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[15:48:09] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[15:48:09] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[15:48:09] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[15:48:18] <hnowlan>	 ah ok
[15:49:13] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:49:18] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:51:20] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:51:32] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:51:34] <wikibugs>	 (03PS1) 10Kgraessle: Enable AutoModerator on Italian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152)
[15:51:45] <wikibugs>	 (03PS1) 10Kosta Harlan: SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192922
[15:52:02] <wikibugs>	 (03PS1) 10Kosta Harlan: SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter [extensions/ConfirmEdit] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192923
[15:52:39] <jinxer-wm>	 RESOLVED: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1069-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:52:42] <wikibugs>	 (03PS1) 10Kosta Harlan: CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192924 (https://phabricator.wikimedia.org/T405239)
[15:52:55] <wikibugs>	 (03PS1) 10Kosta Harlan: CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192925 (https://phabricator.wikimedia.org/T405239)
[15:53:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060#11233835 (10Jhancock.wm) and rejected. cause the TSR report doesn't show a disk error. The report actually shows an indeterminate bus error. this could actually be the drive error but i can't tell...
[15:53:19] <kostajh>	 jouncebot: nowandnext
[15:53:19] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 6 minute(s)
[15:53:20] <jouncebot>	 In 1 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1700)
[15:53:29] <kostajh>	 hashar: are you doing the train now, or can I run some more backports? 
[15:54:01] <hashar>	 go ahead with backports!
[15:54:22] <hashar>	 we will do the train in a couple hours via https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1800
[15:54:26] <kostajh>	 ok
[15:54:32] <hashar>	 I'll mention it in the releng team meeting
[15:54:53] <wikibugs>	 (03PS3) 10Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490)
[15:55:29] <wikibugs>	 (03Abandoned) 10Btullis: spark: authorize communication between executors on blockManager port [deployment-charts] - 10https://gerrit.wikimedia.org/r/902409 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison)
[15:56:05] <wikibugs>	 (03Abandoned) 10Btullis: spark: add hadoop conf configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/902402 (https://phabricator.wikimedia.org/T332909) (owner: 10Nicolas Fraison)
[15:56:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192922 (owner: 10Kosta Harlan)
[15:56:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192924 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[15:56:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192925 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[15:56:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192923 (owner: 10Kosta Harlan)
[15:56:55] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[15:57:07] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[16:04:27] <wikibugs>	 (03PS1) 10Federico Ceratto: site.pp: Configure es2051 mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/1192927 (https://phabricator.wikimedia.org/T402859)
[16:04:58] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs1001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:06:45] <wikibugs>	 (03Merged) 10jenkins-bot: SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter [extensions/ConfirmEdit] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192922 (owner: 10Kosta Harlan)
[16:07:44] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[16:07:52] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[16:09:45] <wikibugs>	 (03Merged) 10jenkins-bot: SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter [extensions/ConfirmEdit] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192923 (owner: 10Kosta Harlan)
[16:09:46] <wikibugs>	 (03Merged) 10jenkins-bot: CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192925 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[16:09:47] <wikibugs>	 (03Merged) 10jenkins-bot: CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA [extensions/WikimediaEvents] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192924 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[16:10:26] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1192922|SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter]], [[gerrit:1192924|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192925|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192923|SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter]]
[16:10:31] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[16:11:09] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on alert1002 is OK: (C)0 le (W)100 le 134.8 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[16:11:21] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on alert1002 is OK: (C)0 le (W)100 le 130.3 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[16:14:46] <wikibugs>	 (03PS3) 10Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898
[16:15:12] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet']
[16:15:58] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:16:03] <jinxer-wm>	 FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:17:07] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1192922|SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter]], [[gerrit:1192924|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192925|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192923|SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:17:11] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[16:19:11] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[16:19:38] <logmsgbot>	 elukey@cumin2002 upgrade-firmware (PID 503788) is awaiting input
[16:20:58] <jinxer-wm>	 FIRING: [12x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:21:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 (owner: 10Elukey)
[16:21:45] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] varnish: Enable unified mobile routing on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1192265 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[16:22:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1792406616 and 101 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:23:35] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192922|SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter]], [[gerrit:1192924|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192925|CreateAccountInstrumentationPreAuthenticationProvider: Don't create event if user can skip CAPTCHA (T405239)]], [[gerrit:1192923|SimpleCaptcha::canSkipCaptcha: Remove unneeded Config parameter]] (duration: 13m 08s)
[16:23:39] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[16:23:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[16:24:58] <jinxer-wm>	 RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs1001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:25:58] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:26:03] <jinxer-wm>	 FIRING: [12x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:27:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 15000 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:30:58] <jinxer-wm>	 RESOLVED: [12x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:31:08] <wikibugs>	 (03PS1) 10JHathaway: acme_chief: delete unused files on passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/1192934 (https://phabricator.wikimedia.org/T401858)
[16:31:56] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] "Tests are happy" [puppet] - 10https://gerrit.wikimedia.org/r/1192265 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[16:33:02] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[16:34:42] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2048.codfw.wmnet']
[16:35:55] <wikibugs>	 (03PS4) 10Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898
[16:39:06] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[16:42:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 (owner: 10Elukey)
[16:45:14] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11234044 (10Dzahn) 05In progress→03Stalled stalled on manager approval.
[16:46:15] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11234086 (10Dzahn) 05In progress→03Stalled stalled on manager approval
[16:49:46] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] vo-escalate: absent timer [puppet] - 10https://gerrit.wikimedia.org/r/1192610 (owner: 10Herron)
[16:51:53] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11234167 (10Dzahn) @Maria_Lechner_WMDE Please send an email to Katie Francis of Legal (https://meta.wikimedia.org/wiki/User:KFrancis_(WMF)) to get the NDA signing process started.  Once...
[16:52:33] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11234168 (10Dzahn)
[16:53:20] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11234182 (10Dzahn) (this should be like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191507/4/modules/admin/data/data.yaml)
[16:55:15] <hnowlan>	 !incidents
[16:55:15] <sirenbot>	 6818 (ACKED)  Manual (paged) by LSobanski (lsobanski@wikimedia.org): Test page
[16:55:15] <sirenbot>	 6815 (RESOLVED)  [25x] ProbeDown sre (ip4 probes/service eqiad)
[16:55:16] <sirenbot>	 6817 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[16:55:16] <sirenbot>	 6814 (RESOLVED)  wmf - metamonitoring - prometheus - notified - vip is now DOWN
[16:55:16] <sirenbot>	 6816 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[16:55:16] <sirenbot>	 6813 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[16:55:16] <sirenbot>	 6810 (RESOLVED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[16:55:17] <sirenbot>	 6812 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[16:55:17] <sirenbot>	 6811 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[16:55:18] <sirenbot>	 6807 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[16:55:29] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11234186 (10Dzahn) a:03thcipriani Hi Tyler, there is a request for the "restricted" group here. They want to run maintenance scripts on the deployment server.  Details at T405796#11221398
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1700)
[17:05:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11234223 (10thcipriani) >>! In T405796#11234186, @Dzahn wrote: > Hi Tyler, there is a request for the "restricted" group here. They want to run maintenance scripts on the deployment server...
[17:07:10] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11234225 (10Dzahn) @RobH What's your preferred way to schedule this? Want to let me know which slots work for you? Or should we  just suggest something vi...
[17:08:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:10:20] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11234242 (10Dzahn) Thank you, Tyler.    @FCeratto-WMF This can continue with the "verify SSH key out of band" check box.  I am not 100% sure if we also need approval from Nasma Ahmed in th...
[17:10:30] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11234243 (10Dzahn) a:05thcipriani→03None
[17:11:19] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11234244 (10RobH) So we won't be ready to move forward on this until after Oct 15th (our deadline for installing the new switches) but afterwards.  If you...
[17:19:36] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11234267 (10Krinkle)
[17:24:51] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11234302 (10Dzahn) @RobH We can indeed do the gitlab-runners first and separate them. Let's do that.  I am suggesting October 16th, 15:00 - 15:30 UTC, con...
[17:29:04] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Enable unified mobile routing on idwiki, frwiki, dewiki [puppet] - 10https://gerrit.wikimedia.org/r/1192266 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[17:31:56] <wikibugs>	 (03PS1) 10Ssingh: team-sre: cdn: add wdqs-main.discovery.wmnet to ignored backends [alerts] - 10https://gerrit.wikimedia.org/r/1192940 (https://phabricator.wikimedia.org/T406141)
[17:35:00] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] zuul: let zuul-scheduler also reach zookeeper outside container [puppet] - 10https://gerrit.wikimedia.org/r/1192615 (https://phabricator.wikimedia.org/T405118) (owner: 10Dzahn)
[17:39:18] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11234421 (10RobH) I am unable to add folks to the gcal entry, but can you add both John and Valerie to the gcal event so they are aware of the window?  Jo...
[17:40:12] <wikibugs>	 (03PS2) 10Ssingh: team-sre: cdn: ignore wdqs-main.discovery.wmnet in ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1192940 (https://phabricator.wikimedia.org/T406141)
[17:44:52] <jinxer-wm>	 FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166 (10ssingh) 03NEW
[17:45:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166#11234464 (10ssingh)
[17:46:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406167 (10ssingh) 03NEW
[17:47:33] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: codfw: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406167#11234481 (10ssingh)
[17:48:07] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166#11234487 (10ssingh)
[17:48:25] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: codfw: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406167#11234491 (10ssingh)
[18:00:05] <jouncebot>	 hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T1800)
[18:00:09] <brennen>	 o/
[18:01:55] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:06:40] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192947 (https://phabricator.wikimedia.org/T405677)
[18:06:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192947 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot)
[18:07:31] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192947 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot)
[18:07:36] <wikibugs>	 (03CR) 10Ottomata: "Thank you! <3" [puppet] - 10https://gerrit.wikimedia.org/r/1192914 (https://phabricator.wikimedia.org/T304373) (owner: 10Cwhite)
[18:09:01] <kostajh>	 brennen: can you please let me know when you're done, as I'd like to deploy a config patch before the UTC late window, if possible 
[18:09:39] <brennen>	 kostajh: will do.
[18:10:15] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11234595 (10DSantamaria) Approved!
[18:14:52] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[18:18:22] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.21  refs T405677
[18:18:29] <stashbot>	 T405677: 1.45.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T405677
[18:21:16] <wikibugs>	 (03PS1) 10Ottomata: EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192950 (https://phabricator.wikimedia.org/T304373)
[18:21:48] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Superset for marialechnerwmde - https://phabricator.wikimedia.org/T405917#11234616 (10WMDE-leszek) >  But they are saying no ssh needed.. so we can safely assume they mean the lowest of the 3 levels.  This is indeed what we want here....
[18:22:42] <brennen>	 kostajh: let's give it a couple of minutes to settle and then i'd say go ahead w/your config deploy.
[18:22:48] <wikibugs>	 (03PS2) 10Ottomata: EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192950 (https://phabricator.wikimedia.org/T304373)
[18:23:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:24:52] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:31:14] <brennen>	 kostajh: things looking ok from my end, all yours.
[18:36:27] <wikibugs>	 06SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11234672 (10Novem_Linguae) I added "level 1", "level 2", "level 3" to the doc page at https://wikitech.wikimedia.org/w/index.php?title=Data_Platform/Data_access&diff=prev&oldid=2347634. Let's see if...
[18:42:08] <kostajh>	 brennen: thanks!
[18:45:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[18:46:11] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Enable A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1190992 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan)
[18:46:45] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1190992|hCaptcha: Enable A/B test for frwiki (T405239)]]
[18:46:52] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[18:53:03] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1190992|hCaptcha: Enable A/B test for frwiki (T405239)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:53:10] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[18:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:58:35] <wikibugs>	 (03CR) 10MusikAnimal: "One thing I forgot to mention at T402967 is we need to make bureaucrats and Community Wishlist managers themselves capable of assigning `c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling)
[18:59:03] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:01:29] <wikibugs>	 (03CR) 10MusikAnimal: "Also, maybe we should *remove* `manage-wishlist` from the `sysop` group (via configuration)? That was put there as it's a sensible thing t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling)
[19:06:48] <kostajh>	 still testing 
[19:08:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:08:48] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[19:09:15] <wikibugs>	 (03PS1) 10Scott French: mw-*: Tune 8.3 releases to prevent deployment timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192954 (https://phabricator.wikimedia.org/T405955)
[19:09:49] <wikibugs>	 (03PS6) 10Krinkle: varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510)
[19:10:31] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:10:37] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:10:45] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:10:53] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:10:58] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:11:04] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:11:09] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:13:09] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1190992|hCaptcha: Enable A/B test for frwiki (T405239)]] (duration: 26m 24s)
[19:13:17] <stashbot>	 T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239
[19:14:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 20.35% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:17:43] <wikibugs>	 (03CR) 10Jon Harald Søby: [C:04-1] "According to the bug, they want to change the portal talk namespace from "Portal vaten" to "Werênayışê portali". That change still needs t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa)
[19:19:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:21:22] <wikibugs>	 (03PS1) 10Scott French: P:conftool::hiddenparma: enable known_client_expression_validation [puppet] - 10https://gerrit.wikimedia.org/r/1192620 (https://phabricator.wikimedia.org/T403220)
[19:25:20] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11234810 (10Dzahn) 05Stalled→03In progress
[19:26:00] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11234811 (10Dzahn) a:05DSantamaria→03None Thanks. Checking the approval box and setting to "in progress":)
[19:26:12] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11234815 (10Dzahn)
[19:28:07] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11234823 (10Dzahn) Done! Added both in gcal just now.
[19:28:55] <wikibugs>	 (03PS1) 10Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192956 (https://phabricator.wikimedia.org/T400727)
[19:30:27] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Enable unified mobile routing on eswiki, ruwiki, jawiki [puppet] - 10https://gerrit.wikimedia.org/r/1192268 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle)
[19:30:37] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Check list of PXE miss-configs for codfw - https://phabricator.wikimedia.org/T401442#11234832 (10Jhancock.wm) 05Open→03Resolved
[19:32:18] <wikibugs>	 (03PS1) 10Herron: vopsbot: switch rotation for 247 oncall [puppet] - 10https://gerrit.wikimedia.org/r/1192957
[19:38:19] <wikibugs>	 (03PS5) 10Krinkle: varnish: Enable unified mobile routing on all except en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192271 (https://phabricator.wikimedia.org/T403510)
[19:38:25] <wikibugs>	 (03PS4) 10Krinkle: varnish: Enable unified mobile routing on en.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1192272 (https://phabricator.wikimedia.org/T403510)
[19:39:52] <jinxer-wm>	 FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:41:23] <wikibugs>	 (03CR) 10Scott French: [C:03+1] Add traindev-staging environment for mw-web and mw-debug (2 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) (owner: 10Ahmon Dancy)
[19:41:40] <jinxer-wm>	 FIRING: [28x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:43:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11234873 (10Jclark-ctr) @Ladsgroup   Drive sdb will have to be added to md0 https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions    ` jclark@dbproxy1024:~$ cat /proc/mdstat...
[19:44:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11234877 (10Jclark-ctr) @Ladsgroup   Drive sdb will have to be added to md0 https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions    ` jclark@dbproxy1024:~$ cat /proc/mdstat...
[19:49:56] <mutante>	 !log cloud
[19:50:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:09] <mutante>	 I did not mean to do that :)
[19:58:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11234914 (10Jclark-ctr) drive listed as online in  idrac and part of raid 10
[19:58:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1152 - https://phabricator.wikimedia.org/T406063#11234915 (10Jclark-ctr) 05Open→03Resolved
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T2000).
[20:00:05] <jouncebot>	 danisztls, xSavitar, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:22] <xSavitar>	 o/
[20:01:17] <xSavitar>	 I'll self-deploy my patches today
[20:02:27] <xSavitar>	 Does danisztls happen to be around? :)
[20:04:20] <xSavitar>	 I'll go ahead and when they come online, they can deploy right after me.
[20:05:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192884 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[20:05:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192885 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[20:12:21] <wikibugs>	 06SRE, 10DNS, 06Traffic, 10wikimediafoundation.org, 07IPv6: wikimediafoundation.org does not support IPv6 - https://phabricator.wikimedia.org/T403269#11234952 (10BCornwall) 05Open→03In progress p:05Triage→03Low a:03BCornwall I've contacted @Varnent to try and get the right IPv6 address.
[20:13:45] <wikibugs>	 (03PS3) 10Cappybaraa: diqwiki: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207)
[20:16:47] <wikibugs>	 (03PS4) 10Cappybaraa: diqwiki: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207)
[20:18:49] <wikibugs>	 (03Merged) 10jenkins-bot: session: Handle an edge-case in MultiBackendSessionStore::set() [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1192884 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[20:19:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:20:32] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11234974 (10Reedy) Yeah... https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ is stil...
[20:20:53] <wikibugs>	 (03Merged) 10jenkins-bot: session: Handle an edge-case in MultiBackendSessionStore::set() [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192885 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01)
[20:21:31] <logmsgbot>	 !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1192884|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]], [[gerrit:1192885|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]]
[20:21:37] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[20:27:48] <logmsgbot>	 !log derick@deploy2002 derick, d3r1ck01: Backport for [[gerrit:1192884|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]], [[gerrit:1192885|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:28:07] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[20:28:29] * xSavitar testing...
[20:29:56] <xSavitar>	 All looks good.
[20:30:07] <logmsgbot>	 !log derick@deploy2002 derick, d3r1ck01: Continuing with sync
[20:31:15] <arlolra_>	 I have an UBN I'd like to deploy at the end of the window, fyi
[20:31:39] <danisztls>	 sorry, I'm late
[20:32:00] <danisztls>	 I will deploy mine after yall
[20:32:56] <xSavitar>	 danisztls, sure! Will signal you once I'm done.
[20:33:05] <xSavitar>	 arlolra_ Ack!
[20:33:37] <xSavitar>	 danisztls, almost done syncing backports then I'll deploy a config patch next (which should be faster I hope)
[20:33:50] <wikibugs>	 (03PS1) 10Arlolra: Revert "Add parsoid support in ProofreadPage extension" [extensions/ProofreadPage] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192971
[20:34:28] <logmsgbot>	 !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192884|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]], [[gerrit:1192885|session: Handle an edge-case in MultiBackendSessionStore::set() (T402808)]] (duration: 12m 57s)
[20:34:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/ProofreadPage] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192971 (owner: 10Arlolra)
[20:34:34] <stashbot>	 T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808
[20:34:56] <danisztls>	 xSavitar: thanks! I just found out that the team want to postpone the deployment though. 
[20:35:13] <xSavitar>	 danisztls: Okay
[20:35:37] <wikibugs>	 (03CR) 10C. Scott Ananian: [C:03+1] Revert "Add parsoid support in ProofreadPage extension" [extensions/ProofreadPage] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192971 (owner: 10Arlolra)
[20:36:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192632 (owner: 10D3r1ck01)
[20:37:54] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192632 (owner: 10D3r1ck01)
[20:38:27] <logmsgbot>	 !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1192632|Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis"]]
[20:43:31] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Replace old ingestion wiki list file with new autoupdated file [puppet] - 10https://gerrit.wikimedia.org/r/1191750 (owner: 10Snwachukwu)
[20:44:10] <wikibugs>	 (03CR) 10LWatson: "Are you referring specifically to the "MediaWiki train"? I ask because there's a Web team deployment included in the schedule https://wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner)
[20:44:51] <logmsgbot>	 !log derick@deploy2002 d3r1ck01, derick: Backport for [[gerrit:1192632|Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:44:55] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:45:22] * xSavitar testing...
[20:46:25] <jinxer-wm>	 FIRING: [29x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:46:45] <xSavitar>	 All seems fine. Syncing
[20:46:53] <logmsgbot>	 !log derick@deploy2002 d3r1ck01, derick: Continuing with sync
[20:46:54] <wikibugs>	 (03CR) 10LWatson: "Is there another way to verify how many trains have passed that the extension was included in like a phab tag?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner)
[20:47:18] <xSavitar>	 We will monitor Grafana and logstash shortly after just in case. Cc tgr_ 
[20:47:30] <wikibugs>	 (03PS1) 10MusikAnimal: metawiki: Configure permissions for CommunityRequests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192972 (https://phabricator.wikimedia.org/T402967)
[20:48:24] <wikibugs>	 (03CR) 10MusikAnimal: "I've submitted I2d282523aab1 for the above." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling)
[20:48:33] <wikibugs>	 (03CR) 10LWatson: "Disregard, I see a note about this in the task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner)
[20:49:09] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-09-24-083919 to 2025-09-30-194529 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192973 (https://phabricator.wikimedia.org/T378558)
[20:49:25] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-09-24-180530 to 2025-09-25-181720 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192974 (https://phabricator.wikimedia.org/T378558)
[20:51:13] <logmsgbot>	 !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192632|Revert^2 "session: Enable MultiBackendSessionStore on `group1` wikis"]] (duration: 12m 46s)
[20:52:32] <xSavitar>	 arlolra_, do you want to take over?
[20:52:36] <xSavitar>	 I'm done deploying
[20:52:40] <arlolra_>	 sure, thanks
[20:52:45] <xSavitar>	 yw!
[20:53:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [extensions/ProofreadPage] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192971 (owner: 10Arlolra)
[20:54:43] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Add parsoid support in ProofreadPage extension" [extensions/ProofreadPage] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1192971 (owner: 10Arlolra)
[20:55:17] <logmsgbot>	 !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1192971|Revert "Add parsoid support in ProofreadPage extension"]]
[20:57:04] <TimStarling>	 anyone else in this deployment queue after arlolra_ ?
[20:58:41] <xSavitar>	 TimStarling, no afaik.
[20:59:04] <xSavitar>	 danisztls mentioned they've postponed their deply.
[20:59:19] <xSavitar>	 s/deply/deployment
[20:59:35] <logmsgbot>	 !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1192971|Revert "Add parsoid support in ProofreadPage extension"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T2100)
[21:00:47] <logmsgbot>	 !log arlolra@deploy2002 arlolra: Continuing with sync
[21:01:08] <James_F>	 TimStarling: We're deploying in our window, but at first just services.
[21:01:27] <wikibugs>	 (03CR) 10Tim Starling: [C:03+2] Configure CommunityRequests virtual domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192669 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling)
[21:02:06] <TimStarling>	 I just need that config change, I'm merging it to be on the safe side, pretty sure to go out if I merge it
[21:02:30] <James_F>	 Oh, sure, no rush at our end, as long as we can deploy in ~ 30 mins. :-)
[21:02:53] <wikibugs>	 (03Merged) 10jenkins-bot: Configure CommunityRequests virtual domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192669 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling)
[21:03:28] <wikibugs>	 (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2025-09-24-083919 to 2025-09-30-194529 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192973 (https://phabricator.wikimedia.org/T378558) (owner: 10Jforrester)
[21:04:39] <TimStarling>	 OK so there was a backport window just ending? I have the "WMF Deployments" google calendar but I guess it's out of date
[21:05:04] <logmsgbot>	 !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192971|Revert "Add parsoid support in ProofreadPage extension"]] (duration: 09m 47s)
[21:05:13] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-09-24-083919 to 2025-09-30-194529 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192973 (https://phabricator.wikimedia.org/T378558) (owner: 10Jforrester)
[21:05:13] <James_F>	 TimStarling: Yeah, the https://wikitech.wikimedia.org/wiki/Deployments page is the source of truth. I didn't know there was a GCal form of it.
[21:05:20] <James_F>	 It sounds like it's out of date.
[21:06:01] <wikibugs>	 (03PS1) 10Bking: dse-k8s-eqiad: explicitly set quotas for opensearch-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192978 (https://phabricator.wikimedia.org/T397246)
[21:06:08] <TimStarling>	 the Google Calendar is linked from that page, in the grey box
[21:06:30] <James_F>	 Huh, you're right. Banner blindness is amazing.
[21:06:31] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:07:29] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:07:52] <logmsgbot>	 !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1192669|Configure CommunityRequests virtual domain (T402967)]]
[21:07:59] <stashbot>	 T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[21:09:02] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[21:09:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:10:06] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[21:10:35] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1192669|Configure CommunityRequests virtual domain (T402967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:10:44] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[21:11:01] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Continuing with sync
[21:11:40] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[21:12:28] <wikibugs>	 (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-09-24-180530 to 2025-09-25-181720 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192974 (https://phabricator.wikimedia.org/T378558) (owner: 10Jforrester)
[21:14:19] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-09-24-180530 to 2025-09-25-181720 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192974 (https://phabricator.wikimedia.org/T378558) (owner: 10Jforrester)
[21:15:18] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:15:28] <logmsgbot>	 !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192669|Configure CommunityRequests virtual domain (T402967)]] (duration: 07m 36s)
[21:15:34] <stashbot>	 T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[21:15:46] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:16:42] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[21:17:18] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[21:17:42] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[21:18:17] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[21:20:36] <wikibugs>	 (03PS7) 10Jforrester: Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172048 (https://phabricator.wikimedia.org/T397401)
[21:20:49] <James_F>	 TimStarling: All done? I can wait.
[21:24:01] <TimStarling>	 yes all done for now
[21:24:05] <James_F>	 Thanks!
[21:24:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172048 (https://phabricator.wikimedia.org/T397401) (owner: 10Jforrester)
[21:25:09] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172048 (https://phabricator.wikimedia.org/T397401) (owner: 10Jforrester)
[21:25:41] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1172048|Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator (T397401 T401682)]]
[21:25:49] <stashbot>	 T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. - https://phabricator.wikimedia.org/T397401
[21:25:50] <stashbot>	 T401682: Wikimania in-person request: Enable Wikifunctions client mode on the Wikimedia Incubator - https://phabricator.wikimedia.org/T401682
[21:30:00] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1172048|Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator (T397401 T401682)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:30:49] <wikibugs>	 (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: explicitly set quotas for opensearch-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192978 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[21:30:49] <wikibugs>	 (03CR) 10LWatson: [C:03+1] "I reviewed the code and verified that two deployment trains have passed: `wmf/1.45.0-wmf.20` and `mw1.45.0-wmf.21`." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner)
[21:31:02] <wikibugs>	 (03CR) 10Bking: [C:03+2] "self-merging in the interest of time" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192978 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[21:31:03] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[21:32:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:34:54] <wikibugs>	 (03CR) 10LWatson: Deploy ReaderExperiments to Beta cluster (1 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner)
[21:35:20] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1172048|Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator (T397401 T401682)]] (duration: 09m 39s)
[21:35:28] <stashbot>	 T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. - https://phabricator.wikimedia.org/T397401
[21:35:29] <stashbot>	 T401682: Wikimania in-person request: Enable Wikifunctions client mode on the Wikimedia Incubator - https://phabricator.wikimedia.org/T401682
[21:36:58] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[21:37:34] <wikibugs>	 (03CR) 10LWatson: [C:03+1] Enable ReaderExperiments on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189293 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner)
[21:37:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:38:06] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "Confirmed that the ReaderExperiments repo is cloned and live in wmf.20 and wmf.21 in production, so this is safe to merge as-is now and wo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner)
[21:38:22] <wikibugs>	 (03Merged) 10jenkins-bot: dse-k8s-eqiad: explicitly set quotas for opensearch-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192978 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[21:39:28] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[21:40:06] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[21:40:13] <wikibugs>	 (03CR) 10LWatson: [C:03+1] "Looks good based on the example given https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment#Deploy_to_Beta_Cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner)
[21:40:37] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[21:40:41] <wikibugs>	 (03CR) 10LWatson: [C:03+1] Load ReaderExperiments extension in CommonSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189294 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner)
[21:44:52] <jinxer-wm>	 FIRING: [28x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:44:55] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:46:22] <wikibugs>	 (03PS3) 10MusikAnimal: Enable CommunityRequests on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling)
[21:46:25] <jinxer-wm>	 FIRING: [29x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:48:13] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] site.pp: Configure es2051 mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/1192927 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[21:48:18] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] site.pp: Configure es2051 mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/1192927 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto)
[21:49:12] <wikibugs>	 (03CR) 10Bvibber: Deploy ReaderExperiments to Beta cluster (1 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: 10Eric Gardner)
[21:53:38] <wikibugs>	 (03PS2) 10MusikAnimal: metawiki: Configure permissions for CommunityRequests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192972 (https://phabricator.wikimedia.org/T402967)
[21:55:42] <bvibber>	 Everything look clear for a small config deploy? Got setup ReaderExperiments on Beta ready to roll :D
[21:55:49] <bvibber>	 I can spiderpig it up :D
[21:56:05] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1018.eqiad.wmnet with OS bullseye
[21:56:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling)
[21:56:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192972 (https://phabricator.wikimedia.org/T402967) (owner: 10MusikAnimal)
[21:56:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2017.codfw.wmnet with OS bullseye
[21:56:57] <wikibugs>	 (03Merged) 10jenkins-bot: Enable CommunityRequests on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192663 (https://phabricator.wikimedia.org/T402967) (owner: 10Tim Starling)
[21:57:02] <wikibugs>	 (03Merged) 10jenkins-bot: metawiki: Configure permissions for CommunityRequests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192972 (https://phabricator.wikimedia.org/T402967) (owner: 10MusikAnimal)
[21:57:38] <logmsgbot>	 !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1192663|Enable CommunityRequests on metawiki (T402967)]], [[gerrit:1192972|metawiki: Configure permissions for CommunityRequests (T402967)]]
[21:57:44] <stashbot>	 T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251001T2200)
[22:01:55] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:02:50] <logmsgbot>	 !log tstarling@deploy2002 musikanimal, tstarling: Backport for [[gerrit:1192663|Enable CommunityRequests on metawiki (T402967)]], [[gerrit:1192972|metawiki: Configure permissions for CommunityRequests (T402967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:02:57] <stashbot>	 T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[22:04:00] <logmsgbot>	 !log tstarling@deploy2002 musikanimal, tstarling: Continuing with sync
[22:08:21] <logmsgbot>	 !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192663|Enable CommunityRequests on metawiki (T402967)]], [[gerrit:1192972|metawiki: Configure permissions for CommunityRequests (T402967)]] (duration: 10m 42s)
[22:08:27] <stashbot>	 T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[22:10:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: Eric Gardner)
[22:10:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: Eric Gardner)
[22:10:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1189293 (https://phabricator.wikimedia.org/T404398) (owner: Eric Gardner)
[22:10:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1189294 (https://phabricator.wikimedia.org/T404398) (owner: Eric Gardner)
[22:11:17] <wikibugs>	 (Merged) jenkins-bot: Add ReaderExperiments extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1189281 (https://phabricator.wikimedia.org/T404398) (owner: Eric Gardner)
[22:11:22] <wikibugs>	 (Merged) jenkins-bot: Deploy ReaderExperiments to Beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: Eric Gardner)
[22:11:25] <wikibugs>	 (Merged) jenkins-bot: Enable ReaderExperiments on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1189293 (https://phabricator.wikimedia.org/T404398) (owner: Eric Gardner)
[22:11:28] <wikibugs>	 (Merged) jenkins-bot: Load ReaderExperiments extension in CommonSettings-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1189294 (https://phabricator.wikimedia.org/T404398) (owner: Eric Gardner)
[22:12:02] <logmsgbot>	 !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1189281|Add ReaderExperiments extension (T404398)]], [[gerrit:1189288|Deploy ReaderExperiments to Beta cluster (T404398)]], [[gerrit:1189293|Enable ReaderExperiments on Beta (T404398)]], [[gerrit:1189294|Load ReaderExperiments extension in CommonSettings-labs.php (T404398)]]
[22:12:08] <stashbot>	 T404398: Image Browsing: Deploy the prototype to Beta - https://phabricator.wikimedia.org/T404398
[22:13:06] <TimStarling>	 !log migrating wishes to CommunityRequests with migrateFromGadget.php
[22:13:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:52] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[22:24:52] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:38:07] <icinga-wm>	 PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[22:39:42] <logmsgbot>	 !log bvibber@deploy2002 egardner, bvibber: Backport for [[gerrit:1189281|Add ReaderExperiments extension (T404398)]], [[gerrit:1189288|Deploy ReaderExperiments to Beta cluster (T404398)]], [[gerrit:1189293|Enable ReaderExperiments on Beta (T404398)]], [[gerrit:1189294|Load ReaderExperiments extension in CommonSettings-labs.php (T404398)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes
[22:39:42] <logmsgbot>	 can now be verified there.
[22:39:48] <stashbot>	 T404398: Image Browsing: Deploy the prototype to Beta - https://phabricator.wikimedia.org/T404398
[22:40:03] <logmsgbot>	 !log bvibber@deploy2002 egardner, bvibber: Continuing with sync
[22:43:33] <wikibugs>	 (CR) LWatson: [C:+1] Deploy ReaderExperiments to Beta cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1189288 (https://phabricator.wikimedia.org/T404398) (owner: Eric Gardner)
[22:47:48] <logmsgbot>	 ryankemper@cumin2002 reimage (PID 671971) is awaiting input
[22:48:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[22:52:35] <logmsgbot>	 !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1189281|Add ReaderExperiments extension (T404398)]], [[gerrit:1189288|Deploy ReaderExperiments to Beta cluster (T404398)]], [[gerrit:1189293|Enable ReaderExperiments on Beta (T404398)]], [[gerrit:1189294|Load ReaderExperiments extension in CommonSettings-labs.php (T404398)]] (duration: 40m 32s)
[22:52:42] <stashbot>	 T404398: Image Browsing: Deploy the prototype to Beta - https://phabricator.wikimedia.org/T404398
[22:53:36] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[22:54:21] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[22:54:42] <bvibber>	 whew that took a long time. localization cache update :D
[22:57:20] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde and nda for Maria Lechner WMDE - https://phabricator.wikimedia.org/T406106#11235324 (KFrancis) Hi all, may I move forward with processing the NDA?
[22:58:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:59:03] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:06:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:09:19] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde and nda for Maria Lechner WMDE - https://phabricator.wikimedia.org/T406106#11235332 (Dzahn) This ticket is mostly a duplicate of T405917 now. (but don't worry about it too much, not a big deal, it is being handled either way)  What is actually needed here:...
[23:11:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:14:11] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1018.eqiad.wmnet with OS bullseye
[23:15:25] <wikibugs>	 (PS1) Dzahn: wikistats: move backup dir out of git repo path [puppet] - https://gerrit.wikimedia.org/r/1192985 (https://phabricator.wikimedia.org/T401859)
[23:16:20] <wikibugs>	 (CR) Dzahn: [C:+2] wikistats: move backup dir out of git repo path [puppet] - https://gerrit.wikimedia.org/r/1192985 (https://phabricator.wikimedia.org/T401859) (owner: Dzahn)
[23:16:39] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2017.codfw.wmnet with OS bullseye
[23:21:44] <wikibugs>	 (PS1) Dzahn: wikistats: use wmflib::debian_php_version to pick PHP version [puppet] - https://gerrit.wikimedia.org/r/1192986
[23:22:26] <wikibugs>	 (CR) Dzahn: [C:+2] wikistats: use wmflib::debian_php_version to pick PHP version [puppet] - https://gerrit.wikimedia.org/r/1192986 (owner: Dzahn)
[23:24:08] <wikibugs>	 (PS1) Bking: dse-k8s-eqiad: bump up minimum pod resources for opensearch-test ns [deployment-charts] - https://gerrit.wikimedia.org/r/1192987 (https://phabricator.wikimedia.org/T397246)
[23:25:13] <wikibugs>	 (PS2) Bking: dse-k8s-eqiad: bump up minimum pod resources for opensearch-test ns [deployment-charts] - https://gerrit.wikimedia.org/r/1192987 (https://phabricator.wikimedia.org/T397246)
[23:26:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:31:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:33:41] <wikibugs>	 (PS1) Dzahn: wikistats: do not ensure dir that is already used with git::clone [puppet] - https://gerrit.wikimedia.org/r/1192989 (https://phabricator.wikimedia.org/T401859)
[23:34:35] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde and nda for Maria Lechner WMDE - https://phabricator.wikimedia.org/T406106#11235387 (KFrancis) Thank you, @Aklapper.  The agreement has been sent out for signatures.  I'll confirm when it's complete.
[23:36:30] <wikibugs>	 (PS2) Dzahn: wikistats: do not ensure dir that is already used with git::clone [puppet] - https://gerrit.wikimedia.org/r/1192989 (https://phabricator.wikimedia.org/T401859)
[23:36:40] <wikibugs>	 (CR) Dzahn: [C:+2] wikistats: do not ensure dir that is already used with git::clone [puppet] - https://gerrit.wikimedia.org/r/1192989 (https://phabricator.wikimedia.org/T401859) (owner: Dzahn)
[23:38:04] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1192990
[23:38:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1192990 (owner: TrainBranchBot)
[23:38:33] <wikibugs>	 (CR) Pppery: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: Cappybaraa)
[23:39:52] <jinxer-wm>	 FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:53:04] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1192990 (owner: TrainBranchBot)
[23:59:50] <wikibugs>	 (PS1) MusikAnimal: Increase timeout for MessageIndex lock [extensions/Translate] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192992 (https://phabricator.wikimedia.org/T402967)