[06:39:37] vgutierrez: o/ yes it turned out to be an issue between the docker registry and swift, requests were being rate-limited and IIUC the swift client silently tried to retry/backoff increasing latency
[06:39:46] of course zero logging related to that
[08:52:39] Traffic, MW-Interfaces-Team, RESTBase Sunsetting, serviceops: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints - https://phabricator.wikimedia.org/T374683#10227914 (akosiaris) Hi, I just had enough time to review this. This can't be implemente...
[08:58:55] netops, Infrastructure-Foundations, SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10227934 (cmooney) Open→Resolved Closing this one, things have been ok since upgrade/reset.
[09:13:07] netops, Ceph, Infrastructure-Foundations, Data-Platform-SRE (2024.09.28 - 2024.10.18): cephosd advertised v6 prefix flapping - https://phabricator.wikimedia.org/T376697#10227973 (BTullis) a:BTullis
[09:26:15] elukey: sounds like some fun time debugging ;P
[09:28:11] netops, Ceph, Infrastructure-Foundations, Data-Platform-SRE (2024.09.28 - 2024.10.18): cephosd advertised v6 prefix flapping - https://phabricator.wikimedia.org/T376697#10228035 (cmooney) p:Medium→Low
[09:31:18] vgutierrez: definitely, you can imagine my mental sanity after that :D
[09:31:38] that's one of my features.. I don't have any sanity to lose anymore
[09:32:00] you are a wise person
[09:59:50] netops, Ceph, Infrastructure-Foundations, Data-Platform-SRE (2024.09.28 - 2024.10.18): cephosd advertised v6 prefix flapping - https://phabricator.wikimedia.org/T376697#10228256 (BTullis) >>! In T376697#10225721, @cmooney wrote: > The envoy service it connects to seems stable for longer though....
[10:15:17] netops, Ceph, Infrastructure-Foundations, Data-Platform-SRE (2024.09.28 - 2024.10.18): cephosd advertised v6 prefix flapping - https://phabricator.wikimedia.org/T376697#10228372 (BTullis) Ah, it looks like `check_rise` and `check_interval` are currently hardcoded in the template: https://github.c...
[10:35:36] netops, Ceph, Infrastructure-Foundations, Data-Platform-SRE (2024.09.28 - 2024.10.18), Patch-For-Review: cephosd advertised v6 prefix flapping - https://phabricator.wikimedia.org/T376697#10228450 (cmooney) >>! In T376697#10228372, @BTullis wrote: > Ah, it looks like `check_rise` and `check_in...
[11:22:06] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, SRE: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10228569 (aborrero) In progress→Resolved I think we can consider this to be completed. We may reopen if required.
[11:52:04] netops, Ceph, Infrastructure-Foundations, Data-Platform-SRE (2024.09.28 - 2024.10.18): cephosd advertised v6 prefix flapping - https://phabricator.wikimedia.org/T376697#10228652 (BTullis) An interesting data point is that in our radosgw logs we have only 200 responses recorded from the `check_htt...
[12:11:53] Traffic, Data-Platform, Data Products (Data Products Sprint 20 🎯), Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review: NEW BUG REPORT - Issues in calculation logic for unique devices tables - https://phabricator.wikimedia.org/T375527#10228684 (Milimetric)
[13:19:31] Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: Authdns: automate reverse DNS zone delegation for k8s pod IP ranges - https://phabricator.wikimedia.org/T376291#10229001 (cmooney) The above patch is my current best-stab at accomplishing this. I won't have a huge amount of time to look...
[13:48:31] Traffic, Commons, MediaWiki-File-management, SRE-swift-storage: Commons' file is inaccessible for some users - https://phabricator.wikimedia.org/T377202#10229170 (Bugreporter)
[13:52:37] netops, Ceph, Infrastructure-Foundations, Data-Platform-SRE (2024.09.28 - 2024.10.18): cephosd advertised v6 prefix flapping - https://phabricator.wikimedia.org/T376697#10229187 (BTullis) Open→Resolved I'll mark this as done, for now. Haven't had any more occurrences since this mornin...
[14:04:19] Traffic, Commons, SRE-swift-storage: Commons' file is inaccessible for some users - https://phabricator.wikimedia.org/T377202#10229245 (Aklapper) Removing #mediawiki-file-management as I doubt that this is a bug in MediaWiki core code. I get a `404` error here (Central Europe).
[14:45:18] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10229563 (Jgreen) Open→In progress p:Triage→Medium
[15:22:40] Traffic, Commons, MediaWiki-File-management, SRE-swift-storage: Commons' file is inaccessible for some users - https://phabricator.wikimedia.org/T377202#10229798 (MatthewVernon) The problem is that this object has been uploaded to codfw OK, but not eqiad; all being equal, this will get picked up...
[15:35:24] FIRING: SystemdUnitFailed: acme-chief-certs-sync.service on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:35:38] uh? :)
[15:36:05] vgutierrez: was affected by ganeti1034 hang
[15:36:09] we force-rebooted the host
[15:36:24] so acmechief1002 keyholder hasn't been rearmed?
[15:36:35] right
[15:36:36] fixing it
[15:37:01] done
[15:37:02] ah, sorry didn't get a keyholder not armed alert
[15:37:03] thx
[15:37:07] thx
[15:37:39] see -operations for more context if you're curious
[15:39:19] my bad, we did get the alert, was missed in the backlog
[15:40:24] RESOLVED: SystemdUnitFailed: acme-chief-certs-sync.service on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:18:31] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10230470 (Papaul)
[17:37:27] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10230551 (Papaul)
[18:26:23] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack pfw3 and old fasw decommission - https://phabricator.wikimedia.org/T377254 (Papaul) NEW
[18:52:29] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack pfw3 and old fasw decommission - https://phabricator.wikimedia.org/T377254#10230870 (Papaul)
[19:00:24] Traffic, Data-Platform, Data Products (Data Products Sprint 20 🎯), Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review: NEW BUG REPORT - Issues in calculation logic for unique devices tables - https://phabricator.wikimedia.org/T375527#10230899 (Mayakp.wiki)
[20:22:31] Traffic, ops-magru, SRE: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10231262 (RobH)
[21:57:42] Wikimedia-Apache-configuration, serviceops, SRE, Wikimedia-Portals, Patch-For-Review: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10231718 (Pppery)
[22:59:11] netops, Infrastructure-Foundations: Arelion IPv6 transit renumbering - https://phabricator.wikimedia.org/T365697#10231895 (Papaul) @ayounsi I double check again that all the old IPV6 were removed. But we have the old IPV6 that were still in cache so here are the 2 commands I used to fix the issue. I...