[01:15:43] FYI, eqiad has been repooled for edge traffic; summary in -sre
[01:18:36] Traffic, DC-Ops, ops-esams, ops-magru, SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10178169 (ssingh) Apologies for the long text that follows but the TL;DR is that we think that issues in `magru` are not confined to just the CPU on the affected hosts bu...
[04:03:18] Traffic, DC-Ops, ops-esams, ops-magru, SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10178217 (wiki_willy) Thanks for providing all the details on this, @ssingh. @RobH - as we chatted about earlier today, we could ask Ascenty to double-check that there a...
[07:02:29] Traffic, Patch-For-Review: Remove RSA certificates and use only ECDSA certificates - https://phabricator.wikimedia.org/T370837#10178279 (Vgutierrez) I just -2ed the gerrit change cause we don't currently have information about which certificate is being used. TLSv1.2 includes the authentication mechanism...
[07:20:36] Traffic: HAproxy misreports the authentication mecanism in TLSv1.3 traffic - https://phabricator.wikimedia.org/T375711 (Vgutierrez) NEW
[07:21:29] Traffic: HAproxy misreports the authentication mecanism in TLSv1.3 traffic - https://phabricator.wikimedia.org/T375711#10178315 (Vgutierrez) p:Triage→High
[07:40:54] Traffic: HAproxy and varnish misreport the authentication mechanism used in TLSv1.3 traffic - https://phabricator.wikimedia.org/T375711#10178365 (Vgutierrez)
[09:34:34] elukey: increasing timeouts in docker_registry_ha was enough?
[09:36:07] vgutierrez: I am still testing the internal endpoint, and had a chat with Janis about why the CDN one was added in the first place. The reason may have been to leverage CDN's caching for docker images (since docker report has to pull a lot of them when inspecting etc..) but it may not make sense right now.
[09:36:24] long term I hope that https://phabricator.wikimedia.org/T375645 may decrease those horrible timings
[09:36:49] but it is not easy, the TL;DR is that we cannot easily purge old stuff from catalog/swift
[09:37:11] so calls to the catalog return everything that was ever added to the registry, even if we cleaned up a lot
[09:40:18] elukey: how often docker-report runs at the moment?
[09:40:56] the small report daily IIRC, the bigger one weekly
[09:41:37] so I'd say that given the TTL cap of 1 day at the CDN, that's hitting a lot of cache misses
[09:42:28] probably yes, this is why we thought that it doesn't make much sense to use the CDN endpoint
[09:42:37] if the test works I'll update the puppet code
[09:43:28] nice, thx
[11:18:47] Traffic, Patch-For-Review: HAproxy and varnish misreport the authentication mechanism used in TLSv1.3 traffic - https://phabricator.wikimedia.org/T375711#10178936 (Vgutierrez) Open→Resolved ` - ReqUnset x-tls-prot: h1 - ReqUnset x-tls-vers: TLSv1.3 - ReqUnset x-tls-ses...
[11:29:49] FIRING: [2x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs4008 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[11:34:49] RESOLVED: [3x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs4008 are failing - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[12:11:07] netops, Infrastructure-Foundations, cloud-services-team (FY2024/2025-Q1-Q2): cloud: edge network suffers downtime if one cloudsw is down - https://phabricator.wikimedia.org/T375259#10179065 (ayounsi) It would be useful to capture more data (eg. packet capture) next time this happens. The ICMP no rout...
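The cache reasoning in the 09:40 to 09:42 exchange above can be sketched as a tiny simulation: with a TTL cap of 1 day, a client that asks for the same object once a day or once a week always arrives after the cached copy has expired. This is only an illustration under stated assumptions; `count_cache_hits`, the request schedules, and the rule that an object aged exactly one TTL counts as expired are all hypothetical and do not come from the CDN or docker-report code.

```python
def count_cache_hits(request_times_h, ttl_h):
    """Count cache hits for a single object in a simple TTL cache.

    request_times_h: sorted request times, in hours.
    ttl_h: time-to-live in hours. A request is a hit only if the object
    was fetched from the origin strictly less than ttl_h hours ago.
    """
    hits = 0
    fetched_at = None  # time of the last origin fetch; None if never cached
    for t in request_times_h:
        if fetched_at is not None and t - fetched_at < ttl_h:
            hits += 1
        else:
            fetched_at = t  # miss: fetch from origin and store again
    return hits


TTL_CAP_H = 24  # the 1-day TTL cap mentioned in the chat

daily = [24 * d for d in range(7)]        # one request per day for a week
weekly = [24 * 7 * w for w in range(4)]   # one request per week for four weeks
hourly = list(range(24))                  # contrast case: one request per hour

daily_hits = count_cache_hits(daily, TTL_CAP_H)
weekly_hits = count_cache_hits(weekly, TTL_CAP_H)
hourly_hits = count_cache_hits(hourly, TTL_CAP_H)

print(daily_hits, weekly_hits, hourly_hits)
```

In this model only the hourly client benefits from the cache. A real daily job with some timing jitter could occasionally land inside the TTL window, so the sketch shows the trend rather than an exact miss rate.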
[12:28:10] netops, Infrastructure-Foundations, cloud-services-team (FY2024/2025-Q1-Q2): cloud: edge network suffers downtime if one cloudsw is down - https://phabricator.wikimedia.org/T375259#10179194 (ayounsi) A few more info thanks to @aborrero on IRC. After 185.15.56.244, the packets towards 185.15.56.57 ar...
[12:47:06] netops, Infrastructure-Foundations, cloud-services-team (FY2024/2025-Q1-Q2): cloud: edge network suffers downtime if one cloudsw is down - https://phabricator.wikimedia.org/T375259#10179269 (ayounsi) Actually... `ssh: connect to host login.toolforge.org port 22: No route to host` is a red herring, S...
[13:03:55] netops, Infrastructure-Foundations, cloud-services-team (FY2024/2025-Q1-Q2): cloud: edge network suffers downtime if one cloudsw is down - https://phabricator.wikimedia.org/T375259#10179344 (aborrero) In case they are useful, keepalived VRRP logs can be seen here: {P69421}
[15:01:05] Traffic, DC-Ops, ops-codfw: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766 (ssingh) NEW
[15:06:48] Traffic, DC-Ops, ops-codfw: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766#10180060 (ssingh) I think this server is out of warranty but I may be mistaken.
[15:49:29] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10180297 (Papaul)
[15:51:04] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10180308 (Papaul)
[15:51:05] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10180305 (Papaul) Open→Resolved This is complete
[15:52:24] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10180309 (Papaul) Open→Resolved This is complete
[15:56:47] Traffic, DC-Ops, ops-codfw, SRE: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766#10180345 (Papaul) @Jhancock.wm can you please clear all the logs on this server and upgrade the BIOS and IDRAC please. tha...
[16:02:28] Traffic, DC-Ops, ops-codfw, SRE: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766#10180393 (Jhancock.wm) a:Jhancock.wm
[16:12:48] bblack: the only other domain missing from the files on disk, some aliases are in secret puppet, is wmfusercontent.org. That domain doesn't have mx records, should it?
[16:13:03] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10180446 (Papaul) setup/configuration of both switches done. Just need to add the switches to monitoring was we have pfw1...
[16:18:01] jhathaway: yeah it probably should, because we do TLS certificates for it, and so they need to be verifiable via postmaster@ and other similar ones.
[16:18:26] not that email is a great way to verify anything, but I think it's kind of the rules of the internet that those addresses should be reachable for canonical domains you care about
[16:18:48] ok, wasn't sure if the CAA record obviated that need
[16:19:03] no, CAA just acts as an additional limiter
[16:19:17] ok, I'll craft a patch
[16:19:32] jhathaway: CAA records all point to dns-admin@wm.org just as an FYI
[16:20:04] but digicert ignores that and just tries (postmaster|hostmaster|admin|...)@yourdomain.org
[16:20:12] there's like 5 of those canonical emails it tries
[16:20:16] got it
[16:20:36] which, after that whois debacle recently
[16:20:43] I mean, email seems way shadier than even whois :P
[16:22:07] at least we have the CT Log system these days, and browser vendors reporting oddballs to the CT system. it helps a lot to make such attacks visible.
[16:43:39] Traffic, DC-Ops, ops-esams, ops-magru, SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10180539 (RobH) Draft body of support request for magru temp investigation: https://docs.google.com/document/d/1T-XwSS_Rwfb9nfC1aHQW4AjptLjxiviyZfGFdFcowZY/edit?usp=shar...
[17:00:37] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10180628 (RobH) Inbound shipment ticket 00980858 for UPS 1Z20506Y0100053206 (already delivered today and got the shipment notice last night). Next step is sc...
[17:02:30] Traffic, DC-Ops, ops-codfw, SRE: cp2037 hardware issues: A fatal error was detected on a component at bus 174 device 0 function 0 - https://phabricator.wikimedia.org/T375766#10180634 (Jhancock.wm) firmware updated and event log cleared.
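For the 16:18 to 16:20 discussion of email-based domain validation above, a minimal sketch of the "canonical" addresses a CA might try. The five local parts are the ones the CA/Browser Forum Baseline Requirements allow for the constructed-email validation method; that DigiCert uses exactly this list for a given order is an assumption, and `validation_addresses` is a hypothetical helper, not part of any real tool.

```python
# Local parts permitted for "constructed email" domain validation in the
# CA/Browser Forum Baseline Requirements. Assumed, not confirmed, to be
# the "like 5" addresses referred to in the chat.
CANONICAL_LOCAL_PARTS = (
    "admin",
    "administrator",
    "webmaster",
    "hostmaster",
    "postmaster",
)


def validation_addresses(domain):
    """Return the constructed validation mailboxes for a domain."""
    # Normalize: drop surrounding whitespace and a trailing root dot.
    domain = domain.strip().rstrip(".").lower()
    return [f"{local}@{domain}" for local in CANONICAL_LOCAL_PARTS]


addresses = validation_addresses("wmfusercontent.org")
for address in addresses:
    print(address)
```

None of these mailboxes can receive mail unless the domain has working mail routing, which is why the missing MX records for wmfusercontent.org came up in the chat.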
[17:03:13] Traffic, WMF-Legal, Patch-For-Review, Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618#10180631 (BCornwall) In progress→Resolved a:BCornwall
[19:15:38] Traffic, DC-Ops, ops-esams, ops-magru, SRE: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10181005 (RobH) Opened ticket CS1011077 for the above updated google doc draft.
[20:14:59] Traffic, SRE, WikimediaDebug: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org - https://phabricator.wikimedia.org/T375795 (Urbanecm_WMF) NEW
[20:20:34] Traffic, SRE, WikimediaDebug: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s - https://phabricator.wikimedia.org/T375795#10181160 (bd808)
[20:21:46] Traffic, SRE, WikimediaDebug: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s - https://phabricator.wikimedia.org/T375795#10181157 (bd808) I don't know if there is a task for this yet, but it is known. The bug here is that we c...
[20:25:03] Traffic, SRE, WikimediaDebug: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s - https://phabricator.wikimedia.org/T375795#10181161 (Urbanecm_WMF) >>! In T375795#10181157, @bd808 wrote: > I don't know if there is a task for this...
[20:30:40] Traffic, SRE, WikimediaDebug: With XWikimediaDebug enabled, wikitech.wikimedia.org gets redirected to foundation.wikimedia.org until Wikitech is on k8s - https://phabricator.wikimedia.org/T375795#10181167 (bd808) >>! In T375795#10181161, @Urbanecm_WMF wrote: > Interesting, good to know. This is fairl...