[08:09:47] 10netops, 06Infrastructure-Foundations: Enable and scrape gNMIc api Prometheus endpoint - https://phabricator.wikimedia.org/T375361 (10ayounsi) 03NEW
[08:39:05] 10netops, 06Infrastructure-Foundations, 06SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166669 (10ayounsi) a:03ayounsi
[08:54:10] 10netops, 06Infrastructure-Foundations, 06SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166734 (10ayounsi) Opened high priority JTAC case 2024-0923-266479 and attached logs/debug output.
[10:31:45] 10netops, 10CFSSL-PKI, 06Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179#10166957 (10ayounsi) 05Open→03Resolved
[13:28:35] 10netops, 06Infrastructure-Foundations: Enable and scrape gNMIc api Prometheus endpoint - https://phabricator.wikimedia.org/T375361#10167428 (10ayounsi) 05Open→03Resolved a:03ayounsi Basic demo dashboard: https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&var-site=All
[13:31:04] 10netops, 06Infrastructure-Foundations, 06SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10167444 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a9eff4bb-15d3-41a4-8dd6-65ccc0663c06) set by ayounsi@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their serv...
[13:38:55] hi Traffic! I was wondering if any of you had thoughts on some of the DNS questions in T344171
[13:38:56] T344171: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171
[13:41:21] cdanis: sorry, I thought you all had reached a resolution
[13:41:31] and hence I didn't address at least the dnsdist bits and the question in general
[13:41:34] happy to look again today
[13:42:26] sukhe: well, it seems everyone is interested in just doing regular delegation of the appropriate PTR zones, and there are still some questions about how to actually serve that on the k8s side
[13:43:10] the mechanism I'm proposing has perhaps 99% availability, which I think is fine
[13:44:42] authdns delegation to that seems good to me
[13:44:58] if there's a ton of subnets to delegate, we can use the template's for-loops to generate them if it makes sense
[13:45:27] (just make sure coredns port 53 is accessible from all our global recursors, at least)
[13:47:08] thanks :)
[14:33:47] 06Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078#10167699 (10Vgutierrez) 05In progress→03Resolved purged 0.24 survived to `Consumer group session timed out (in join-state steady) after 10458 ms without a successful response from the group coordinator...
[14:34:11] \o/
[14:51:05] hello traffic - wanted to get your take on the situation with ulsfo and the switchover this week.
[14:51:05] as part of day 1 (tomorrow) the procedure calls for depooling eqiad, but if ulsfo is still depooled at that point, that would leave codfw as the only cache site up in NA, which seems non-ideal ...
[14:51:05] first off, I wanted to confirm that indeed we should avoid that :)
[14:51:41] swfrench-wmf: we discussed it very briefly and yes, that at least is a concern for me
[14:52:03] sukhe: ack, thank you for confirming
[14:52:15] (ulsfo is failing over to codfw as well)
[14:52:24] exactly, yeah
[14:52:53] so, we have a couple of options as to how to proceed, but it seems like the simplest is:
[14:52:53] 1. if ulsfo is still depooled as of tomorrow, do not depool eqiad
[14:52:53] 2. if ulsfo is pooled, but there is lingering concern that cr3-ulsfo could misbehave again, depool eqiad with the requirement that it's repooled if ulsfo must be depooled
[14:53:44] it seems like, as per https://phabricator.wikimedia.org/T375345#10167444, ulsfo is still likely to be depooled
[14:54:11] yeah, I was just chatting with X.ioNoX about that
[14:54:42] I am fine with repooling ulsfo (maybe even today?) and seeing how it behaves
[14:55:03] out of curiosity, will the safety checks in sre.dns.admin automatically block depooling both? (ulsfo and eqiad)
[14:55:16] swfrench-wmf: no, only if all three US sites are depooled
[14:55:29] plus, you can override them with the --emergency-depool-policy flag
[14:55:30] ah, got it
[14:55:43] --emergency-depool-policy
[14:55:43] If passed, override the depool threshold and ignore all depool safety checks (default: False)
[14:55:51] we discussed possibly blocking the combination of eqiad+codfw down, but even that's a matter of debate
[14:56:00] yeah, I knew there was an override, but wasn't sure what the "design threshold" was
[14:56:17] XioNoX: do you think we can try repooling ulsfo?
[14:56:26] so re: the router itself, my understanding is that we're waiting on instructions from juniper, which will likely involve a reboot
[14:56:28] and seeing what happens (today vs doing it tomorrow before the event)
[14:56:54] (based on discussion with X.ioNoX a short time ago)
[14:57:03] FWIW - I think we can probably operate ok on codfw-only. The problem it introduces is that during the period we have only codfw (in NA), if we have a temporary/emergent reason to depool codfw itself, we have no NA fallbacks.
[14:57:36] sukhe: not now, maybe after a cr3-ulsfo reboot if it behaves correctly, still waiting on JTAC
[14:57:41] although arguably, ulsfo itself isn't a sufficient fallback for the whole region, either.
[14:57:42] did we double-check on egress bandwidth?
[14:57:58] so having ulsfo back online doesn't necessarily fix the situation.
[14:58:13] I'll reboot it anyway tomorrow morning my time even if I don't hear back from JTAC in time. They just assigned a new technician to the case
[14:58:37] on the other hand, since our edge depool of eqiad is a voluntary one for testing, probably if codfw needed to go offline for an emergent reason, we'd repool eqiad first anyways?
[14:58:39] we can also decide to repool ulsfo with only 1 router if we really need it
[14:58:40] XioNoX: ok, silly question, can we do that reboot right now perhaps, and then pool ulsfo and see how it behaves? (I am not sure what the conversation was with JTAC, hence the question)
[14:59:45] sukhe: I have to step away in a few minutes, so no. Most likely JTAC will tell us to reboot it, but I haven't yet in case they want to look at it first
[15:00:29] the router itself crashes every few hours, so after a reboot we will have to watch it before repooling
[15:00:30] all that rambling could maybe be summed up as: it's probably fine to leave ulsfo dead and proceed to depool the eqiad edge and go codfw-only, so long as we're ok that the recourse for any problem requiring a codfw depool is repooling eqiad (and thus that we remain in a state where that's ok)
[15:01:17] (for the front edge in all of the above, of course)
[15:02:16] and I kind of think that's the implicit case in every dc-switch period where we depool one core front edge for a while, regardless of ulsfo state. Maybe we just don't always explicitly think it through.
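For context on the delegation plan discussed above ([13:42]-[13:45]): the authdns template mechanics aren't shown in the log, so here is only an illustrative Python sketch of what a "for-loop over subnets" could generate: one NS delegation per /24 reverse zone, pointing at the in-cluster coredns. The subnets and nameserver names are invented placeholders, not the real ones from T344171.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: generate PTR-zone delegations for k8s pod
subnets. The real authdns templating is not shown in the log; this
only illustrates the "for-loop over subnets" idea. Subnets and NS
names below are invented placeholders."""
import ipaddress

# Invented example pod ranges (IPv4 only for brevity).
POD_SUBNETS = ["10.67.128.0/18", "10.194.0.0/18"]
# Invented names for the in-cluster coredns endpoints; per the chat,
# port 53 on these must be reachable from all global recursors.
CLUSTER_NS = ["ns0.k8s-pods.example.wmnet.", "ns1.k8s-pods.example.wmnet."]

def delegation_records(cidr: str) -> list:
    """Emit one NS record per /24 within the pod range, delegating
    its reverse zone to the cluster's coredns."""
    net = ipaddress.ip_network(cidr)
    records = []
    for sub in net.subnets(new_prefix=24):
        a, b, c, _ = str(sub.network_address).split(".")
        zone = f"{c}.{b}.{a}.in-addr.arpa."
        records.extend(f"{zone} IN NS {ns}" for ns in CLUSTER_NS)
    return records

for cidr in POD_SUBNETS:
    print(f"; delegations for {cidr}")
    print("\n".join(delegation_records(cidr)))
```

Delegation (rather than serving the PTR data from authdns itself) keeps the k8s side authoritative for its own pod churn, at the cost of the roughly 99% availability cdanis mentions for the in-cluster servers.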
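On the sre.dns.admin exchange above: per sukhe, the safety check only blocks when all three US sites would end up depooled, and --emergency-depool-policy bypasses it. A minimal sketch of that rule (not the actual cookbook code) might look like:

```python
#!/usr/bin/env python3
"""Sketch of the depool safety rule described above: refuse a depool
only if it would leave all three US sites (eqiad, codfw, ulsfo)
depooled at once, unless an emergency override is passed. This is an
illustration, not the real sre.dns.admin implementation."""

US_SITES = {"eqiad", "codfw", "ulsfo"}

def check_depool(target: str, already_depooled: set,
                 emergency_depool_policy: bool = False) -> None:
    would_be_down = already_depooled | {target}
    if US_SITES <= would_be_down and not emergency_depool_policy:
        raise RuntimeError(
            f"refusing to depool {target}: all US sites would be down "
            "(pass --emergency-depool-policy to override)")

# Depooling eqiad with only ulsfo already down passes the check...
check_depool("eqiad", {"ulsfo"})
# ...but taking codfw down as well would trip it.
try:
    check_depool("codfw", {"ulsfo", "eqiad"})
except RuntimeError as e:
    print(e)
```

This matches swfrench-wmf's scenario: ulsfo plus eqiad down is allowed by design, which is why the decision has to be made by humans rather than by the tooling.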
[15:02:58] (because there's no way ulsfo by itself takes the whole normal load of ulsfo+eqiad+codfw. probably ok on edge servers, but probably not ok on transport/transit capacity, etc?)
[15:07:03] I guess that is the main concern, if we are OK on the network side? the hardware _should_ be fine
[15:07:20] in which case perhaps we can do the reboot, or just run on one cr*?
[15:08:20] sukhe: running ulsfo on just 1x cr* implies some impairment of the site's normal transit/peering capacity, and I think even at full capacity I'd question whether it can take the whole combined load of the US sites.
[15:08:37] all that is to say again, if this is about ulsfo being a fallback for eqiad+codfw down, I'm not sure that's really a thing anyways.
[15:09:35] it's somewhat checkable though. look at a week's graphs for transit/peering at the 3x US sites, vs capacity.
[15:09:50] (keeping in mind that it won't balance itself perfectly; you need headroom for some transits to load up heavier than others)
[15:10:45] my intuition would be that with ulsfo perfectly healthy, if we depooled eqiad+codfw, something in ulsfo would saturate
[15:15:06] (sorry, meeting)
[15:37:36] thanks, all - so, I was thinking about whether codfw could survive being the sole caching site in NA, in terms of hardware + ingress/egress capacity (rather than ulsfo being a viable fallback for a joint failure of the other two).
[15:37:36] in any case, I think the bottom line is that we'll have at least one site down or at risk of being down again (ulsfo), and one site "electively" down (eqiad).
[15:37:37] if we're fairly confident that codfw can handle the load on its own (but not 100%), we can always depool eqiad as planned on the condition that we defer disruptive maintenance that could put that at risk.
[15:40:49] we should maybe question what we gain by electively depooling a core for a week during these tests. Given the current arrangement, I think the main thing it's testing is exactly that: that our transit/peering works ok with 1/2 core sites in the US.
[15:41:21] arguably, we could (and sometimes organically do anyways) test that async from the switchover process and not make it part of this at all.
[15:41:45] eqiad is also special right now for WME
[15:42:01] well, for lots of things, but I mean for public edge traffic depooling
[15:42:08] yes, but I meant particularly for that
[15:42:14] since they're wanting to scrape originals soon(?)
[15:42:34] oh sorry, when you wrote "WME" my brain somehow translated that to WMCS :)
[15:42:53] https://phabricator.wikimedia.org/T370294
[15:43:37] I'm not sure when they planned on starting
[15:43:50] yeah, netops should maybe confirm that timing, for all I know they already did
[15:45:11] but still: I think in general edge-depooling doesn't really have to strictly follow any schedule related to dc-switchover, and maybe shouldn't be part of this process anyways (although maybe it gets replaced with some other test on its own schedule)
[15:45:16] it's worth discussing anyways.
[15:45:20] I don't see any traffic to upload-lb from the IPs they indicated in the past week, so I'm guessing not yet
[16:07:30] back
[16:08:05] so the conclusion is that we revisit this in the morning?
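bblack's "look at a week's graphs vs capacity, with headroom" suggestion can be approximated numerically. A back-of-the-envelope sketch with invented numbers (the real inputs would be the per-site transit/peering peaks and the surviving site's link capacities):

```python
#!/usr/bin/env python3
"""Back-of-the-envelope headroom check for the question above: can
one site absorb the other US sites' traffic? All numbers here are
invented; the real inputs would come from a week of transit/peering
graphs per site."""

# Invented weekly peak egress per site, in Gbps.
PEAK_GBPS = {"eqiad": 40, "codfw": 35, "ulsfo": 25}
# Invented total transit+peering capacity of the surviving site.
SURVIVOR, CAPACITY_GBPS = "codfw", 120

# Traffic never balances perfectly across links, so demand a margin
# (the "headroom for some transits to load up heavier" point above).
HEADROOM = 0.7  # plan to use at most 70% of nominal capacity

combined_peak = sum(PEAK_GBPS.values())
usable = CAPACITY_GBPS * HEADROOM
print(f"{SURVIVOR} would need {combined_peak} Gbps, "
      f"has ~{usable:.0f} Gbps usable: "
      f"{'OK' if combined_peak <= usable else 'would saturate'}")
```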
[16:09:18] and maybe cr3-ulsfo works fine post-reboot, in which case there are no other issues
[16:21:11] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10168315 (10RobH) Ongoing conversations via email with support, they've moved on to scheduling an onsite. Sent all location details over along with a proposed maint window of October 2nd. (Everyth...
[17:06:55] sukhe: +1 to revisiting tomorrow morning. I'll aim to be around early.
[17:07:13] ok thanks swfrench-wmf. works for us, and so will we
[18:12:31] 06Traffic: Support RFC 8914 [Extended DNS Errors] in Wikimedia DNS - https://phabricator.wikimedia.org/T375200#10168780 (10ssingh) ` ;; EDE: 9 (DNSKEY Missing) ` With Wikimedia DNS above ^ One more: ` ;; TLS session (TLS1.3)-(ECDHE-SECP256R1)-(ECDSA-SECP256R1-SHA256)-(AES-256-GCM) ;; ->>HEADER<<- opcode: QUER...
[18:13:34] 06Traffic: Support RFC 8914 [Extended DNS Errors] in Wikimedia DNS - https://phabricator.wikimedia.org/T375200#10168783 (10ssingh) 05Open→03Resolved a:03ssingh Rolled out everywhere.
[18:31:08] 06Traffic: Support RFC 8914 [Extended DNS Errors] in internal recursors - https://phabricator.wikimedia.org/T375414 (10ssingh) 03NEW
[18:49:19] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418 (10Papaul) 03NEW
[18:49:25] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10168978 (10Papaul) p:05Triage→03Medium
[18:50:15] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419 (10Papaul) 03NEW
[18:50:31] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10168991 (10Papaul) p:05Triage→03Medium
[19:49:38] cdanis: thanks for the WME ingest pointer. confirmed the event date with them (not during that week, so will ask them for later)
[19:50:00] cool
[19:50:05] thanks :)
[19:50:20] ah
[19:50:28] sukhe: actually I have a message from them 1h ago saying they just started?
[19:51:19] not on backfill, just on new uploads I think
[19:51:30] oh? that's not what I heard!
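On the RFC 8914 rollout above: the output ssingh pasted shows the EDE option (e.g. `;; EDE: 9 (DNSKEY Missing)`) arriving in a response. One way to check for EDE programmatically is sketched below, assuming dnspython >= 2.1 (which parses EDE options); the resolver address is an assumed example for the public Wikimedia DNS DoT endpoint, so substitute your own recursor to test the internal case (T375414).

```python
#!/usr/bin/env python3
"""Sketch: query a DoT resolver and print any RFC 8914 Extended DNS
Error options in the response. Assumes dnspython >= 2.1 and that the
address below is the Wikimedia DNS anycast IP (an assumption)."""
import dns.edns
import dns.message
import dns.query
import dns.rcode

RESOLVER_IP = "185.71.138.138"   # assumed: wikimedia-dns.org
TLS_NAME = "wikimedia-dns.org"

# dnssec-failed.org is deliberately mis-signed, so a validating
# resolver should SERVFAIL and, with EDE support, say why.
query = dns.message.make_query("dnssec-failed.org", "A",
                               use_edns=0, want_dnssec=True)
response = dns.query.tls(query, RESOLVER_IP, timeout=5,
                         server_hostname=TLS_NAME)

print("rcode:", dns.rcode.to_text(response.rcode()))
for opt in response.options:
    if opt.otype == dns.edns.OptionType.EDE:
        print(f"EDE: {int(opt.code)} ({opt.text})")
```

From a shell, the equivalent is roughly what ssingh pasted: a kdig query over TLS, with any `;; EDE:` line printed as part of the response.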
[19:52:18] thanks, checking on the channel you added
[20:06:40] FIRING: [2x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[20:11:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[20:21:40] FIRING: [11x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[20:26:40] FIRING: [16x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[20:41:40] FIRING: [10x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[20:46:40] RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[22:20:05] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10169564 (10Papaul)
[22:21:21] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10169567 (10Papaul)
[22:34:35] 06Traffic, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Some sites try and fail to serve favicon.ico - https://phabricator.wikimedia.org/T374997#10169615 (10matmarex) Thanks for investigating, and for the patch. That list is definitely not complete, it's just the top entries I saw in the logs last Tuesda...
[22:37:55] 06Traffic, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Some sites try and fail to serve favicon.ico - https://phabricator.wikimedia.org/T374997#10169618 (10matmarex) > nothing is really broken because of this bug To be clear, our wikis are not even using the broken favicon URLs. If you go to https://do...
[23:51:43] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10169742 (10Papaul)
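Aside on the VarnishHighThreadCount sequence above: the alerts.wikimedia.org links are a UI over Alertmanager, whose standard v2 API can be queried directly for the same data. A sketch, assuming the requests library and a placeholder base URL (not the real internal endpoint):

```python
#!/usr/bin/env python3
"""Sketch: list currently-firing alerts for one alertname via the
Alertmanager v2 API. The base URL is a placeholder; the filter
syntax is standard Alertmanager label matching."""
import requests

ALERTMANAGER = "https://alertmanager.example.org"  # placeholder

resp = requests.get(
    f"{ALERTMANAGER}/api/v2/alerts",
    params={"filter": 'alertname="VarnishHighThreadCount"',
            "active": "true"},
    timeout=10,
)
resp.raise_for_status()
for alert in resp.json():
    labels = alert["labels"]
    print(labels.get("instance", "?"), alert["status"]["state"])
```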