[05:28:37] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9943050 (Marostegui)
[08:02:48] JFYI I noticed the probes failing for librenms.w.o, debug/fix is in https://phabricator.wikimedia.org/T369008 tl;dr is that apache was sending a mixed rsa/ec intermediate chain
[08:03:07] I don't have time/bandwidth to investigate further other than my "fix"
[08:04:27] thanks, I'll watch the task
[08:04:47] (staring at it hard enough it should fix by itself, isn't it?)
[08:16:12] lol that's what I thought too, didn't work
[08:49:40] hmmm
[08:49:45] how is that chain being built?
[08:50:24] IIRC librenms.wikimedia.org cert is handled by acme-chief
[08:51:54] godog: nowadays auth_mechanism == sso?
[08:52:18] https://www.irccloud.com/pastebin/Mg206dZU/
[08:57:07] godog: sounds a lot like https://community.letsencrypt.org/t/apache-chain-issues-with-dual-rsa-ecdsa-certificates/153960
[08:58:46] godog: could you try using SSLCertificateFile pointing to the .chained.crt file provided by acme-chief?
[09:03:33] I've provided more context on the CR itself
[09:22:32] vgutierrez: thank you for taking a look, I'll try live real quick
[09:22:54] Acme-chief, Traffic, Gerrit, observability: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014 (Vgutierrez) NEW
[09:23:01] Acme-chief, Traffic, Gerrit, observability: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014#9943552 (Vgutierrez) p:Triage→High
[09:23:06] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9943538 (cmooney) So the change to the timeout has made a big difference, but there are still some small gaps: {F56165130} {F5616524...
[09:23:27] vgutierrez: so no SSLCertificateChainFile and SSLCertificateFile with the .chained. ?
[09:23:30] godog: I've found some other offenders checking the puppet repo, so I've filed https://phabricator.wikimedia.org/T369014
[09:23:43] SSLCertificateFile /etc/acmecerts/${'cert_name'}/live/ec-prime256v1.chained.crt
[09:23:43] SSLCertificateFile /etc/acmecerts/${'cert_name'}/live/rsa-2048.chained.crt
[09:23:48] like that
[09:23:52] ack thank you
[09:24:22] indeed that works
[09:24:34] nice :D
[09:24:49] a good riddle is to figure out why this stuff broke 10d ago for librenms, I'm not going to dig into that rabbit hole tho
[09:25:11] godog: side effect of Let's Encrypt intermediate CAs update
[09:27:38] Acme-chief, Traffic, Gerrit, observability: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014#9943571 (Vgutierrez)
[09:27:56] I'll fix the o11y stuff now while we're at it
[09:29:06] Acme-chief, Traffic, Gerrit, observability: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014#9943572 (Vgutierrez)
[09:29:08] godog: thx
[09:34:41] Acme-chief, Traffic, Gerrit, observability, Patch-For-Review: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014#9943584 (Vgutierrez)
[09:44:40] Acme-chief, Traffic, Data-Persistence, Gerrit, and 2 others: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014#9943638 (Vgutierrez)
[09:46:08] Acme-chief, Traffic, Data-Persistence, Gerrit, and 2 others: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014#9943644 (fgiunchedi)
[09:50:07] Acme-chief, Traffic, Data-Persistence, Gerrit, and 3 others: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014#9943653 (Vgutierrez)
[09:57:11] godog: btw..
more context on why we are seeing this issue right now... in early June Let's Encrypt stopped using their old intermediate CA R3, R3 issued both RSA and EC certs... in the new setup they use E6 for EC and R10 for RSA, hence the issue
[09:59:25] vgutierrez: hah! all checks out, thank you for investigating
[11:01:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 208.80.154.240:443 @ cp1111 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=eqiad&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[11:04:19] ^^ it could be me restarting haproxy to upgrade
[11:06:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 208.80.154.240:443 @ cp1111 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=eqiad&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[11:38:45] Hey there traffic team
[11:38:53] I raised a task just now: https://phabricator.wikimedia.org/T369025
[11:39:01] we are maxing out our backhaul from codfw to eqsin it seems
[11:39:25] but it's been busy for the past week or so, and got worse today
[11:39:55] nothing on fire but we probably should rate-limit them, and also perhaps check the geo db records for the blocks as ideally they wouldn't be going to eqsin
[11:55:43] hi topranks tnx
[12:01:12] fabfur: cheers, let me know if I can assist with anything
[12:01:54] so the question here is why do we send this traffic to eqsin?
[12:02:22] well we don't know where their DNS queries are coming from
[12:02:38] they're clearly not coming from the same ranges they are making the HTTP queries from
[12:03:33] I guess the kind of thing Alt-Svc will one day help us with
[12:07:20] Acme-chief, Traffic, Data-Persistence, Gerrit, and 3 others: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014#9944094 (Vgutierrez)
[12:09:33] yeah I think we can't really answer that question, only make guesses. The most important part is to prevent that scraping from saturating our links
[12:13:04] Acme-chief, Traffic, Data-Persistence, Gerrit, and 3 others: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014#9944098 (Marostegui) Orchestrator looking fine after a reload of its apache
[12:59:51] Acme-chief, Traffic, Data-Persistence, Gerrit, and 2 others: Stop using SSLCertificateChainFile on RSA+EC setups - https://phabricator.wikimedia.org/T369014#9944390 (Vgutierrez) Open→Resolved
[13:41:23] vgutierrez: FYI lvs1013-15 are in eqiad rack E1 which we're doing a switch upgrade in later, so there will be a bit of disruption to comms for them
[13:41:35] no problem, thanks for the heads up topranks
[13:58:46] netops, Infrastructure-Foundations, SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9944795 (cmooney)
[14:00:03] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9944811 (cmooney)
[14:00:40] vgutierrez: actually postponing that upgrade for today so there won't be any outage
[14:00:47] ack
[14:15:56] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9944984 (cmooney)
[14:30:04] netops, cloud-services-team, Infrastructure-Foundations, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9945048 (aborrero) Open→Stalled marking as stalled, because the work on ceph nodes won't be progressing for a while.
[14:51:49] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9945244 (Jhancock.wm) @cmooney got sretest2002 on lsw-d4, ports 44 and 45. 10G card.
[14:58:29] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9945280 (cmooney) >>! In T367512#9945244, @Jhancock.wm wrote: > @cmooney got sretest2002 on lsw-d4, ports 44 and 45. 10G card. Awesome thank...
[15:12:22] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9945320 (cmooney) All seems ok following the increase: {F56173453 width=500} FWIW the scraping is now taking longer, indicating that...
[16:03:51] hi traffic team! https://gerrit.wikimedia.org/r/c/operations/puppet/+/1030591 came in via the puppet window but it's deeper magic than I'm comfortable rubber-stamping -- can I ask one of you to review and merge it?
[16:22:41] * vgutierrez looking
[16:26:02] thanks <3
[16:32:48] 37 tests failed, 0 tests skipped, 0 tests passed
[16:32:57] something is definitely off there
[16:35:52] well, if you're going to be *picky* about it, yeah
[16:36:01] thanks again :)
[16:36:06] a silly syntax error
[16:36:09] replied to the CR
[17:52:38] Traffic, Data-Engineering, Observability-Logging, Patch-For-Review: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756#9946300 (Fabfur) Open→Resolved All cp hosts have been upgraded to 2.8.10
[18:28:38] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9946538 (cmooney) @Jhancock.wm can you confirm what position in the rack the server is in? I assumed based on the first port it's in U45 so I...
[19:09:02] netops, Infrastructure-Foundations, SRE: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106 (cmooney) NEW p:Triage→Medium
[19:10:45] netops, Infrastructure-Foundations, SRE: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#9946735 (cmooney)
[19:10:46] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9946736 (cmooney)
[19:55:28] Traffic: [ncmonitor] ncredir should second-level domains and check whether they're used - https://phabricator.wikimedia.org/T369114 (BCornwall) NEW
[19:55:58] Traffic: [ncmonitor] ncredir should second-level domains and check whether they're used - https://phabricator.wikimedia.org/T369114#9947017 (BCornwall) p:Triage→Medium
[20:16:10] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add new elements to automation to support new switches in codfw C/D -
https://phabricator.wikimedia.org/T369106#9947129 (cmooney)
[20:48:48] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9947208 (cmooney) Also @Jhancock.wm when next on site can you check the mgmt / idrac connection for this one? It doesn't seem to be trying to...
[21:31:34] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9947342 (Jhancock.wm) a:VRiley-WMF
[23:18:54] FIRING: SystemdUnitFailed: haproxy_stek_job.service on cp6007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed