[00:47:58] Traffic, Operations, Goal, Performance-Team (Radar), Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Krinkle) Resolved→Open Re-opening and tracking as on-going perf incident per the above. As @Gilles mentioned, it would help if we can at least isolate/...
[00:48:00] HTTPS, Traffic, Operations, Security: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298 (Krinkle)
[00:52:11] Traffic, Operations, Performance-Team, Goal, Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Krinkle)
[00:52:31] Traffic, Operations, Goal, Performance-Team (Radar), Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Krinkle)
[00:55:47] Traffic, Operations, Patch-For-Review, Wikimedia-Incident: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (Krinkle)
[08:46:35] XioNoX: there is potential mgmt network issues for eqiad A6
[08:46:54] things flopping up and down for a few hosts there
[08:59:14] FYI, vhtcpd.prom is reporting stale in icinga, I'm guessing expected
[09:03:10] jynus: where?
[09:03:32] see CCed ticket with details
[09:05:45] ok, thx
[09:05:59] yeah that's for DCops
[09:06:07] interface seems to be flopping
[09:06:27] but I saw it occurring on other hosts too
[09:06:43] https://phabricator.wikimedia.org/T250652#6070579
[09:07:11] msw-c6-eqiad died not long ago, msw1-a6-eqiad is 9 years old, so it's possible that it's going to die too
[09:07:23] so this is all FYI
[09:07:30] no impact on us
[09:07:48] yep, thx for the head's up
[09:07:50] no *production* impact
[09:33:08] made it high priority to replace the switch, iirc DCops have a spare
[09:34:45] so only really good actionable here is, is there a good way to differentiate switch issues for host (mgmt) issues?
[09:35:11] s/for/from/
[09:35:24] I guess that is more of a question for dcops
[09:35:25] sorry
[09:35:58] jynus: rule of thumb, if more than 1 host is having an issue it's a switch issue
[09:35:58] also as you say, logs, but those seem nonexistent :-D
[09:36:28] those switches are not-managed, so no logs indeed
[09:36:53] to be fair, they are mgmt interfaces, which is a lower priority
[09:37:45] mw1309.mgmt, mw1311.mgmt now
[09:37:51] will check if same location
[09:38:37] yeah, A6, so 100% convinced :-D
[09:38:54] I will update the ticket title
[09:39:11] thx
[09:39:48] we can argue, but failing mgmt switch is still important, especially in those times where it's not easy to access the DC
[09:39:56] +1
[09:40:33] I see it as "first thing to do when you get to the DC" and not "go to the DC right now" too
[09:40:38] yep
[09:40:54] with lower I meant "ok if they are old"
[09:41:05] not as much the onsite work
[09:45:02] godog: ah thanks for spotting the vhtcpd stale files. That's due to the ongoing migration to purged. What's the best way to clear those in puppet?
[09:52:45] godog: is it just a matter of removing the .prom file?
[09:54:13] ema: it is yeah
[09:54:53] excellent
[09:57:59] I think having puppet do the needful would be good in this case
[10:08:47] godog: agreed. I gave it a go: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/591010
[10:11:26] neat, +1
[10:31:13] netops, Operations: Investigate unicast RPF loose mode - https://phabricator.wikimedia.org/T244147 (ayounsi) Open→Resolved Default changed to sample + discard on all routers.
[10:31:16] netops, Operations: Investigate unicast RPF loose mode - https://phabricator.wikimedia.org/T244147 (ayounsi)
[11:15:06] godog: alerts are gone :)
[11:36:32] Traffic, Analytics, Operations: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (ema) >>!
In T237993#6066376, @elukey wrote: > There is currently too much data that flows to kafka, for cp3050 we have 36GB * 12 partitions for a single day, definitely too much. How much...
[11:42:25] ema: \o/ thank you sir
[11:47:34] Traffic, Operations, Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979 (Gilles) Linkedin saw a 2-6% improvement on page load time: https://engineering.linkedin.com/blog/2017/05/boosting-site-speed-using-brotli-compression
[11:49:07] Traffic, Operations, Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979 (Gilles) @ema I assume that ATS frontends as currently deployed support Brotli, right?
[14:06:48] bblack: around? I'm curious if you've been following any of the nic-saturation-exporter work, and if you think it makes any sense at all for it to monitor/export things about virtual vlan interfaces (like we have on LVSen)
[14:23:48] Traffic, Operations, Goal, Performance-Team (Radar), Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (ema) >>! In T170567#6069972, @Krinkle wrote: > Re-opening and tracking as on-going perf incident per the above. As @Gilles mentioned, it would help if we can a...
[14:29:35] Traffic, Operations, Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979 (ema) >>! In T137979#6071140, @Gilles wrote: > @ema I assume that ATS frontends as currently deployed support Brotli, right? We would need to enable the [[https://docs.trafficserver....
[15:05:02] netops, Operations, observability: Investigate Juniper structured logs - https://phabricator.wikimedia.org/T250703 (ayounsi) p:Triage→Low
[15:05:43] cdanis: for saturation I don't know if vlans makes physical sense
[15:05:52] bblack: yeah, I decided they didn't
[15:06:07] as long as we can get the data for the underlying physical port
[15:06:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/591054
[15:06:49] ok cool
[15:07:01] none of those things make sense wrt saturation :)
[15:07:22] (and I believe, thanks to eyeballing cumin output, that the list there is comprehensive)
[15:16:55] Traffic, Operations, Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Gilles) a:Gilles
[15:17:11] Traffic, Operations, Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Gilles) p:Medium→High
[16:12:07] Traffic, Operations: Implement TTL cap for ats-be - https://phabricator.wikimedia.org/T249627 (ema) Open→Resolved