[09:27:55] hi! when you get the chance, I have another try for varnishlogconsumers logging at https://gerrit.wikimedia.org/r/c/operations/puppet/+/572012
[10:11:22] godog: another try? what went wrong with the previous attempt?
[10:16:59] vgutierrez: journald on < buster has a 2k line length limitation, some logs were hitting that and breaking the json
[10:17:50] ack
[10:43:55] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 5 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (Lucas_Werkmeister_WMDE)
[11:28:34] Traffic, Operations: Session resumption seems to be broken in ATS for TLSv1.3 - https://phabricator.wikimedia.org/T245419 (Vgutierrez)
[11:30:23] Traffic, Operations: Session resumption seems to be broken in ATS for TLSv1.3 - https://phabricator.wikimedia.org/T245419 (Vgutierrez) p: Triage→High
[11:31:01] Traffic, Operations: Session resumption seems to be broken in ATS for TLSv1.3 - https://phabricator.wikimedia.org/T245419 (Vgutierrez)
[12:39:41] Traffic, Operations, Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (Vgutierrez) `Tested again on cp1075 (now running buster) before disabling DNS on ats-tls and after: `name=before vgutierrez@cp1075:~$ ./hey -c 1 -z 10s https://en.wiki...
[12:39:41] Krinkle: ^^ check those figures :)
[12:39:43] vgutierrez: slowest went from 78ms to 28ms. Hello 50 ms :D
[12:39:43] That's one slow local DNS lookup stuck under heavy-ish load?
[12:39:43] And double the rps, nice
[12:39:43] yup, 2.54x
[12:39:43] I'm still very puzzled at how local DNS could take that long; it surely seems fixable, e.g. for other users of TLS if they needed this, and maybe for us in the future after the removal of varnish. It's surprisingly consistent at 50 ms whenever the load is high enough to trigger it, which suggests to me it's something other than computation logic (unless it's super complex), but rather some kind of queue or timeout
[12:39:44] Or maybe 50 happens to be the result of something specific to our CPU and hardware and is just a coincidence that it's such a round number for us
[12:39:44] indeed, but in the current scenario it's better if ats-tls behaves as transparently as possible
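A minimal sketch of the before/after comparison discussed above, assuming the hey load generator (the -c/-z flags match https://github.com/rakyll/hey) is available on the cache host; the target URL is a placeholder, not the one from the truncated paste:
  # baseline, with ats-tls still doing its own DNS lookups
  ./hey -c 1 -z 10s https://<host-under-test>/
  # apply the puppet change that disables DNS lookups in ats-tls, let puppet run, then repeat
  ./hey -c 1 -z 10s https://<host-under-test>/
  # compare the Slowest/Fastest latencies and Requests/sec between the two summaries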
[12:39:44] Thanks vgutierrez, looking forward to seeing it rolled out. Looks like it requires Buster?
[12:39:44] it's already rolled out
[12:39:44] the whole cp cluster has been running buster since last Thursday ~2pm
[12:39:44] this has been applied today at... ~06:20 UTC + 30 minutes of puppet runs across the cp fleet
[12:39:45] https://phabricator.wikimedia.org/T242093 --> yeah, Thursday at 1pm actually :)
[12:39:46] I'll try to find the graphs in a bit. At first glance I don't see the improvement from today. Strange
[12:39:47] vgutierrez: did that involve a kernel upgrade implicitly?
[12:39:47] yes
[12:39:47] 4.9 -> 4.19
[12:39:47] cool.
[12:39:47] right now we're running 4.19.98-1
[12:39:47] Is it possible the new kernel might not be as tuned yet? E.g. some default differences that we haven't pinned in puppet
[12:39:47] it could be
[12:39:47] but that cp1075 test
[12:39:47] was with the same kernel
[12:39:47] I ran the before test, applied puppet, and ran the after test
[12:39:48] so the only difference is disabling DNS on ats-tls
[12:39:48] https://phabricator.wikimedia.org/T244538#5889836 <-- that one :)
[12:39:48] Yeah, that one makes sense
[12:39:48] so that got applied cluster-wide this morning
[12:39:48] 07:30 UTC if you wanna play safe
[12:39:49] the stretch upgrade happened gradually between Thursday 6th and Thursday 13th IIRC
[12:39:49] *buster upgrade
[17:30:24] https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1&from=now-2d&to=now&fullscreen&panelId=2&var-metric=responseStart&var-location=Europe&var-prop=p75
[17:30:35] Maybe an unrelated regression?
[17:30:49] Quite a lot of volatility today compared to earlier
[17:35:28] (nvm, seeing the reason now)
[22:51:28] Traffic, Cloud-VPS, DNS, Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (TheDJ)
[23:27:43] vgutierrez: with the above issue resolved and some more hours passed, I think we can say with relative certainty that nothing has changed in terms of response start latency
[23:27:54] not seeing any difference compared to a day, week or month ago
[23:28:47] globally, mobile, desktop ;; median, p75
[23:28:56] but I'll wait until tomorrow to be completely sure
[23:29:13] the rollout at 6-7 AM was exactly at the daily peak where the aggregate latency is worst, so it's hard to be sure right now
[23:31:57] https://grafana.wikimedia.org/d/000000038/navigation-timing-by-platform?orgId=1&fullscreen&panelId=28&var-metric=responseStart&var-platform=desktop&var-users=anonymous&var-percentile=p50
[23:43:14] Traffic, Operations, Performance-Team, Patch-For-Review, Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (Krinkle) I've updated some of the navtiming dashboards in Grafana to include a comparison line for "1 year ag...
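A minimal sketch of how the TLSv1.3 session resumption issue tracked in T245419 above could be checked by hand, assuming OpenSSL 1.1.1's s_client; the hostname is a placeholder:
  # first connection: save the session ticket the server sends after the handshake
  openssl s_client -connect <cp-host>:443 -tls1_3 -sess_out /tmp/tls13.sess < /dev/null
  # second connection: offer the saved ticket and check whether the server resumes it
  openssl s_client -connect <cp-host>:443 -tls1_3 -sess_in /tmp/tls13.sess < /dev/null | grep -E '^(New|Reused),'
  # "Reused, TLSv1.3, ..." means the session was resumed; "New, TLSv1.3, ..." means a full handshake.
  # Note: with TLSv1.3 the ticket only arrives after the handshake finishes, so the first connection
  # has to stay open long enough for -sess_out to capture it.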