[00:35:51] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) Looks like we've recovered about 5-6%, but still consistently regressed by 9-10% overall, e.g. 484ms... [00:37:36] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) Currently open patches: >>! From T238494#5750475: > [operations/puppet@production] mediawiki::webse... [00:44:48] ema: Not sure if this was previously realised, but the latency regression seems quite consistent at +50ms [00:44:48] https://phabricator.wikimedia.org/T238494#5854535 [00:45:18] it's not very widely distributed or inconsistent by page, area or network latency. [00:45:33] which I suppose further confirms that there is some specific work ATS is doing that is slow. [00:45:45] 50ms is an eternity for a CPU, I wonder what it is doing? [00:46:30] Krinkle: for cached responses or misses? [00:47:15] volans_: I assume cached, given misses are not enough to affect our global median. [00:48:02] since November, we have regressed by 50ms for the fetching of the HTML response to /wiki/:pagename globally [00:48:08] at both p50 and p75. [00:50:35] gilles: sounds a lot more serious now, not sure how I didn't realise this before. [01:00:00] For Europe specifically, the 50ms is a +26% latency regression at 50th percentile. [01:00:02] crap [01:01:05] no surprise obviously given a nearby DC. the 50ms is compared to shutting down esams [01:01:25] well, almost, I think esams-eqiad is maybe a little more than 50ms? [01:01:36] I'm no expert in the diffs, but are we caching less items? [01:02:07] bast3004-bast1002 is 83.3ms [01:02:13] thx [01:02:29] you mean that maybe we have more cache misses at the varnish-fe or ats-be layer? [01:02:48] so skewing the latencies to include a little bit of MW backend responses [01:02:56] that's possible in theory. I don't know if we have stats of that. [01:03:08] afaik we have varnish-fe stats [01:03:15] not sure about varnish-be vs ats-be [01:03:32] at varnish-fe layer nothing should have changed AFAIK [01:04:04] yeah, but it might tell us about more cache misses. [01:04:11] or... no, I guess not. [01:04:22] it won't tell us whether the varnish-be was a hit or miss previously, from -fe it's a miss either way [01:04:42] yep [01:04:58] I'm unable to find a dash with varnish or ats cache hit/miss stats [01:05:06] but I know about one for load.php reqs in particular [01:05:19] which also measures MW backend req/s [01:05:37] just found [01:05:38] https://grafana.wikimedia.org/d/w4TRwaxZz/local-backend-hitrate-varnish-vs-ats?orgId=1 [01:05:40] Krinkle: maybe https://grafana.wikimedia.org/d/000000304/prometheus-varnish-dc-stats?orgId=1 [01:06:06] MW backend request rate for load.php hasn't significantly changed since November 2019, except for this week, which is unrelated [01:06:21] https://grafana.wikimedia.org/d/000000066/resourceloader?orgId=1&fullscreen&panelId=46&from=1570407314525&to=1580911823196 [01:07:08] if there was something generally causing less caching in ats-be I'd expect this to have increased [01:08:05] volans: do we still have varnish backends live? [01:08:19] depends if what you're tracking manages to always be in the cache ;) [01:09:03] yeah but if e.g.
general cache capacity went down by some GBs or if some percent got lost in /dev/null failed to store in some race conditions e.g. [01:09:27] Krinkle: I thought not anymore, but I might be proven wrong. That dashboard seems it was a temporary one during the transition [01:09:51] volans: that dash shows "Local backend eqiad" as having ATS and Varnish [01:09:55] s/eqiad/esams/ [01:10:02] the eqiad one has the Varnish line stop in December [01:10:10] s/December/Jan 24/ [01:10:21] (sorry) [01:10:41] they both pick varnish_x_cache so no idea [01:10:45] but the esams graph still have a live Varnish line [01:10:46] if you look at the prometheus query [01:11:00] I think it was hardcoded (which hosts was what) [01:11:14] right [01:13:41] so these are probably all ATS now and maybe some of them didn't get re-used in ATS world and stopped [01:14:36] I guess you need to ask e.ma about that, also it's just one of the possibilities [01:14:47] for the added latency [01:16:14] We collect ServerTiming metadata now as part of our perf navtiming pipeline, which captures the X-Cache-Status value [01:16:34] so we can correlate if we are getting more or fewer miss/hit than before and if the regression is specific to them. [01:17:18] server-timing: cache;desc="miss" [01:17:18] x-cache: cp3052 miss, cp3060 miss [01:17:20] for example [01:18:25] gilles: could you share an example query that does a join with NavigationTiming and another one of our schemas like ServerTiming? Not sure off-hand how to do that exactly. [01:18:36] * Krinkle signs off for now [01:18:37] o/ [08:51:03] Krinkle: some examples https://phabricator.wikimedia.org/T240323#5727638 [09:33:08] 10Traffic, 10Operations: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) [09:45:25] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Lea_Lacroix_WMDE) Over the past weeks, we noticed a huge increase of content in Wikidata. Maybe that's... [09:56:11] vgutierrez: did nginx reuse conns to varnish-fe? eg prior to ats-tls migration [09:56:19] Krinkle: nope [09:56:24] k [09:56:42] nginx was injecting a "Connection: Close" header on every request towards varnish-fe [09:59:00] 10Traffic, 10Operations, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) p:05Triage→03Medium [10:10:28] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3065.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [10:11:39] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10ArielGlenn) >>! In T243701#5855352, @Lea_Lacroix_WMDE wrote: > Over the past weeks, we noticed a huge i... 
[10:13:11] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10JeanFred) [10:27:29] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10JeanFred) >>>! In T243701#5834751, @jcrespo wrote: >> While I understand the need of "slowing bot edit... [10:44:18] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3065.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3065.esams.wmnet'] ` [10:45:49] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3065.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002061045_vgutie... [11:18:48] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3065.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3065.esams.wmnet'] ` [11:26:19] 10Traffic, 10Operations, 10Inuka-Team (Kanban), 10MW-1.35-notes (1.35.0-wmf.16; 2020-01-21), 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10phuedx) >>! In T238029#5819088, @SBisson wrote: > A very special thanks to @phuedx for reviewing a... [13:03:05] 10Traffic, 10Operations: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) @bblack confirmed that we can drop RSA support on ncredir during the All Hands [14:46:59] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4025.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [15:17:38] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4025.ulsfo.wmnet'] ` and were **ALL** successful. [15:30:21] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4024.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... 
[15:36:04] 10Traffic, 10Operations: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) [15:45:20] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [15:46:02] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [15:58:06] gilles: thx, running queries now [16:00:55] 10netops, 10Operations: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10ayounsi) p:05Triage→03High [16:00:58] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4024.ulsfo.wmnet'] ` and were **ALL** successful. [16:04:00] 10netops, 10Operations: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (10ayounsi) Other side doesn't receive the light though: ` ayounsi@asw2-esams> show interfaces diagnostics optics xe-6/0/4 Physical interface: xe-6/0/4 Laser output power : 1.3540 mW / 1.3... [16:04:04] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4023.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [16:06:12] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10ema) >>! In T241145#5856431, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://too... [16:15:21] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [16:16:44] gilles: for re-use, I've persisted it at https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hive [16:30:39] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10Gilles) therockapplauds We will keep an eye on the trend in coming days to check how much of a dent it made in the perf regres... [16:35:50] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4023.ulsfo.wmnet'] ` and were **ALL** successful. 
[16:38:39] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (10BBlack) Status update: what's missing here is codfw, which will happen when we finish its hardware upgrade switch to lvs2007-10 [16:38:45] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [16:49:34] 10Traffic, 10Operations: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) [17:18:38] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (10Krinkle) In an IRC conversation with @volans we considered whether it's possible that the 50ms regression com... [17:18:57] volans: ema: See comment at https://phabricator.wikimedia.org/T238494#5856847 - I've somewhat ruled out any influence from app servers and ats-backend. [17:19:40] Also I'm curious if you know what "hit-local" means in this context vs "hit-front" [17:20:40] hit-front is a varnish-fe hit, hit-local means varnish-fe missed but ats-be (same dc, "local") hit [17:20:58] I see [17:21:07] we used to have a "hit-remote" as well with the tiered varnishes, but that doesn't really exist anymore (e.g. if esams front+back missed, but eqiad remote backend caches hit) [17:21:28] looks like we didn't see much hit-remote in the past, and now we're seeing less hit-local. [17:21:54] yeah the loss of the hit-remote hits was an expected tradeoff [17:21:58] miss definitely more common, but seem orthogonal to the latency regression for what it's worth [17:22:21] bblack: yeah, it was already marginal before the migration. [17:22:35] mostly where it was useful was not in normal runtime [17:23:00] e.g. if we wiped all the esams caches (or kept it offline long enough for them all to expire), you'd see a big spike in hit-remote when we first refilled them, helping avoid killing the applayer [17:23:07] right, if we want to depool esams for a few hours now, we'd more strongly have to consider wiping it entirely, or very gradually repooling? [17:23:23] given missed purges etc. [17:23:32] depends on the scenario, but yeah [17:23:41] previously wiping would be less controversial given we could backfill [17:24:06] even then, for a long time we had eqiad in opportunistic-hit mode for the pass-through from e.g. esams [17:24:27] so it only would pick up eqiad hit-remote for the global-hottest objects, but eqiad wasn't storing something that only esams was hitting but not local eqiad clients [17:24:48] right yeah makes sense [17:25:00] the eqiad ats-be cluster isn't "bigger" than the pops, is it? [17:25:01] the scenario is whether we lost purges or not, and whether we wiped (or effectively expired) the caches [17:25:20] in terms of number of backends/shards and overall unique objects it can store [17:25:21] (keeping in mind, ats-be doesn't lose cache on daemon/host restart) [17:25:33] it's not bigger than esams [17:25:37] cool [17:25:47] we basically have two cache cluster sizes: 8 node or 6 node [17:25:59] eqiad, codfw, and esams all have the 8-node size, and eqsin and ulsfo have the 6-node size [17:26:00] anyhow, so we're back to figuring out why ats-fe is more or less consistently +50ms slower to respond to cache hits. [17:26:11] ats-be?
[17:26:16] frontend [17:26:26] ats-tls [17:26:27] ok [17:26:29] right [17:26:33] ats-tls [17:26:46] probably part of it is that we've observed it's recycling connections like crazy [17:27:01] but I think there was some patchwork testing things about that this morning, which I didn't follow closely [17:27:26] it's been almost three months now that page view perf for Europe is pretty close to not having esams in terms of latency. [17:28:20] are you saying the problem only exhibits in esams? [17:28:41] no, it's just more pronounced there in terms of percentage, given proximity to the data centre [17:28:56] I guess I don't follow that [17:29:18] but if there's a +50ms perf regression purely in ats-tls->v-fe for hits, I would expect it to be universal to the sites [17:29:24] it would be odd for it to be specific to esams [17:29:39] connectivity is best in Europe on average in terms of absolute numbers so +50ms is percentage-wise biggest in Europe. [17:30:17] 50ms is ~25% of 200ms, but only 9% of e.g. 490ms [17:30:26] either way, if it's an internal software problem, they'd still see the same +50ms in eqiad, so it's not in fact anything like not having esams (in which case they'd have their +50ms plus the jump across the pond) [17:30:30] see from https://phabricator.wikimedia.org/T238494#5854535 down [17:30:46] yes nothing is specific to esams, it is seen globally [17:31:11] we've reduced it to a few countries and regions to be able to rule out unrelated factors and to confirm that it's an absolute delay added and not something proportional to something else [17:31:24] (the jump being ~83ms on our transport, but probably varies for random internet routing) [17:31:40] global ttfb p75 is up from 484ms to 535ms [17:31:57] europe ttfb p50 from 190ms to 241ms [17:32:01] etc [17:33:50] yeah but if you're going to drill this down, focus on the ms and the hits [17:33:56] like that table in the task [17:34:26] it does seem odd there's different results for different countries [17:34:51] is the data in that table all targeting esams? (I mean, doesn't include cases where the source country was FR, but they were misdirected somewhere other than esams?) [17:35:24] it's global, no dc filter at play [17:35:38] should check that if you're not, just to be sure [17:36:00] geodns routing can be inaccurate at times, and then there's also depool events we do for various reasons [17:36:37] https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1&from=1568937600000&to=1580947200000&var-metric=responseStart&var-location=All&var-prop=p75 [17:36:50] but still assuming all that's a non-issue, FR is showing hit-front +50ms at p50, DE is +39ms [17:36:58] bblack: no routing is influenced here, the country codes are the ones from Geo.country [17:37:01] DE had lower absolute numbers [17:37:10] see the Grafana graphs for easier display [17:37:13] it's very clear [17:37:30] the numbers are from real devices, they vary by the minute [17:37:33] not every day is equal [17:37:40] what I mean is [17:37:52] sometimes, on some days, or minutes, or hours, or even persistently [17:37:56] no two requests are equal, so can't compare them directly. [17:37:59] clients in FR or DE will connect to eqiad instead of esams [17:38:32] to get better statistical isolation, while pulling e.g. the "hit-front" data, if we can scope it down to just cp3 for these EU countries, it might create more-consistent data [17:38:39] globally all traffic latency is up by 50ms, in Europe overall it's 49-51ms depending on the day.
For individual countries it seems to be between 30 and 60 depending on the day. [17:38:54] That and of course people having different devices compared to then, and it being a different season. [17:39:00] 3 months is a lot of time [17:40:07] do you mean to rule out that the latency regression is caused by a majority if global traffic being routed to a dc that's +50ms on average further away to reach? [17:40:12] if/of* [17:40:17] no [17:40:24] because that seems unlikely :) [17:40:32] I mean to clean up the stats to better pinpoint causes [17:41:12] your table of data is saying ats-tls+v-fe are adding 39ms to DE requests and 50ms to FR requests [17:41:27] I've narrowed it down by country and by x-cache-status value and the regression is still visible within that subset. I'm not sure how to narrow it down further in a way that helps us determine the cause within this data which is client-side collected. [17:41:36] dc selection or host selection isn't gathered [17:41:42] the question is whether the increase is relative to the absolute numbers (DE is lower), or if it's a fixed +Xms [17:41:59] it's pretty much +50ms across the board. [17:42:46] the 6K hit-front samples from FR say +50ms [17:42:47] https://grafana.wikimedia.org/d/000000232/navigation-timing-by-country?orgId=1&var-metric=responseStart&var-country=Germany&var-prop=median&from=now-6M&to=now&fullscreen&panelId=7 [17:42:50] This is just Germany [17:42:52] over multiple days [17:42:53] the 9K hit-front samples from DE say +39ms [17:42:59] that's just 1 day [17:43:21] the last table's purpose was to confirm it isn't due to a shift from cache-hit to cache-miss or something like that [17:43:57] bblack: it's up from ~ 128 to ~170. [17:44:05] but you can pick any two days and get a different diff naturally [17:44:33] * Krinkle meeting [17:44:38] that makes a big difference to us, vs "it's a flat +50ms" [17:45:47] if I were to add sleep(50ms) to nginx 3 months ago and let it stay, I'd expect to see data consistent with what we have. [17:46:12] yes, perhaps in the averages you're looking at [17:46:38] I'm not interested in proving a point, I'm interested in data that helps me narrow down what the likely causes are internally [17:47:45] yeah, I'm just not sure how to help further beyond this. I don't have any more data that can help you from here short of having captured backtraces from nginx then vs now. [17:48:10] it was pretty much immediate after the switch (note, the graphs have moving averages so it's more gradual than it actually was, can be disabled if you edit the graph) [17:51:33] anyways, my best random-ish thoughts on things that could affect hit-front this way are: [17:52:18] 1) Even though nginx did do conn-per-req to v-fe before, new conns may be more-costly to ats-tls than they were to nginx, and we don't seem to have expected levels of conn reuse there... [17:53:01] 2) It could be that ats-tls is doing some store+forward stuff, since it's technically a proxycache configured as a pure proxy, whereas nginx was trying harder at being a passthrough. [17:53:29] we had done what we could to minimize any nginx-level buffering, but ats might be slurping responses before it sends the first byte to the client or something [17:53:59] in which case, the increase would also be sensitive to total response size [17:54:44] we could probably test that empirically, by trying an ats-tls config against a test backend that intentionally sends slow headers and/or content, to see how the buffering works.
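A minimal sketch of the slow-origin test bblack describes above, assuming a scratch ats-tls instance remapped to a local port; the port, URL path and exact nc flags (netcat-openbsd syntax here) are illustrative and not taken from the log:
    # Fake origin: send the response headers immediately, then trickle the body out.
    { printf 'HTTP/1.1 200 OK\r\nContent-Length: 10\r\nConnection: close\r\n\r\n'
      for i in $(seq 10); do printf 'x'; sleep 1; done; } | nc -l 8080 &
    # Fetch through the proxy under test: if time_starttransfer stays near zero the
    # proxy is streaming; if it jumps to ~10s the proxy slurped the whole body
    # before sending the first byte to the client.
    curl -sk -o /dev/null -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
      https://localhost/slow-body-test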
[17:55:18] if it's sensitive to client-side latency, that's a whole different thing maybe [17:56:05] (IOW, could be TLS-related) [17:56:10] (or TCP tuning related) [18:39:15] https://phabricator.wikimedia.org/P10323 [18:40:32] ^ even over the loopback, with a small file that's always a cache-hit (favicon clocking in at 2734 bytes), latency is inconsistent with a p90 at 22ms and p99 at 50ms [18:40:45] p50 is sub-ms [18:41:03] that probably says something right there, and is an easier starting point to work against [18:46:28] Interesting :) [18:47:04] I noticed a separate task about conn re-use as well, although from that I gathered yesterday that it wasn't changed (much) between nginx vs ats-tls [18:47:51] T244464 [18:47:51] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [18:48:03] bblack: related to T241145 ? [18:48:04] T241145: Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 [18:49:02] conn reuse hasn't changed a lot from nginx to ats-tls, if anything it's probably a little better, since nginx was explicitly one conn-per-req [18:49:22] but ats-tls may have some higher conn setup costs for whatever reason and be suffering from the lack of reuse more [18:49:30] RE: "we don't seem to have expected levels of conn reuse there" [18:49:43] was it under-expected before the migration as well? [18:50:03] I suppose it could still help gain perf win, even if it's not what caused it. all wins welcome to get us closer to where we were [18:50:08] nginx was actually explicitly configured to never reuse [18:50:32] for every request from a client (client-side connections obviously can persist), it would open a fresh connection to varnish-fe for one req and then close it [18:51:33] I'm a noob on what I'm about to mention, but, - afaik we don't do tls between ats-tls and varnish-fe / ats-be and then toward appserver-nginx we do have tls again, right? Would something like TLS 1.3 / 0-RTT be an option there if not already in use? [18:51:35] over the loopback, it's not much overhead. we had tried making it reuse before, but ran into issues that caused random 50x's [18:51:56] I realise now this wouldn't help with cache hits, just random thought. [18:52:40] yeah we're planning to look at that, but tls1.3 for those appserver connections is a little ways out, we have to upgrade operating systems first to get the right library support (in progress) before we can start trying that [18:53:15] the story of varnish seems to be converging on "we tried , then it serves random 5xx's" [18:53:19] for the ats-tls (or nginx) -> varnish-fe conns over the loopback, there's no tls (varnish-fe doesn't support it, which is why there's always a tls proxy in front in the first place) [18:53:48] the old issues with conn reuse into the front of varnish-fe were more-subtle [18:54:08] right yeah makes sense. [18:54:21] the basic problem is that not all http request-level errors can be cleanly recovered with an http connection [18:54:27] which is fine if it's one connection per client [18:54:29] new TCP connections one would think wouldn't take up more than 1 millisecond on localhost. [18:54:48] assuming no bottleneck is reached that results in queueing. [18:54:51] but if you're funneling N true clients through a proxy and having them share a connection into the next layer.... [18:55:23] hm..
head-of-line blocking with blind connection re-use could backfire [18:55:35] one client's request could cause a backend response that's a broken one at the request level, which breaks the connection's state, and then that client eventually gets a timeout, and then an unrelated client's request tries to go down the same connection and gets an error for no good reason, etc [18:55:46] but again, within localhost seems unlikely? I have no idea. [18:56:06] e.g. the gzip length errors we saw before [18:56:09] right [18:56:25] that flip-flopped back many times over the last month. [18:56:45] I believe we're now for 24h+ on the former varnish setting which is to tell app servers /not/ to compress [18:57:28] T242478 [18:57:29] T242478: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 [18:57:40] the length error thing is tricky [18:58:05] This was somewhat worse than how it was in 2016 before we turned it off on app servers because in 2016 it was Apache doing it, whereas now (after we had turned it off at all layers) MW PHP itself took it upon itself to handle it. [18:58:17] imagine an http server responds to a request with "CL: 1000", then sends 2500 bytes of content [18:58:25] so this length error should now finally be fixed again [18:58:37] if the connection is shared between proxied requests for client A and client B [18:59:28] hm.. connection re-use for TCP/HTTP 1.1, that's basically pooling right? It grabs one for the next request. Not multiplexed like HTTP 2 [18:59:36] A sends legit request, appserver sends CL:1000 + 2500 bytes. the proxy in the middle sends the first 1000 bytes to client A (they get a broken response, but there was no good answer there). Then client B sends a req, and the proxy forwards the req over the same connection to the backend (not realizing 1500 more bytes are pending) [18:59:55] and then it tries to read the response headers for client B's req, but it gets the random junk data leftover from the last request instead [19:00:06] so it 503's to B over the server's misbehavior [19:00:15] but B's request didn't even cause the issue, so it's very hard to debug [19:00:28] hm.. the protocol doesn't protect again that with some kind of packet number or ID of sorts? [19:00:33] against* [19:00:35] not in http [19:00:48] what if there is no content-length? [19:01:04] that's valid right? [19:01:09] well there's 3 basic ways, in http/1.1, to deal with content length [19:01:20] gotta go, meeting [19:02:25] 1) Chunked encoding (best) - this one prepends each data block with a length header, and has a terminal marker at the end of the data, so it can detect over/under-length as a timeout or error condition at the per-request level and not break the outer connection. [19:03:24] 2) Non-chunked, with a CL header - there's no termination marker. if not enough bytes are sent to match CL, something will eventually time out. if too many bytes are sent for the CL, they'll stack up in a socket buffer somewhere "unexpected" and break connection reuse on some future request (if a shared conn from proxying, possibly for an unrelated client) [19:04:04] 3) Non-chunked, no CL header - this is called "close delimited" and it's a relic from old HTTP, but still works and we still see it in some cases.
It's legit, and the server has to close the conn after it's done sending the response, and that's how you know how many bytes there were :/ [19:05:02] also, the initial response headers themselves are similarly-tricky (all the above was about body bytes) [19:05:20] they're terminated by a double-newline, and there's all kinds of ways that can go wrong with a broken server [19:06:49] any time you end up funneling multiple real-world clients' requests over shared internal proxy connections, you're accepting that some things of this nature above can happen, basically. [19:08:59] up in our layer, you can sort of sanitize all of this, at some cost [19:09:23] we can have our backend-most caches fully-slurp some or all responses and give them a corrected form, before upper-layer proxies see them. [19:10:01] (we had some VCL code about this in the varnish-be days, where if it was case (3) above, we'd slurp it and add the CL header before forwarding up to caches/proxies closer to the user) [19:10:26] (and ditto for case (1) just so we could have a CL value to make size-based caching decisions) [19:10:52] but having any proxy layer do a full slurp to fixup these issues means that layer becomes store-and-forward for those requests, which definitely will hurt perf. [19:11:28] vs the perf-optimal scenario where every layer streams data through asap with minimal local buffering, but then can't see a whole response before it has already taken actions and/or forwarded on the initial part. [19:14:04] H/2 changes everything at the connection level with real framing and all that jazz [19:14:25] H/2 for our internal connections, though, is not a solution that's on the immediate radar [19:14:40] varnish-fe doesn't support TLS or H/2, so we can't use it on our traffic layering anyways [19:14:54] but even for ats-be->app where we probably could use it, there's a lot to ponder. [19:15:33] the world of reusable http/1.1 connections and opening pools of them to funnel requests through is just better-known territory at this point, lots of unknowns for the H/2 version of that world. [19:15:49] since H/2 can multiplex many streams in one connection and interleave their data chunks [19:16:00] it may be a net-negative if used naively [19:16:11] (for these internal proxy shared-connection cases) [19:16:46] maybe if we could configure to use the H/2 protocol, but explicitly limit it to one stream per connection (basically just using it for its framing advantages and header compression), that might be something. [19:24:58] https://phabricator.wikimedia.org/P10324 [19:25:36] ^ like the last paste, but adding the XFP header to fake HTTPS to v-fe (otherwise it would 301->HTTPS), and comparing HTTPS (ats-tls) to HTTP (varnish-fe) both over loopback [19:26:19] the max for varnish-fe is 1.58ms, but even the p99 is 228us [19:26:26] (TIL wrk, thanks!)
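For reference, the P10324-style comparison could presumably be reproduced with something along these lines; the exact wrk invocation and object URL are not in the log, so the favicon path, Host header and flags below are guesses (single thread, single connection, hot cache hit over loopback):
    # varnish-fe directly on :80; the XFP header fakes HTTPS so v-fe doesn't 301.
    wrk -t1 -c1 -d10s --latency -H 'Host: en.wikipedia.org' \
        -H 'X-Forwarded-Proto: https' http://127.0.0.1/static/favicon/wikipedia.ico
    # the same hot hit via ats-tls on :443, http/1.1 over a single connection.
    wrk -t1 -c1 -d10s --latency -H 'Host: en.wikipedia.org' \
        https://127.0.0.1/static/favicon/wikipedia.ico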
[19:26:51] whereas the ats-tls is still showing the awful numbers, p99 at 53ms with a max of 87ms [19:27:17] even if we assume wrk is being dumb and counting tls connection setup against the first request, that should've only hurt 1 req out of the test [19:27:43] there's a better one called wrk2 on github, but I lazily used wrk because it's in debian already [19:27:52] https://github.com/giltene/wrk2 [19:28:45] in the case we're looking at here, I think the results are clear enough that better measurements aren't going to make it go away :) [19:34:51] with cp1075 depooled from all traffic, the numbers get way better [19:35:14] p99 ~1ms, with a max of ~9ms [19:35:30] which is at least closer to what v-fe itself is doing, and not completely unreasonable to the limits of such testing [19:35:42] so it is load-sensitive in some sense [19:36:29] maybe something to do with connection pools to v-fe [19:36:59] (which we've touched on before already, but maybe for different reasons than I was first thinking) [19:37:32] maybe it just doesn't do a good enough job ensuring that no clients queue to get slots on the connections open to v-fe (and/or have to open a new one on-demand) [19:38:23] if our frontmost ATS was actually a cache of course, such problems would be minimized because most cache hits don't have to proxy out to anywhere [19:38:36] but having ats-tls be non-caching proxy opens us up to all kinds of tricky tuning on these things [19:39:24] (ats-tls is non-caching because the logic of what can and can't be cached [currently] exists only in v-fe VCL?) [19:39:35] and other related reasons [19:39:54] the most-critical blocker is all analytics still comes from v-fe, so any request that hit a cache at the ats-tls layer would be invisible to us [19:40:09] ah ofc [19:41:54] if we were on-track to move caching into ats-tls soon maybe it would be defensible to wait and solve this problem later if it still persists for misses [19:42:13] but cf the conversation about lack of hit-for-pass in ATS and how that's putting ??? on how easily we can replace varnish-fe at al [19:42:16] *all [19:43:21] without something like hit-for-pass, we can't safely coalesce with suffering massive stalls from some cases. [19:43:51] *without suffering [19:44:27] speaking of varnish-fe... did we figure out the ulsfo FDs issue?? https://grafana.wikimedia.org/d/OU_pxz8Wz/cdanis-ulsfo-vcache-open-fds?orgId=1&from=now-7d&to=now [19:44:37] (unless we could identify all uncacheable traffic at request-time with a giant set of URL regexen to maintain across all services, but blech) [19:45:00] cdanis: I don't think it's solved yet, although there was some work in that area this morning that I'm not sure about the details of [19:45:25] that and this both point at issues with the ats-tls->varnish-fe connection pooling/reuse stuff though, so that's something [19:46:07] (I think they do anyways, through my fuzzy lenses) [19:50:37] there's some ATS settings to enable keepalive on the v-fe conns, and to require a minimum number of idle conns in the pool to avoid startup costs... 
probably some experimentation there could be fruitful [20:44:55] we are playing with KA on ats-tls on cp4031 [20:45:22] still that one is being affected by the FD issue [20:45:57] today at 14:48 it can be seen how ats-tls begins to fail reusing connections against varnish-fe [20:46:17] so I have on my to-do for tomorrow trying to pin that down [20:47:06] yeah the FD issue could still be separate, but if KA reduces the spam of accept/close to v-fe that might still be a win on other levels [20:47:23] yeah [20:47:55] enabling it is pretty easy right now (after I moved to hiera some previously hardcoded settings) [20:48:52] is it just "enabled" or also with the idle pool stuff? [20:49:12] just enabled [20:49:39] basically ats-tls connections against varnish-fe were reduced by ~50% [20:50:08] at least till 14:48 where cp4031 got the "malicious" traffic [20:50:15] vgutierrez: it can wait for tomorrow, but I think the settings are wrong? [20:50:19] CONFIG proxy.config.http.keep_alive_enabled_out INT 1 [20:50:19] CONFIG proxy.config.http.keep_alive_post_out INT 1 [20:50:28] ^ those are for the client-facing side, not the origin-facing side [20:50:39] err nope [20:50:51] oh right, _out [20:51:00] on ats _in are clients [20:51:09] _out origins [20:51:11] I got confused because the pooling thing has a completely different name [20:51:24] proxy.config.http.origin_min_keep_alive_connections [20:51:33] yeah [20:51:46] you'd think that should be proxy.config.http.min_keep_alive_connections_out :P [20:51:57] but it made me expect all of the origin ones to have origin_ prefixes heh [20:52:19] never expect consistency! [20:52:26] :) [20:54:43] vgutierrez: anyways, I'm done staring at it for today, but the pasted perf comparisons are pretty damning [20:55:23] https://phabricator.wikimedia.org/P10324 [20:55:37] this is a simple-enough test that it removes a lot of other question-marks from stats [20:55:59] it's a hot cache hit, small bytes output, testing directly over localhost, just comparing v-fe port 80 to ats-tls 443 basically [20:56:22] and notably, if I depool all other traffic from the server, the perf issue shown there mostly goes away [20:56:29] it's under real load that the latency shows up there [20:57:37] I think something is queueing (users/conns/reqs against each other) or buffering in ats-tls somewhere to cause that, I just don't know what yet [20:58:50] do you have a timestamp for your test? [20:59:40] 19:24 UTC? [21:00:11] it looks like ats-tls got restarted there at ~19:36 [21:00:28] 19:24 sounds roughly correct [21:00:35] At All Hands I heard that we were looking into having a second DC in Europe. Is there a task or wiki page about that? [21:00:41] I've repeated it pretty far apart [21:01:09] bblack: I'm wondering if https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqiad%20prometheus%2Fops&var-instance=cp1075&var-layer=tls&fullscreen&panelId=46&from=1581004055956&to=1581019765047 [21:01:12] could be the culprit [21:01:34] vgutierrez: why did ats-tls get restarted? [21:01:49] I did a depool -> repool cycle then, but 19:36 is when I repooled [21:02:10] oh nope sorry, just depooled [21:02:23] the SSL session cache graph can be misleading :) [21:02:41] I was repeating the test then, without the real-world load in play [21:02:52] the latency goes back down with no competition [21:03:07] hmm [21:03:18] so it's not like there's a fixed delay in ATS or whatever, it's just that as the pressure of req-rate and/or concurrency increases ats-tls is slowing down.
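A hedged sketch of the kind of single-host experiment being discussed above, using the records named in the conversation; the durable change goes through the hiera settings vgutierrez mentions, the pool size is an arbitrary example, and whether each record reloads cleanly at runtime would need checking:
    # On one (ideally depooled) cp host, pointed at the ats-tls instance's runroot:
    # enable keepalive towards varnish-fe (the origin side = *_out records)...
    traffic_ctl config set proxy.config.http.keep_alive_enabled_out 1
    traffic_ctl config set proxy.config.http.keep_alive_post_out 1
    # ...and keep a minimum pool of idle conns open to v-fe to avoid setup cost.
    traffic_ctl config set proxy.config.http.origin_min_keep_alive_connections 12
    traffic_ctl config reload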
[21:03:25] I think we could re-run your test with proxy.config.ssl.session_cache.skip_cache_on_bucket_contention set to 1 [21:03:35] which probably means something something handwavy about queueing/buffering in ATS and limits [21:03:53] my test uses a single connection [21:04:08] ack, forget it then [21:04:16] it opens 1x TLS connection, then makes serial requests as fast as it can for the same small cache hit over and over [21:04:57] http/1.1 or http/2? [21:05:06] gilles: there's not, it's still early days, it's going into the annual planning stuff to talk about that and other DC expansion plans. [21:05:24] 👍 [21:05:41] But annual planning starts now, right? 😉 [21:05:50] vgutierrez: http/1.1 [21:06:05] gilles: yes :P [21:06:50] bblack: I'm wondering if ats-be shows the same performance [21:07:03] maybe [21:07:10] lets see.. [21:07:16] but we'd probably care or notice less [21:07:35] because its origin is across real latencies, and it's lower traffic amounts for true misses and uncacheables [21:07:46] I know [21:07:47] so it's harder to find and prove the same kind of issue [21:08:04] whereas for a v-fe hit it's very easy to eliminate a bunch of other variables [21:08:06] but it could tell us where to focus our efforts on tuning ATS [21:10:06] yeah [21:10:50] I think one of the easier things we can probably figure out, is if the delay is happening on the request side or the response side [21:10:54] bblack: https://phabricator.wikimedia.org/P10324#60043 [21:11:11] ats-be is way faster than ats-tls->varnish-fe [21:11:24] we can probably see that from tracing timings of the test requests at both layers: is it queueing up in ATS before it can ask v-fe over a connection, or it's later during response-side buffering? [21:12:03] vgutierrez: that's... interesting! [21:12:26] well, maybe [21:12:41] the big difference is that ats-be is going to cache favicon too [21:12:54] so ats-be doesn't have to ask anyone else anything, it's just serving a cache hit directly [21:13:05] whereas ats-tls does have to ask v-fe [21:14:02] not apples-to-apples or whatever, and still I think it's maybe-questionable that ats-be has max=50ms, p90=13ms, etc... for a hit [21:14:12] may be symptoms of the exact same problem, donno [21:15:21] yup, v-fe is more steady [21:16:20] who knows, it's disk after all, you can handwave that [21:21:17] BTW [21:21:27] for some reason ulsfo is steady right now in terms of FDs [21:21:49] cp4031 is higher than the others.. but "fixed" at 45k [21:25:15] going back to your benchmark, ats allows to log the time spent on every stage of the request [21:25:32] so it's pretty easy to log where it is spending time [21:28:07] is there a quick howto somewhere on how to check that stuff? [21:30:14] https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/logging/formatting.en.html#timestamps-and-durations [21:30:35] msdms does that [21:30:59] but I'm trying to remember, cause you can get a similar output on the debug [21:31:10] and that can be enabled on-the-fly [21:32:45] right [21:36:35] traffic_logcat? [21:37:00] what I was thinking is not suitable for a production pooled host [21:37:21] in the past I've done some tests using proxy.config.http.slow.log.threshold [21:37:28] with a very log threshold like 1ms [21:37:33] *low [21:37:38] right [21:38:05] but we can emulate that [21:38:22] can traffic_ctl set that stuff at runtime? 
[21:38:59] yes [21:39:15] so, the slow log request logs these milestones: https://github.com/apache/trafficserver/blob/8.0.x/proxy/http/HttpSM.cc#L7008-L7054 [21:39:17] ok [21:39:40] yeah [21:39:52] so I could flip it on, run my 10s test, then look in the log for my URI and the longer times, etc [21:40:08] there will be prod noise in there, but probably can glean something [21:40:27] I'd do the following [21:41:08] create a new log format wmf-slowrequest with the same format as the slowlog [21:42:00] create a log filter matching a custom header like X-WMF-Slow-Log [21:42:14] right [21:42:16] make wrk inject that header and get the logs [21:42:30] makes sense! [21:42:41] and I'll keep that slow log on those ats instances actually [21:42:46] cause it could be handy in the future as well [21:43:42] hieradata/common/profile/trafficserver/backend.yaml --> profile::trafficserver::backend::log_formats: already contains some msdms that can be used as an example [21:43:55] and profile::trafficserver::backend::log_filters already has a HTTP Header filter [21:44:08] so you could build it using that as an example [21:44:16] otherwise I'll do it tomorrow EU morning [21:46:12] yeah, I like the idea :) [21:46:52] I'm gonna catch some sleep and work on it tomorrow if you don't do it sooner [21:47:06] vgutierrez: I'll let you do it then, I've got some other things to get to, and I'll stumble around on a bunch of basic things on the way if I try :) [21:47:14] cya tomorrow :) [21:47:17] nice :D [21:52:20] 10Traffic, 10Operations: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10Vgutierrez) [21:53:00] 10Traffic, 10Operations: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10Vgutierrez) p:05Triage→03High [23:16:39] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10Jclark-ctr)
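A rough sketch of the quick version of the slow-log plan discussed above (the 1ms threshold, the wmf-slowrequest format name and the X-WMF-Slow-Log header come from the conversation; everything else, including the wrk details, is an assumption):
    # Log per-request milestones for anything slower than 1ms on the ats-tls
    # instance under test, then reload.
    traffic_ctl config set proxy.config.http.slow.log.threshold 1
    traffic_ctl config reload
    # Re-run the loopback benchmark while tagging the test traffic with the header
    # the proposed log filter would match on.
    wrk -t1 -c1 -d10s --latency -H 'Host: en.wikipedia.org' \
        -H 'X-WMF-Slow-Log: 1' https://127.0.0.1/static/favicon/wikipedia.ico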