[00:47:12] 10Traffic, 10Gerrit, 10Operations, 10Phabricator, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3798765 (10mmodell) One easy way to implement this (at least for phabricator) would be to just bring...
[01:33:09] 10Traffic, 10Gerrit, 10Operations, 10Phabricator, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3765427 (10Dzahn) I was thinking about "jouncebot: next" when reading this. I mean it already output...
[10:18:02] XioNoX: interesting, thanks for confirming my suspicion
[10:18:45] XioNoX: how can I check next time if indeed the circuit is flapping?
[11:31:28] yeee! prometheus is polling cp1008 again :)
[11:32:04] \o/ \o/
[12:53:20] bblack: ema: I've tried downloading something as slow as I can (308kb of JS in 79s with high latency) and it doesn't show up in the slow varnish log
[12:53:47] using tc
[12:55:32] is there an equivalent of the X-Wikimedia-Debug header to hit a specific varnish frontend?
[14:06:45] I've also tried macOS's network link conditioner and same thing, no matter what I do Resp[2] always stays at the same extremely low value, no way I can trigger a big one like in the slow log
[14:11:39] gilles: nope, there's no way to hit a specific frontend
[14:12:00] I ended up just watching the logs on a bunch of them in esams
[14:12:14] next step I'm going to try with a tethered real-life shitty 3G connection
[14:12:51] but it's a bit too shitty for me to watch the logs at the same time on the same connection, I'll have to wait until tonight when my dad is at my place and I have a second 3G connection to play with
[14:13:59] gilles: I guess you could try using a tool that only fetches a few bytes at a time (or search for an existing one)?
[14:14:05] actually seems like the SSH connections are holding up
[14:14:15] heh, s/try using/write/
[14:14:30] ema: I tried 2 of them, but it's unclear how they behave under the hood
[14:14:42] they might be simulating some part of the shittiness locally
[14:15:03] anyway, real-life awful 3G should be a good enough test
[14:17:41] gilles: why do you want to hit a specific varnish BTW? All your requests should end up on the same one unless your IP changes
[14:18:11] they rotate because I'm using wget/curl, so no persistent connection between my requests
[14:19:05] mmh, no, they should not rotate. frontend selection is done by IPVS hashing on your source IP
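[editor's note: since frontend selection hashes on the source IP, the frontend serving a given client can be read off the response rather than chosen; a small sketch, with the URL illustrative]

```
# the rightmost entry in X-Cache names the edge frontend your IP hashes to
curl -sI 'https://upload.wikimedia.org/wikipedia/commons/6/63/Wikipedia-logo.png' | grep -i '^x-cache'
```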
[14:21:30] one request: X-Cache: cp1052 pass, cp3040 miss, cp3040 miss
[14:21:38] another one shortly after: X-Cache: cp1048 hit/1, cp3048 hit/9333, cp3047 pass
[14:21:47] same client IP, using wget
[14:22:38] now on my real-life shitty 3G (tm) for a download that takes minutes, I get slightly higher Resp but still nothing comparable to the download time
[14:22:46] 1512051648.804439 Start: 0.000000 Req: 0.000000 Fetch: - Process: 0.000048 Resp: 1.784621 GET http://foo HTTP/1.1 200
[14:22:51] cp3040 is a text node, cp3047 is upload
[14:23:04] ah ok
[14:23:13] :)
[14:24:04] I was watching the middle server then
[14:24:09] let me watch the logs of the edge
[14:29:52] yeah so, using a varnishncsa command similar to the slow log's, filtering by X-Client-IP in the latest example above I see the entry in the logs on the cp3048 machine but not on cp3047
[14:30:26] and Resp is low and happens shortly after the download starts, regardless of the fact that it takes minutes for the client to complete it
[14:30:55] here I'm talking about a download over real 3G that takes 4+ minutes
[14:31:07] this is why I'm skeptical that high Resp can be the client's doing
[14:31:11] weird, you should see the log on cp3047 too. What's the varnishncsa command you're using? Are you passing `-n frontend`?
[14:31:59] I am not, I guess that's needed? :)
[14:32:28] yup! otherwise by default varnishncsa is going to read from the varnish-backend shared memory
[14:32:40] gotcha, let's try again this way
[14:32:54] it makes sense that the middle server has a fast Resp if all it does is forward it to cp3047
[14:33:00] indeed
[14:34:04] so to recap the frontend/backend thing: each cache node has two varnish instances, frontend and backend. The edge-most varnish (rightmost one in X-Cache) is always gonna be a frontend, all the other cp hosts in X-Cache are backends
[14:34:42] selection of the frontend is done by hashing on the client IP, backends are chosen by hashing on the request URL
[14:45:23] I'm going to try once more, but this time I see the request, however it has no Resp value
[14:45:38] no Resp[2] to be precise
[14:45:48] Resp[3] sorry
[14:51:57] no Resp[2] either -_-
[14:52:44] in the slow log the entries with high Resp are coming from non-edge servers anyway
[14:57:24] if we have an entry in the logstash slow log for cp3038 with a high Resp value, isn't that the time it took to send the contents to cp3047?
[14:58:53] still catching up, but 308kb is likely not enough to fill buffers
[14:59:02] try a much larger file?
[14:59:04] I'm trying a video now
[14:59:08] still the same result
[14:59:22] 182MB
[14:59:38] gilles: if the cp3038 log is related to a varnish backend then yes, it's the time spent sending the response to an edge varnish frontend
[15:00:33] for the 182MB file, the frontend is going to be in pass-mode
[15:01:18] with the 182MB file downloaded over 3G, when I caught it on the backend server, Resp was still very low
[15:01:28] ok
[15:01:57] but it never appears in the fe logs?
[15:02:03] I can try again, but the most I squeezed out of it was 1 second or something, when the download took 4+ minutes
[15:02:35] it appears much later, but without the Resp value
[15:02:56] what's your example video file, and which fe+be in esams does it pick?
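[editor's note: a minimal sketch of watching both instances just recapped; on these cache nodes the frontend varnishd runs as a separate instance named "frontend", and without -n varnishncsa attaches to the backend instance's shared memory. The client IP is a placeholder]

```
# edge frontend (rightmost X-Cache entry; selected by client-IP hash)
varnishncsa -n frontend -q 'ReqHeader:X-Client-IP eq "192.0.2.1"'

# backend (the other X-Cache entries; selected by request-URL hash);
# this is what varnishncsa reads by default when -n is omitted
varnishncsa -q 'ReqHeader:X-Client-IP eq "192.0.2.1"'
```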
[15:03:00] essentially I can't reproduce the high Resp values we're seeing with a real-life slow connection
[15:03:14] (so I can look at it myself while you retry)
[15:03:19] it's good old debbie
[15:03:42] I have to run to pick up my son from school, I can do it one more time but then I really have to go
[15:03:51] X-Cache: cp1048 hit/1, cp3048 hit/14066, cp3047 pass
[15:04:06] what's the exact varnishncsa command you're using?
[15:04:10] http://upload.wikimedia.org/wikipedia/commons/transcoded/c/c0/Debbie_Does_Dallas.ogv/Debbie_Does_Dallas.ogv.240p.webm?%28REINALDO-NEWS%29.webm
[15:05:39] varnishncsa -n frontend -q "ReqHeader:X-Client-IP eq ${1}" -F '%{VSL:Timestamp:Start[1]}x Start: %{VSL:Timestamp:Start[3]}x Req: %{VSL:Timestamp:Req[3]}x Fetch: %{VSL:Timestamp:Fetch[3]}x Process: %{VSL:Timestamp:Process[3]}x Resp: %{VSL:Timestamp:Resp[3]}x %r %s'
[15:05:56] and then I also added Resp[2] at some point
[15:05:56] gilles: can you PM me your IP too to filter down?
[15:06:16] it's a dynamic IP, right now 37.168.227.49
[15:06:23] should stick while I test this
[15:06:31] let me know when you want me to start the download
[15:07:04] ok try
[15:07:21] started
[15:07:22] X-Cache: cp1048 hit/1, cp3048 hit/16122, cp3047 pass
[15:07:40] X-Varnish: 532322120 529029886, 285352313 279412954, 967008902
[15:07:49] but still streaming down the response, right?
[15:07:58] right, for another 4ish minutes
[15:08:02] ok
[15:08:14] I got results on the BE immediately, still waiting on the FE log entry to show up
[15:08:22] yep
[15:08:24] (probably after xfer finishes, roughly)
[15:08:26] that's my experience
[15:08:42] actually happened some time before the download ended, for the FE
[15:09:13] right, at some point Varnish will be done before the network stack is, basically
[15:09:39] ok it already logged on the fe
[15:09:49] the reason there's no resp value is:
[15:09:51] - VSL timeout
[15:09:51] - End synth
[15:10:00] the shmlog timed out waiting for transaction to complete
[15:10:21] at the moment I don't remember if that's something set per-varnishd in tuning, or in the shm client (varnishncsa/varnishlog)
[15:10:27] so, 4 minutes is too long in a way?
[15:10:47] well, too long for this particular varnishlog I was running to capture anything?
[15:10:55] there should be a knob in varnishncsa/log
[15:11:04] [-T <seconds>] Transaction end timeout
[15:11:08] ^ varnishlog
[15:11:10] so yeah that
[15:11:13] wanna try again?
[15:11:24] yeah ready
[15:11:40] X-Varnish: 532322120 529029886, 276976297 279412954, 967129245
[15:11:45] I set "-T 900" this time
[15:12:50] of course varnishncsa does not have an equivalent of -T, heh
[15:13:49] I wonder, what happens if the shm rolls over before the txn completes?
[15:13:52] does it just lose the record?
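[editor's note: the -T knob found above is exposed by varnishlog, not by varnishncsa at the time; a sketch of the capture that finally caught the slow transfer, with the IP and timeout values illustrative]

```
# raise the VSL transaction-end timeout (default 120s) so a multi-minute
# client transfer completes before the VSL API gives up on the transaction
varnishlog -n frontend -T 900 -q 'ReqHeader:X-Client-IP eq "192.0.2.1"'
```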
[15:15:19] download complete
[15:15:42] yeah it finished up with:
[15:15:44] - VCL_return deliver
[15:15:45] - Timestamp Process: 1512054692.904468 0.000448 0.000039
[15:15:45] - Debug "RES_MODE 2"
[15:15:45] - RespHeader Connection: close
[15:15:47] - Debug "Hit idle send timeout, wrote = 53544508/181263892; retrying"
[15:15:50] - Debug "Hit idle send timeout, wrote = 55398618/127719384; retrying"
[15:15:53] - Debug "Hit idle send timeout, wrote = 48653869/72320766; retrying"
[15:15:56] - Timestamp Resp: 1512054906.872089 213.968069 213.967621
[15:15:58] - ReqAcct 441 0 441 888 191176327 191177215
[15:16:14] nice
[15:16:46] so we should figure out what the txn timeout is for varnishncsa
[15:16:55] probably the cases we're logging now are >60 but less than whatever that is
[15:17:22] right
[15:18:08] is cp3049 an edge server as well?
[15:18:14] just checking an entry for the same file in the slow log
[15:19:21] gilles: all servers run a varnish frontend (edge) process and a varnish backend one
[15:19:33] gotcha
[15:19:54] so that explains why we were not seeing crazy high values in the slow log, which I found weird if that was correlated to download time
[15:19:58] it all makes sense now
[15:20:16] so what we often call vsm is actually called vsl. vsm is where stat counters are stored
[15:20:32] vsl is, well, where VSL records end up :)
[15:20:49] vsl_space
[15:20:54] it also explains why that request from the user who prompted all this research wasn't in the slow log, because it took 2+ minutes, which is probably over that limit. hard to say though if the time for that guy was spent in Resp or elsewhere
[15:20:55] The amount of space to allocate for the VSL fifo buffer in the VSM memory segment. If you make this too small,
[15:20:58] varnish{ncsa|log} etc will not be able to keep up. Making it too large just costs memory resources.
[15:21:50] we probably want to switch to Fetch[3] for the slow log as you said bblack and it'd be good to know if it's subject to the same upper limit (I'm guessing it is, since we're talking about TX as a whole)
[15:21:58] bblack: ugly paste! but yeah, I assume that "not being able to keep up" translates to "it just loses the record"
[15:22:14] we can probably have a lower threshold than 60s for Fetch[3] though
[15:22:37] anyway, I really have to go now, I'm glad we understand all of this better now
[15:22:40] ttyl
[15:22:45] see you!
[15:25:14] the default '-T' option is 120s
[15:25:34] and varnishncsa doesn't implement it as a CLI option, but it's actually baked into the VSL core and would be trivial to add
[15:26:03] #include
[15:26:20] it was actually rather tricky to even find the default heh
[15:27:23] one more wmf varnish patch then I guess
[15:27:52] I'm pretty sure the whole patch would be to add "VSL_OPT_T" to the list near the bottom of bin/varnishncsa/varnishncsa_options.h and it would Just Work, everything else is automagic
[15:28:53] here's the deeper question, though:
[15:29:15] we can't size the shmlog arbitrarily huge, and we can't set -T to infinity
[15:29:30] so if we're trying to catch the slowest of transactions, our filter should really be something like:
[15:29:57] if Fetch[3]>N || shmlog_lost_record || VSL_timeout
[15:30:18] (well nevermind whether Fetch[3] is the right thing to trigger on, but still)
[15:30:49] we can't just end up ignoring those whose timeouts exceed the VSL timeout or get shmlog-wrapped. I'm pretty sure we can filter on VSL timeouts, not sure about shmlog wraps
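[editor's note: a hypothetical rendering of the filter bblack sketches above; the 10s threshold is illustrative, and whether a VSL-tag match like this is expressible in a -q query is an assumption, not something confirmed in the discussion]

```
# flag slow fetches, and try to keep transactions the VSL API warned about
# ("timeout"/"overflow") instead of silently dropping them
# (the 'VSL ~ ...' clause is an assumption about VSL query capabilities)
varnishncsa -n frontend \
  -q 'Timestamp:Fetch[3] > 10. or VSL ~ "timeout"' \
  -F '%{VSL:Timestamp:Fetch[3]}x %r %s'
```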
[15:32:37] there's the VSL field:
[15:32:39] VSL - VSL API warnings and error message
[15:32:40] Warnings and error messages generated by the VSL API while reading the shared memory log.
[15:33:09] which is probably what happens on shmlog wraps
[15:33:23] ^ from vsl(7)
[15:33:24] maybe, I don't know that I've ever witnessed one
[15:33:32] they might be silent :)
[15:33:42] the VSL field is for the actual VSL timeout for sure though
[15:33:56] 15:09 < bblack> - VSL timeout
[15:46:02] 10Domains, 10Traffic, 10Operations: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3800120 (10ema) p:05Triage>03Normal
[15:48:15] 10Traffic, 10Operations, 10Phabricator: Switch on http/2 in phabricator apache - https://phabricator.wikimedia.org/T180998#3800123 (10ema) p:05Triage>03Low
[15:51:05] 10Traffic, 10Operations, 10Phabricator: Switch on http/2 in phabricator apache - https://phabricator.wikimedia.org/T180998#3775802 (10ema) Note that varnish does not speak TLS, so no h2. I'm not sure how stable varnish's h2c support (HTTP/2 without TLS) is.
[16:46:26] I have seen timeout and overflow in VSL :)
[16:46:28] https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L1970-L1989
[16:46:40] documented the options that we use in vk in --^
[16:59:07] 10Traffic, 10Operations, 10Phabricator: Switch on http/2 in phabricator apache - https://phabricator.wikimedia.org/T180998#3775802 (10elukey) Same problem as for gerrit: ``` elukey@phab1001:~$ dpkg --list | grep apache ii apache2 2.4.10-10+deb8u11+wmf1 amd64 Apache...
[17:12:05] 10Traffic, 10Operations, 10Phabricator: Switch on http/2 in phabricator apache - https://phabricator.wikimedia.org/T180998#3775802 (10Dzahn) As Moritz said on a Gerrit patch to enable this for Gerrit/Planet. There are security concerns even with the version in stretch and it probably becomes realistic to do...
[17:31:05] 10Traffic, 10Operations, 10Phabricator: Switch on http/2 in phabricator apache - https://phabricator.wikimedia.org/T180998#3775802 (10demon) Do we even gain much from having Phab speak http2 for an internal connection to Varnish?
[18:14:02] 10Traffic, 10Operations, 10Phabricator: Switch on http/2 in phabricator apache - https://phabricator.wikimedia.org/T180998#3800594 (10faidon) 05Open>03declined >>! In T180998#3800462, @demon wrote: > Do we even gain much from having Phab speak http2 for an internal connection to Varnish? As I mentioned...
[18:28:08] 10Traffic, 10Operations, 10Phabricator: Switch on http/2 in phabricator apache - https://phabricator.wikimedia.org/T180998#3800621 (10demon) >>! In T180998#3800594, @faidon wrote: >>>! In T180998#3800462, @demon wrote: >> Do we even gain much from having Phab speak http2 for an internal connection to Varnish...
[18:58:33] 10Traffic, 10Gerrit, 10Operations, 10Phabricator, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3800663 (10demon) That's good for people on IRC. This task is more about making it clear from the br...
[21:04:01] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3801188 (10Gilles) OK, we have the explanation as to why the 2+ minute request didn't show up in the Varnish slow log. By default the V...
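[editor's note: vsl_space, quoted above, is a varnishd parameter; a sketch for inspecting it, assuming raising it generally means restarting varnishd with a bigger value]

```
# show the current VSL buffer size (and whether changing it needs a restart)
varnishadm param.show vsl_space

# a larger buffer would typically be set at startup, e.g.:
#   varnishd ... -p vsl_space=512M
```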
[21:04:20] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3801189 (10Gilles) a:03Gilles
[21:05:04] 10Traffic, 10Operations, 10Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3786217 (10Gilles)
[21:21:27] 10Traffic, 10Operations, 10Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3801215 (10Samat) Thank you Gilles for your efforts! Am I correct that I should catch a slow loading longer than 60 secs now? Because delay between 20-60 se...
[22:02:32] 10Traffic, 10Gerrit, 10Operations, 10Phabricator, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3801357 (10Dzahn) Yea, we should have both, it was meant to be additional not replacing the other. B...
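[editor's note: for completeness, the tc-based throttling gilles mentions at the top of the log can look roughly like this; the device name and values are illustrative, and note that netem on egress mostly adds latency for a download test — properly rate-limiting the receive direction needs an ingress redirect to an ifb device, omitted here]

```
# simulate a slow, high-latency link (device and values illustrative)
tc qdisc add dev eth0 root netem delay 400ms rate 300kbit

# ... run the slow-download test against the cache ...

# restore normal networking
tc qdisc del dev eth0 root
```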