[00:00:23] is it definitely varnish then? or is there any way this is part of some crazy varnish->rb->parsoid->rb thing? [00:00:35] or even worse, varnish->rb->parsoid->varnish->rb heh [00:01:42] in any case, I can try my own repro with a different kind of logging, give me a few mins to set it all up [00:03:50] bblack: one thing that's changed in RB quite recently is that we've started to emit `content-location` header with all responses to give browser clients a way to figure out they've been redirected. That might be involved here [00:04:20] Pchelolo: I assume google would be hitting through ulsfo (but not certain), do you know which DC your test curl entered from? [00:04:33] (based on X-Cache response header when it eventually gives up) [00:04:44] (the last entry in it) [00:05:05] bblack: it doesn't give up, but x-forwarded-for is 198.73.209.1, 10.128.0.110, 10.128.0.110, 10.128.0.109, ::ffff:10.192.32.115 [00:05:45] heh, that last bit is fascinating :) [00:06:09] but regardless, looks like ulsfo [00:06:16] so, varnish4 [00:07:07] in the meantime I'll try to run same req from prod network and bypass varnish [00:07:39] yeah that's a good idea too, directly to restbase.svc.eqiad.wmnet with different path [00:08:40] started one with 'a=c' query parameter. We'll have results in 4 minutes [00:13:20] bblack: if I bypass varnish the 504 is retruned to curl correctly [00:13:32] s/retruned/returned [00:13:47] s/correctly/in 4 minutes instead of much longer/ ? [00:14:27] bblack: it's not 'much longer', with request to varnish curl never ever stops [00:15:57] how long have you let it go? [00:16:11] (and, is 4 minutes how long it is when direct?) [00:18:17] bblack: curl to varnish was running for like 15 minutes. Then I've killed it with Ctrl+C, but it seems like the log entries from that original request still keep coming. With direct hit Parsoid killed the connection faster, just after 68 seconds with 'socket hungup' [00:18:49] so, your direct request was to restbase.svc.eqiad.wmnet, and it took 68 seconds to fail with 504? [00:19:43] what was the 4-minute figure from, earlier? [00:19:47] it was to RB instance directly. Lemme try with svc [00:20:26] 4 minutes is the default timeout in RESTBase. But Parsoid can kill the request earlier [00:23:16] ok [00:23:54] I have some test stuff running, but it may take a while to get a useful result [00:24:24] I'm going to have to actually wait out curl getting a response from varnish before I see any useful logging on the backends (they don't log the request they make at all until it's complete with some disposition) [00:25:24] at least initially it's pretty mysterious though, as none of the backends have logged any outbound request to RB matching my request, and the curl has been running for over 9 minutes. you'd think if it was retrying, I'd see a complete log from at least 1x failed attempt that was then retried at some layer [00:26:15] one other datapoint, my repro attempt connecting directly to restbase.svc.eqiad.wmnet resulted in "curl: (52) Empty reply from server" (basically a disconnect with no response at all) after exactly 1m59s. [00:27:11] oh I spoke too soon, I have a log now [00:27:29] yep, I've just finished the same test too. But if you connect to restbase1007 directly you'd get the response with 504. [00:28:53] why the diff? isn't restbase.svc just LVS-roundrobin to the same service on restbase100x? [00:30:43] https://phabricator.wikimedia.org/P4387 [00:30:57] is the log I got.
it seems to timeout at the 2-minute boundary every time, and yes, varnish is retrying [00:31:14] it seems to make 4x attempts there, but that's logging at the backend-most varnish [00:31:46] if the same behavior is happening in other layers and the timeout is transitive due to streaming, then it will multiply with the layers (probably 16 minutes later, I'll see another 4x failure on this or another backend, all from the same curl) [00:32:11] err sorry, 8 minutes later [00:32:59] but regardless, that paste is enough to know that something's not as we expect it to be with varnish backend retry behaviors (or timeouts for that matter. ttfb should be low, only between_bytes_timeout should go that long...) [00:35:21] no, it's 10 minutes I think. well, we'll see :) [00:35:58] 504 vs empty reply - it seems like this doesn't depend on LVS vs direct service hit, but instead on whether Parsoid dropped the connection before 2 minutes or after. If before - we get 504, if after - empty reply [00:38:17] ok [00:39:32] I'm at 22+ minutes in on my original curl, and still only the log in that paste, which spent what looks like probably 480 seconds doing its retries, but I think the long entry didn't appear until well after that [00:39:55] there are probably going to be other log entries eventually, from the frontend layer doing the same retry against the backend(s) that generated that log [00:41:50] I was wrong about timeouts too, I was remembering from a different case [00:42:15] timeout for connect() success is 5s everywhere [00:42:46] but timeout for first byte in the case applicable here is 185s for varnish-fe->varnish-be, 180s for varnish-be->varnish-be, and 180s for varnish-be->restbase [00:43:58] but if the actual RB timeout is variable as well, who knows how this will play out or how long it takes for curl to come back. I'd expect a maximum in my test scenario of 48 minutes though, if we always hit the ~180s timeout from varnish first, and multiplied as expected [00:45:17] unless we're dealing with multiple issues here. maybe one of the varnishes has actually "lost" this connection at this point and leaked it heh [00:48:07] my suspicion at this point is that varnish's internal code (as in C, not builtin-VCL) must attempt retries on the standard bereq.retry counter as well, independently of our VCL ones. [00:48:37] bblack: do you need any immediate actions on our side to help investigating? [00:48:42] and thus maybe we need some explicit clause in there to prevent them, something like "if (bereq.restarts > 0) { return (fail); }" [00:48:52] Pchelolo: no, I can repro this from here and we can find and fix it on our end [00:49:34] awesome, thank you. It puts some load on Parsoid, but it should be fine for now. If load keeps growing we would blacklist those bad pages [00:51:13] I think the default max_retries is 4, which jives with the retry count in that log snippet [00:51:52] but the log snippet also doesn't show it re-entering VCL to check any of our logic. From the docs, I would've expected it to enter vcl_backend_error{} and give our VCL a chance to ask for a retry (or not, and we don't in this case). [01:06:39] (FYI rest of this is just me thinking out loud as I go) [01:08:25] bin/varnishd/http1/cache_http1_fetch.c - V1F_FetchRespHdr() - if we get a read() error on first byte (e.g.
due to timeout): htc->doclose = SC_RX_TIMEOUT; return 1; [01:08:54] :q [01:08:56] grr [01:09:53] caller: bin/varnishd/cache/cache_backend.c - vbe_dir_gethdrs() - iff this was a reused backend connection, on "return 1" from above, it does 1x "extrachance" retry internally, because: [01:10:07] /* [01:10:07] * If we recycled a backend connection, there is a finite chance [01:10:10] * that the backend closed it before we got the bereq to it. [01:10:13] * In that case do a single automatic retry if req.body allows. [01:10:15] 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2778480 (10EdErhart-WMF) @BBlack we're working on it now as part of a larger effort to tweak the blog's theme. [01:10:16] */ [01:11:02] (also, that case does VSC_C_main->backend_retry++; which I think is a global stat, not the expected/useful bereq.retries counter) [01:11:23] after the possible "extrachance" retry, it's going to return -1 [01:13:24] caller: bin/varnishd/cache/cache_director.c - VDI_GetHdr() - doesn't seem to do much interesting. in the -1 case, it will reset director state and return -1 itself. [01:14:37] caller: bin/varnishd/cache/cache_fetch.c - vbf_stp_startfetch() - in the -1 case, returns F_STP_ERROR [01:15:54] at this point we're in some indecipherable state machine that uses C token concatenation, making grep difficult :P [01:16:26] ... [01:16:58] that was working outwards from the spot (I should've mentioned at the top) where it emits the "FetchError http first read error: EOF" in the pasted log [01:53:44] my curl's still going, 1h38m in heh [01:53:52] the backend did eventually emit a bunch more logs [01:54:11] it redid the 4xretry pattern a few times, then gave a 413, then did it some more, then returned a 503 [01:54:22] there are probably multiple bugs involved here :/ [01:54:32] the 413 in particular is troubling [01:54:56] as there shouldn't be any request body. it must be hitting that condition from leaking some ever-expanding state across the retried requests? [01:57:35] back to the state machine, it seems like F_STP_ERROR should invoke vbf_stp_error(), which seems to unconditionally set status to 503 and invoke vcl's backend_error(). which is something we're not seeing in the midst of those 4x-retry things [01:59:57] also the 4x-retry shmlog entry likely isn't the end of the retries [02:00:15] notably it ends with: [02:00:15] - VSL timeout [02:00:16] - End synth [02:00:28] I think that's the shmlog code giving on the transaction, not the transaction itself ending [02:00:34] s/giving/giving up/ [02:01:11] the two counter-examples in the midst of this that return 413 and 503 don't have that [02:02:31] hmmm, the 413 seems to have been sent to varnish by restbase. so perhaps unrelated [02:02:34] (to these varnish issues) [02:02:53] - BerespProtocol HTTP/1.1 [02:02:53] - BerespStatus 413 [02:02:53] - BerespReason Request Entity Too Large [02:02:53] - BerespHeader x-powered-by: Express [02:02:59] ^ after the normal 2-minutes-ish timeout [02:03:20] probably just some edge case with whatever code on the RB side is dropping connections at ~2 mins instead of returning an error [02:06:12] the beresp timestamp (missing in between the 4x retries) is emitted in vbf_stp_startfetch() unconditionally shortly after VDI_GetHdr() returns, so I don't think it's returning during those retries...
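(A quick illustration of the multiplicative-timeout arithmetic behind the "48 minutes" worst-case estimate above: if each caching layer makes up to 4 attempts and each attempt burns roughly the full ~180 s first-byte timeout, the attempts multiply across layers. The numbers are the ones quoted in the conversation; the small C program below is just that multiplication, not anything taken from the Varnish code.)

    /* Worst-case wall time when each layer silently re-tries a failing fetch:
     * attempts multiply layer by layer before the client sees anything. */
    #include <stdio.h>

    int main(void)
    {
        const int per_attempt_timeout_s = 180; /* ~first-byte timeout per attempt */
        const int attempts_per_layer    = 4;   /* what the shmlog paste showed */
        const int layers                = 2;   /* e.g. frontend + backend varnish */

        long total_attempts = 1;
        for (int i = 0; i < layers; i++)
            total_attempts *= attempts_per_layer;

        long worst_case_s = total_attempts * per_attempt_timeout_s;
        printf("%ld attempts x %d s = %ld s (~%ld min)\n",
               total_attempts, per_attempt_timeout_s, worst_case_s, worst_case_s / 60);
        return 0;
    }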
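(To make the loop being traced here, and picked back up just below, easier to follow: a small self-contained C toy of the shape described for vbe_dir_gethdrs(). The struct and helper names are invented for the sketch and this is not the Varnish source; the point is that nothing bounds the hidden retries as long as the pool keeps handing out reused connections whose first read then fails, which is exactly what RestBase's ~2-minute drop produces.)

    /* Toy model (invented names, not Varnish source) of the retry shape traced
     * above for vbe_dir_gethdrs() in cache_backend.c. */
    #include <stdio.h>
    #include <stdbool.h>

    struct conn { bool was_reused; };

    /* Stand-in for the connection pool: with many keepalive connections open,
     * it is very common that every connection handed out is a recycled one. */
    static struct conn get_connection(void) { return (struct conn){ .was_reused = true }; }

    /* Stand-in for V1F_FetchRespHdr(): the backend drops the connection before
     * sending a first byte, so the header read fails (the "return 1" above). */
    static int read_resp_hdrs(void) { return 1; }

    int main(void)
    {
        int attempts = 0;
        int extrachance = 1;
        do {
            struct conn c = get_connection();
            if (!c.was_reused)
                extrachance = 0;   /* only a *fresh* connect disarms the retry */
            attempts++;
            if (read_resp_hdrs() == 0)
                break;             /* success: headers go back to the fetch code */
            /* failure on a reused conn: close it, loop, and silently try again
             * without re-entering VCL or bumping bereq.retries */
        } while (extrachance && attempts < 20 /* demo guard; the real loop has none */);

        printf("%d back-to-back fetch attempts, none of them visible to VCL\n", attempts);
        return 0;
    }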
[02:07:09] VDI_GetHdr() is uncomplicated and has no loops, so we're back to where we started at vbe_dir_gethdrs() [02:09:34] yeah, so there's a bug in there for sure [02:10:08] basically, it's capable of retrying a request *indefinitely* within vbe_dir_gethdrs() under the right conditions [02:20:41] "the right conditions" being: socket error on first response read() attempt (due to restbase abruptly closing after its 2m timeout), and the connection was reused (we reuse keepalive connection pools in general, so very commonly it will be a reused connection.) [02:21:00] and when it does that, it just grabs a new connection (which is also likely to be a reused one, I think) and tries again [02:22:20] trying to fix it correctly while preserving their "extrachance" retry behavior seems ugly. the whole extrachance retry in general is a hack. [02:22:28] that's what VCL-level retry is for :P [02:22:38] (and/or tuning your layers' timeouts appropriately to avoid races) [02:26:05] I think that's the primary issue here. there might be others we uncover after fixing that, but step one is probably patch out the extrachance (which is potentially indefinitely-extra-chances) code. [02:32:04] ema: https://phabricator.wikimedia.org/P4388 [02:32:52] ema: can you try patching that into your debian build of our varnish4? We can deploy it and restart varnishes just in codfw cache_text and then I can try to repro all the above again and see what changes [02:32:58] I'm out for the evening [02:33:56] (it's not really in ideal form for upstreaming, it's just the simplest way to do a minimal patch to stop it cleanly) [09:18:23] 10Traffic, 10ArticlePlaceholder, 06Operations, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2778869 (10hoo) Heads up: In {T144592} we decided to index exactly 1,000 placeholders on eowiki. All other placeholders will not b... [09:47:17] 10Traffic, 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 13Patch-For-Review: Make our docker registry public - https://phabricator.wikimedia.org/T150168#2778882 (10Joe) 05Open>03Resolved [10:06:54] 10Traffic, 06Analytics-Kanban, 06Operations: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2778941 (10ema) The cache_text waitinglist issue should be solved by https://gerrit.wikimedia.org/r/#/c/320259/ [10:13:22] confirmed that we haven't been seeing any errors during the past 12/14 hours [10:13:30] thanks! [10:14:21] \o/ [10:56:28] 10Traffic, 10ArticlePlaceholder, 06Operations, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2779055 (10BBlack) Nothing was ever resolved here. 30 minutes seems like an arbitrary number with no formal basis or reasoning, an... [11:01:11] 10Traffic, 10ArticlePlaceholder, 06Operations, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2779075 (10BBlack) I clicked Submit too soon :) Continuing: We'd expect content to be at minimum a day, if not significantly longe... [11:04:46] bblack: VSL:timeout and End:synth can be avoided by raising the -T argument in varnishlog (vk currently uses 1500 seconds).
It is indeed VSL flushing incomplete records after 120 seconds (default) [11:09:09] 10Traffic, 06Analytics-Kanban, 06Operations: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2722084 (10BBlack) The real problem here is a misunderstanding between varnish shm log's semantics and how we're interpreting that in... [11:12:25] elukey: expounding on ^ a bit, I think a datasource (for ops) of log entries vk/hadoop finds are missing critical records and/or hit synthetic timeouts could be sort of like the mysql slow-query log in utility. [11:12:48] it's a place to look to pick out the strange/bad things, but it should be routine that we're logging them and handling them in the logging pipeline [11:16:24] ema: making a bug for the restbase thing in the backlog here [11:18:42] bblack: I agree, but I didn't get fully your point. Let me tell you how I see the whole thing and then we can chat about what is missing :) [11:22:27] bblack: incomplete records in the kafka/hadoop webrequest-logs can be considered, in my opinion, data issues since having partial data screws up all our pageview metrics (and related). I corrected all the issues in vk's configuration while reading the shmlog (like VSL timeouts and overflows), the remaining ones that cause alerts are going to be investigated as corner cases (like this one). We are [11:22:33] going to explicitly log how many "incomplete" records are registered in a table called "webrequest_statistics" soon, so we'll alarm on those rather than on the missing sequence numbers. [11:24:56] We expect to reduce the overall alarms fired but we'll still be able to spot a vk issue reacting quickly [11:27:19] At the moment we spot problems since incomplete records (like the ones missing the datetime field or Timestamp:Resp in VSL terms) are bucketed incorrectly in hadoop causing holes in sequence numbers [11:28:08] so few requests easily trick our checks that a big data loss occurred while it is only a few reqs misplaced [11:28:42] this will end up soon (https://gerrit.wikimedia.org/r/#/c/319582/) [11:29:15] (EOF :) [11:42:49] 10Traffic, 06Operations: Varnish4 is unexpected retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2779196 (10BBlack) [11:43:50] elukey: I guess what I mean is, there's a big difference between "vk might be losing data and/or faulty" and "some legitimate VSL transactions lack a Beresp record" [11:44:56] elukey: lack of records we desire inside transactions shouldn't be causing you to question vk/hadoop/pipeline sanity/faults. they should just mean you received a "legit" (on some level) transaction that lacks key records because varnish never emitted them, which is a cause for looking into varnish badness, not vk/hadoop/pipeline badness [11:45:32] elukey: and if vk's criteria for txn legitimacy on that level is just "I saw Start+End", I think you get past that issue of questioning it [11:45:57] bblack: ah yes agreed, https://gerrit.wikimedia.org/r/#/c/319582/ should help towards this direction [11:47:54] bblack: so at the moment the only record without Timestamp:Resp that I have explicitly blacklisted are all the requests tagged as HttpGarbage by varnish.
Following what you wrote and what we discussed, we might want to log them [11:48:30] but the rest, for the moment, has been completely resolved by VCL patches [11:49:14] yeah we'll have to pare down the list as we go and blacklist ones we know we don't care about based on their pattern (as some are likely induced by client-side faults and random public internet faults, etc) [11:49:49] but probably all of the ones where VSL times out a txn at some unreasonable level and/or we never get a beresp timestamp are worth looking into as potential varnish-level faults [11:50:07] (if we don't know of some reason to blacklist them as client faults) [11:51:32] elukey: but still, you're seeing sequence dropouts in these cases now, right? I think all txns that have start+end (even if they 'end' in a faulty way) should generate a legit seqno and thus erase the question. [11:52:43] 10Traffic, 06Operations: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2779266 (10BBlack) [11:53:47] 10Traffic, 06Operations: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2779196 (10BBlack) [11:54:54] nope we get a valid seqno even for incomplete records since it is a counter in vk.. The main current issue is that some records might get bucketed in the wrong hour causing false positives. Example: data consistency checks for hour 12 has seqno 100->1000 for cp1008, and because of a missing datetime/timestamp:resp another log with seqno 10 is added (that should have belonged to the hour before). [11:55:00] Our data checks will alert that seqnos 11->99 are missing [11:56:03] how does missing timestamps affect seqno? I would've thought they were just serially generated as shmlog entries appear [11:56:26] oh, because it's outstanding all that time [11:56:37] bblack: updating our varnish package with your patch [11:56:59] bucket them by their end-time (which can be synthesized if missing), instead of start-time? [11:57:18] ema: thanks :) [11:57:32] we bucket them for end time (Timestamp:Resp) but we do not synthesized if missing.. [11:57:42] I mean, not in vk [11:57:47] yeah maybe that's the missing bit here [11:58:16] they should be flagged by vk in some way as an error, but they should still have a legit end-time. if there's no resp timestamp, use local time in vk to synth it? [11:59:04] I thought about it, currently we do this trick when we read from kafka and we store to hadoop [12:00:03] or, alternatively: keep the current notional end time (which can be undefined if varnish doesn't send one), but also log a separate timestamp with vk's local time that it received the completed record, on all records, and bucket by that. [12:00:37] either way there's going to be small inconsistencies between vk's local time and the txn times in edge cases, but the latter would be more-reliable if you want buckets to contain sequential seqnos with no edge-cases [12:01:02] yeah [12:02:43] but there is also another important remark, namely that the raw number of incomplete records is usually very small. 
It causes giant "fake" holes in seqnos if misplaced, so removing them from the overall seqno consistency check (and count them separately) might be enough [12:04:21] anyhow, I'll open a phab task to investigate this feature [12:04:42] it shouldn't be super difficult to add (last famous words) [12:05:35] ok :) [12:05:39] on a semi-related note, https://gerrit.wikimedia.org/r/#/c/314662/ is the last change to add -Wextra to vk's makefile [12:06:09] :) [12:07:38] yeah I keep putting that off because I know I have to actually think to read through it :) [12:09:07] (well I guess I have to actually-think a lot, but sort out what happens with a type change in existing C code is a whole different level of actually-thinking) [12:09:58] no rush, thanks! This is why I am waiting for more feedbacks, I fear that I am taking type changes to lightly [12:10:27] ha, we also need to comment out the declaration of extrachance [12:13:49] ema: then you'd have to comment out the =0 of it as well? [12:14:00] ema: what did the compiler actually say? unused var? [12:14:44] because that could be an interesting hint, if it's optimizing out the clause where it's set to zero in that loop [12:15:14] bblack: var set but not used [12:15:20] cache/cache_backend.c:180:9: error: variable 'extrachance' set but not used [-Werror=unused-but-set-variable] int i, extrachance = 1; [12:15:26] oh set-but-not-used [12:15:32] also, Werror is the devil :P [12:15:34] but whatever [12:15:37] :) [12:15:52] so yeah you'll have to comment out the VBC_STOLEN part too [12:15:58] err VBC_STATE_STOLEN [12:16:08] right, on it [12:21:19] ok now it compiles [12:26:11] and jenkins agrees :) [12:26:16] https://integration.wikimedia.org/ci/job/debian-glue-non-voting/366/console [12:26:23] (that's really cool btw hashar) [12:28:47] merged, building on copper [12:28:49] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2779342 (10BBlack) >>! In T93927#2717425, @BBlack wrote: > The downsides to switching to the internal stapling code: > > 1. It still d... 
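(Back on the vk bucketing idea discussed above around 11:57-12:01 -- bucket each record by its end time, synthesize one from vk's local clock when varnish never emitted Timestamp:Resp, and flag the record instead of letting it punch a hole in the seqno check -- here is a minimal C sketch of that fallback. The struct and function names are invented for illustration; this is not varnishkafka's actual code.)

    /* Sketch of the discussed fallback: use Timestamp:Resp as the bucketing time
     * when present, otherwise synthesize an end time locally and mark the record
     * incomplete so it can be counted separately (e.g. in webrequest_statistics). */
    #include <stdio.h>
    #include <time.h>
    #include <stdbool.h>

    struct webrequest {
        double ts_resp;     /* Timestamp:Resp from VSL; 0.0 if varnish never sent it */
        bool   incomplete;  /* flagged so it is excluded from the seqno-hole check */
    };

    static double bucket_time(struct webrequest *r)
    {
        if (r->ts_resp > 0.0)
            return r->ts_resp;            /* normal case: varnish's own end time */
        r->incomplete = true;             /* no Resp timestamp: synthesize locally */
        return (double)time(NULL);
    }

    int main(void)
    {
        struct webrequest broken = { .ts_resp = 0.0, .incomplete = false };
        printf("bucketed at %.0f, incomplete=%d\n",
               bucket_time(&broken), (int)broken.incomplete);
        return 0;
    }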
[12:29:06] ema: o/ [12:29:29] ema: if the repo pass that debian-glue job just fine, we can make it to vote -1 on failure :} [12:29:53] hashar: seems to be working fine to me yeah [12:29:55] right now it is always a success [12:31:11] on patchset 3 (FTBFS) it said FAILURE, on PS4 (building fine) SUCCESS [12:31:20] yeah [12:31:27] but always vote Verified+1 in Gerrit [12:31:54] right [12:32:11] it's election day, let's make it vote too [12:34:52] though I guess it will fail on the 'master' branch [12:36:47] https://gerrit.wikimedia.org/r/320378 is the magic switch [12:38:06] yeah it doesn't make much sense to try building the package when merging to master, our work is all in debian-wmf [12:38:52] added you two as reviewers for in, I am deploying it [12:43:46] bblack: 4.1.3-1wm3 uploaded to carbon, upgrade worked fine on cp1008 [12:45:04] ema: \o/ [12:45:58] important to remember -> https://wikitech.wikimedia.org/wiki/Varnish#Upgrading_to_a_new_minor_Varnish_release [12:46:19] 10Traffic, 10netops, 06DC-Ops, 06Operations, and 2 others: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#2779419 (10faidon) [12:46:22] 10netops, 06DC-Ops, 06Operations, 10ops-eqiad: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#2779417 (10faidon) 05stalled>03declined asw-d-eqiad is on its way to being decom'ed (or upgraded and repurposed) and replaced by asw2-d-eqiad, a QFX5100/EX4300 stack. The servers that i... [12:46:24] (our package overwrites /lib/systemd/system/varnish.service so we need to run puppet immediately after the package upgrade) [12:48:12] does our package restart daemons on upgrade? I don't remember [12:48:39] bblack: nope [12:48:46] I guess either way: depool; package upgrade; puppet; restart daemons; repool ? [12:49:11] I think puppet starts the daemons if they're not running? [12:49:14] oh I see, there's a better sequence in your link [12:49:30] I didn't have "stop before upgrade" :) [12:50:05] ema: cp1008/pu uses itself as a backend now right? [12:50:24] correct, haven't restarted the daemons there though [12:50:39] doing that now [12:50:47] well puppet's still dead there I think? [12:50:52] very much so [12:51:19] you can enable + run it and whatever. it will fail nginx stuff, but can ignore it and then disable + fixup nginx after [12:51:30] ok [12:51:47] 10Traffic, 10netops, 06Operations: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1005 - https://phabricator.wikimedia.org/T150256#2779434 (10faidon) [12:51:56] 10netops, 06Operations, 13Patch-For-Review: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2779450 (10mark) 05Open>03Resolved [12:52:01] bblack: ^ jftr [12:52:12] and then I can test the repro on cp1008 anyways. it was probably a better place than codfw to do so anyways, it will still hit RB the same way [12:52:24] not urgent, don't let me interrupt you or anything [12:55:08] I guess nginx didn't fail on puppet bits, it's just failing to staple both certs under the old config, which is fine for now [12:55:19] bblack: nginx seems fine puppet-wise yeah [12:55:55] ok, will test there with the RB timeout thing [12:56:02] nice [12:56:22] 10netops, 06Operations: cr1-eqiad:ae4 is disabled due to VRRP issue - https://phabricator.wikimedia.org/T149226#2779462 (10faidon) 05Open>03Resolved a:03faidon This VRRP issue was the case before the cr1/2-eqiad upgrade as well, so this was likely due to some asw-d-eqiad snafu (not propagating VRRP multi... 
[12:56:46] 07HTTPS, 10Traffic, 10MediaWiki-Authentication-and-authorization, 10MediaWiki-User-login-and-signup, 06Operations: After login, user not logged in when "prefershttps" set to false and "wgSecureLogin" set to true - https://phabricator.wikimedia.org/T149977#2779467 (10BBlack) [13:04:54] - Timestamp Resp: 1478610240.194538 240.007117 0.000094 [13:04:58] ^ works as expected [13:05:21] backend fails with a 503 after 2 minutes. frontend does the retry-503-once logic and does it all over again, for a total 503-to-user response time of 4 minutes. [13:05:56] not ideal, but now we're into the realm of tuning timeouts and/or fixing RB's lack of output, or parsoid's long hang [13:06:24] instead of the realm of "varnish gets stuck internally infinitely retrying this stupid request against the backend and never returning to the client" [13:07:51] the two times the backend process queried RB, the results were" [13:07:52] - RespStatus 503 [13:08:04] - Timestamp Resp: 1478610120.190677 120.002858 0.000037 [13:08:30] interestingly, that's on the -c side of the backend process. the -b side of the backend process never logged anything heh [13:09:08] on the -b side of the frontend, we see 2x requests as well with the expected intentional retry: [13:09:11] - Timestamp Bereq: 1478610000.187740 0.000154 0.000154 [13:09:14] - Timestamp Beresp: 1478610120.190730 120.003144 120.002990 [13:09:16] - BerespProtocol HTTP/1.1 [13:09:18] - VCL_call BACKEND_RESPONSE [13:09:18] - VCL_return retry [13:09:18] - BackendClose 72 74975fde-eb2f-45b2-9391-cc7731567691.be_cp1008_wikimedia_org [13:09:21] - BerespStatus 503 [13:09:23] - Timestamp Retry: 1478610120.190779 120.003193 0.000049 [13:09:26] - Link bereq 551044 retry [13:09:30] * << BeReq >> 551044 [13:09:30] - Begin bereq 686688 retry [13:09:37] - Timestamp Bereq: 1478610120.190888 120.003302 0.000109 [13:09:37] - Timestamp Beresp: 1478610240.194227 240.006641 120.003339 [13:09:37] - BerespProtocol HTTP/1.1 [13:09:37] - BerespStatus 503 [13:09:53] - VCL_call BACKEND_RESPONSE [13:09:54] - TTL VCL -1 3600 0 1478610240 [13:09:54] - VCL_return deliver [13:10:01] - Timestamp BerespBody: 1478610240.194471 240.006885 0.000244 [13:10:38] the -c side of frontend only logged once, after the whole 240 seconds was up, indicating 503-to-user [13:13:30] 10Traffic, 06Operations, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2779504 (10BBlack) Pasting from IRC after testing the patch on cp1008/pinkunicorn (TL;DR - patch works as expected, bad retries gone): ``` 13:04 < bblack>... [13:13:42] so [13:13:51] it's an effective mitigation against this particular problem [13:14:21] the other end of this, though, is "extrachance" is there for a reason, and eliding it will likely cause some other corner-case subtle failures. [13:14:58] with basic break->fix validation out of the way, maybe time to look at a more robust/correct fixup before going further on this [13:15:22] * If we recycled a backend connection, there is a finite chance [13:15:25] * that the backend closed it before we got the bereq to it. [13:15:27] * In that case do a single automatic retry if req.body allows. [13:15:30] this seems to be the reason [13:16:48] yeah [13:17:00] it makes sense on some level [13:17:26] varnish is holding open keepalive connections to backends. at some point the backend may close us due to overall-time, timeout-between-requests, etc. 
if it happens to start sending the close over the wire while we're sending a new request, we get a conn fail right at about that time in varnish's code logic, which isn't really a legitimate failure, it's something to be retried. [13:18:43] the counter-argument is that you could bin this up with other failures that are less-predictable, too. [13:19:03] sometimes LVS re-routing around a dead appserver is going to rst a connection at odd times. IP networks are fallible, blah blah. [13:19:54] and honestly it should be up to our high-level retry logic to route around those when they're rare (our frontend 503-retry), and up to administrators to configure timeouts to avoid those edge-cases (that varnish's outbound connection timeout should be shorter than the backend's inbound one or whatever). [13:20:18] I don't know, I haven't thought through that whole tradeoff well enough yet [13:22:53] moving on, let's assume it's a legitimate idea that varnish needs to internally-retry in such a case: [13:23:09] even so, the current logic is something like this: [13:25:17] extrachance = 1; do { conn = get_conn(); if (!conn->is_reused) { extrachance = 0; } conn->send_headers(); if(conn->get_respheaders() == success) { break loop and move on normally } } while(extrachance); [13:25:52] when conn->getrespheaders() fails, the existing connection dies. [13:26:16] but when we loop back to the top, conn=get_conn() can get yet another reused connection from the pool of many reusable keepalive connections [13:26:39] (in fact, it would be very common for all of them to be reused, especially if this is a rare failure among many concurrent successes flowing through varnish) [13:27:16] but in every case the same fundamental logic applies. if we retry in this loop 50 times in a row, it could be because every one of those 50 times, we picked up a reusable keepalive connection that was just about to die at the wrong time. [13:27:38] there has to be a better way [13:39:39] the only two defenses I can think of (against the bad cases, while keeping some notion of a "good" extrachance retry) would be: [13:40:18] 1. Only do it once, period. We defend against this race the first time it happens for a given request, but if the next attempt does the same thing we let the error bubble up even if it was a reused connection. [13:41:06] 2. Actually check the timing. Defend against the race when the time from sending the req -> getting the failure is small, but not when it's large (it's 2 minutes later in this case we're seeing) [13:42:37] 3. Put some more complex thresholds on things like the age or idleness of the reused connection. e.g. if it's only been open for 10 seconds and it's only been idle for 2, there's no reasonable point considering the racing-close case. [13:43:00] 2 and 3 require tunables and magic though. they reduce the surface of the problem but don't eliminate it. [13:43:51] and 1 is problematic on its own as well. in the edge-case like this RB behavior, I think it might still cause layer-multiplicative retries. [13:44:22] hmmm maybe not [13:45:06] I guess it would cause 1x internal retry at whichever layer failed (likely varnish->app, but also possibly varnish->varnish on timeout boundaries), but that singular retry would either succeed or generate an explicit 503 [13:45:30] so only our outer 503-retry will multiply it (so our total applayer requests in a case like this could be 4x instead of the expected 2x) [13:46:27] (and even that, only when the backend-most varnish has this issue.
if varnish<->varnish had a legitimate close-race, it wouldn't forward the first failed request down to the applayer) [13:48:04] or we can rewind to the top of this wall of text and say it's not legitimate for varnish to ever do an uncontrolled internal retry, and that from varnish internals' perspective we should accept the initial close-race failure as a possibility and expose it as an error, and let VCL retry logic handle it if it wants to, just like other random failures. [13:48:21] (which our 503-retry in the frontends would cover, assuming it's rare as it should be) [13:49:44] it's a design question I guess, and I don't know where upstream's viewpoint on all related things really is [13:52:34] in the interest of being upstream-friendly, perhaps we could make this configurable as a patch? [13:53:06] not that varnish needs more config tunables heh [13:54:08] it could be a signed integer, where -1 is current behavior of unlimited retries in this case, 0 disables it completely, 1+ sets a cap on these retries. [13:54:49] it's at least a path forward for upstream, and then maybe in a future release they decide to change the default or simply get rid of the possibility of unlimited internal retries. [13:56:55] and leave the "extrachance" variable as it is, but make that while conditional be "} while (extrachance && configured_chances-- > 0)" [13:57:02] yeah the tunable seems like a good idea [13:57:16] also the comment there is a bit misleading (single automatic retry) [13:57:23] (where configured_chances is a local copy of the tuned value) [13:57:45] yeah I suspect the author of that comment didn't understand the potential fallout there [13:59:25] maybe best path forward for now is just keep our local hack (I'm ok with it, I think that behavior is what we want here locally in WMF deploy), and file an upstream bug about it and see what they say [13:59:45] they may decide that limiting it to a fixed singular retry is ok [14:00:07] (or kill it entirely, or config for 0-vs-1, or the full config option. who knows) [14:02:15] which "corner-case subtle failures" were you thinking of earlier on? [14:02:31] with the current approach of disabling extrachance [14:03:35] well, we can have the legitimate close-race on a reused connection that the comment speaks of, which is a false-failure (shouldn't have failed, will probably work fine on retry on a different conn) [14:04:08] but we've got our single 503-retry for that, right? [14:04:11] and technically, that *can* legitimately happen more than once in a row (e.g. all 10 connections currently open to a backend were opened about the same time, and all timeout at about the same time) [14:04:21] so it's valid on a per-attempt basis in some sense [14:04:44] but I think in practice it's enough of an edge case that the singular retry should cover most of them [14:05:48] the horrible worst-case would be a backend which has an overall-wallclock timeout on inbound connections (e.g. "if open >5 minutes, just close it regardless of idle time"), and it only checks the timer when a new request arrives, instead of asynchronously...
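(A toy C sketch of the knob floated above at 13:54-13:57 -- not the actual P4388 patch and not an existing Varnish parameter: keep the "extrachance" idea but cap it with a signed setting, where -1 keeps today's unbounded behavior, 0 disables the hidden retry entirely, and N allows at most N of them before the failure bubbles up to VCL / the frontend 503-retry. The parameter name and wiring are made up for the demo.)

    /* Demonstrates how a signed extrachance cap would behave; the "backend"
     * below always fails on a reused connection, mirroring the RestBase
     * 2-minute drop described earlier. */
    #include <stdio.h>

    static int extrachance_limit = 1;               /* hypothetical tunable */

    static int attempt_fetch(void) { return -1; }   /* every attempt fails */

    static int fetch_with_cap(int *attempts)
    {
        int chances = extrachance_limit;    /* local copy, consumed per retry */
        int extrachance = 1;                /* the real code clears this on a fresh
                                             * (non-reused) connect; this toy pool
                                             * only ever hands out reused ones */
        do {
            (*attempts)++;
            if (attempt_fetch() == 0)
                return 0;                   /* got response headers, carry on */
        } while (extrachance &&
                 (extrachance_limit < 0 || chances-- > 0) &&
                 *attempts < 100 /* safety stop so the -1 demo terminates */);
        return -1;                          /* surfaces as a 503 for VCL to handle */
    }

    int main(void)
    {
        const int limits[] = { 0, 1, 4, -1 };
        for (int i = 0; i < 4; i++) {
            int attempts = 0;
            extrachance_limit = limits[i];
            fetch_with_cap(&attempts);
            printf("limit=%2d -> %3d attempt(s)\n", limits[i], attempts);
        }
        return 0;
    }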
[14:06:40] timeout tuning can fix a lot of this though, and our timeout tuning is probably deficient [14:06:49] I know it is for varnish-varnish because it's fixed instead of layer-additive :) [14:07:22] and at the app boundary that should be part of our vaporware contract: that the applayer idle timeout needs to be in a certain range [14:08:09] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2767284 (10Gilles) >>! In T149873#2767335, @aaron wrote: > The first approach might work using Varnish... [14:17:04] 10netops, 10Monitoring, 06Operations: Icinga check for VRRP - https://phabricator.wikimedia.org/T150264#2779751 (10faidon) [14:21:41] https://github.com/xcir/python-varnishapi/commit/c2267240ea13de8239d0d4315c1fda7319ff8e58#diff-88b99bb28683bd5b7e3a204826ead112R406 [14:21:44] ;) [14:22:43] :) [14:52:32] ok my stack of things I'm actively working on and trying to remember things about just crossed some wtf threshold :P [14:53:23] ema: let's upgrade codfw varnish to the new package at least. the backend-most matters the most in this case anyways, and will paper over RB's specific issue here. [14:53:46] shouldn't need all that much timing, but I guess wait for the mess in -ops to settle down first so we don't add to the confusion [14:53:55] ack [14:55:55] and after that, barring any new issues cropping up we're good to start esams [14:56:15] and/or upgrade ulsfo, and/or start eqiad (they're all relatively-async) [14:56:27] and when all those things are done, we can start fixing up cache routing back through eqiad [14:57:23] (I'm guessing all of these things will bleed over into tomorrow sometime before we're past it all) [16:35:52] meh I forgot to check whether puppet was already running while upgrading cp2013 [17:19:33] upgrade done [17:20:07] yay [17:24:07] so if no new issues pop up tomorrow I'll upgrade ulsfo to wm3 packages and then carry on with esams/eqiad [17:24:23] ema: sounds awesome :) [17:27:02] I'm testing the funky RB timeout via ulsfo now too, to confirm that fixing codfw alone mitigates the worst of it (in this sort of case, anyways) [17:37:38] < HTTP/1.1 503 Backend fetch failed [17:37:43] real 4m0.397s [17:37:59] ^ so yeah, in this RB case in particular, fixing it in codfw (which fixes the varnish->app barrier) fixes it for all [17:42:06] 10Traffic, 06Operations, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2780189 (10BBlack) I've confirmed with codfw upgraded (the DC that's currently facing RestBase directly), this is fixed up even from ulsfo clients' perspec... [17:51:29] bblack, ema: do you have any plan to re-image cache servers soon? [17:52:03] volans: why?
:) [17:53:19] I worked with j*oe and we got https://gerrit.wikimedia.org/r/#/c/320367/ merged, but to merge the change for wmf-auto-reimage too we first need to create a new deb package and update conftool around, and given that there is some other WIP on it [17:53:48] if it's not urgent it might get decided to wait the other WIP stuff to get merged and tested before releasing a new version [17:54:15] not urgent [17:54:35] we already reimaged a few critical ones recently, there's no other upcoming plans to reimage more of them soon [17:55:39] ok, I'll let you know when the auto-reimage will be multi-service capable, ping me if you need it earlier ;) [17:58:40] ok thanks :) [18:00:00] 10Traffic, 06Operations, 13Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2780213 (10Danielsberger) Here's a graph of the hit ratio for various cache sizes. {F4707605} It seems to me that this tells us: investing in a better admission policy pays of... [18:07:05] 10Traffic, 06Operations, 13Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2780263 (10BBlack) @Danielsberger - Fascinating stuff, thanks so much for running all of this data! We're kind of swamped in various things right now (varnish4 transition and v... [20:03:58] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2780700 (10GWicke) > It seems to me like the goal is to simplify what's currently available, with no plan to add new features The design is providing fle... [20:22:35] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2780822 (10Gilles) No examples of what rotation, cropping or effects would look like are provided. No complex examples of current capabilities are provid... [20:38:59] 10Traffic, 06Operations, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2780894 (10GWicke) FYI, we are also bumping up the HTTP socket timeout in hyperswitch from 2 to 6 minutes: https://github.com/wikimedia/hyperswitch/pull/70... [21:18:52] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781024 (10GWicke) @gilles, there are several examples illustrating page and media thumbnails, as well as the (orthogonal) selection of size and thumb typ... [21:31:16] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781089 (10Gilles) There is no example URL of the orthogonal case. Bandwidth is mentioned, but no example of its use is provided and even the path spec co... [22:05:16] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781285 (10GWicke) > the description mentions the response containing JSON This is about other APIs, such as pageinfo & other references to thumbnails pr... 
[22:57:18] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2781465 (10ggellerman) p:05Triage>03Normal [22:57:25] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781464 (10Anomie) >>! In T66214#2781285, @GWicke wrote: > 2) Validate parameters in the backing service strictly, so that each unique thumb can only be a... [23:10:43] bblack: operations/software/nginx now has the Jenkins job that builds .deb package magically. Giving it a try on https://gerrit.wikimedia.org/r/#/c/319776/ [23:11:01] not sure how helpful it will be :] Build against whatever distro is in debian/changelog latest entry [23:11:49] 00:00:12.415 pristine-tar: no pristine-tar branch found, use "pristine-tar commit" first [23:11:49] 00:00:12.418 gbp:error: Couldn't checkout "nginx_1.11.4.orig.tar.gz": it exited with 1 [23:11:54] stupid gbp :D [23:13:35] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781476 (10GWicke) @anomie, adding parameters without changing the semantics of existing parameters won't break any existing clients, as none of those wou... [23:36:45] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781560 (10Anomie) Say we didn't already have a parameter to select the page of a PDF. Then we add the parameter. Either the parameter needs to be optiona...