[06:44:35] morning! We received data consistency check alarms again, but from what I can see the number of empty timestamps is much less than yesterday
[06:45:00] so we'll need to tune vk's -T/-L again
[06:46:05] I suspect that raising the maximum amount of incomplete records to 5K gave us some relief but opened up the possibility of new transactions going into timeout waiting for an End tag
[06:46:35] (that were covered before)
[08:10:16] elukey: morning!
[08:11:45] o/
[08:28:26] 10Traffic, 06Operations, 10media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593821 (10MoritzMuehlenhoff)
[08:28:40] 10Traffic, 06Operations, 10media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593559 (10fgiunchedi) the fact that ulsfo fails but not the others might be related to varnish 4, #traffic recently switched ulsfo cache_misc
[08:41:36] 10Traffic, 06Operations, 10media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593559 (10ema) Yes, we finished upgrading cache_upload in ulsfo to Varnish 4 yesterday: T131502. I've banned the specific image from the frontends in ulsfo and I now get the right C...
[08:43:04] 07HTTPS, 10Traffic, 10Icinga, 10Monitoring, 06Operations: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#2593846 (10fgiunchedi)
[08:56:24] 10Traffic, 06Operations, 10media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593871 (10ema) p:05Triage>03High
[08:59:16] ema: whenever you have time https://gerrit.wikimedia.org/r/#/c/307483/
[08:59:26] I don't like this approach a lot because it is not data driven
[08:59:36] I put 1500 just as a quick guess
[08:59:50] and afaiu it shouldn't cause any issue
[09:00:20] the only one that I can think of is that vk does loop through all the incomplete transactions periodically to check timeouts etc..
[09:00:35] so the longer the list, the worse the performance
[09:00:41] but it is kind of unavoidable
[09:01:30] the other option would be to force varnish threads to flush data more often to the shm log playing with their buffer
[09:01:41] but I think it will be a more intrusive solution
[09:01:54] (it might benefit all the shm log consumers though)
[09:02:12] anyhow, let me know your thoughts :)
[09:11:53] 10Traffic, 06Operations, 10media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593892 (10ema) Unfortunately quite a few requests on all ulsfo upload frontends are affected, as confirmed with: varnishlog -n frontend -q 'RespHeader ~ "Content-Length: 0" and Re...
[09:23:52] 10netops, 06Operations: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2593958 (10MoritzMuehlenhoff)
[09:28:55] 10netops, 06Operations: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2593965 (10Nurg) Connection is ok just at the moment. So here is a tracert when all is going well. 2 11 ms 11 ms 10 ms default-rdns.callplus.co.nz [101.98.0.131] 3 12 ms 12 ms 11 ms...
[09:36:14] 10Traffic, 06Operations, 10media-storage: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2593985 (10ema) >>! In T144257#2593892, @ema wrote: > Also, ulsfo upload backends don't seem to be affected. A rolling restart of the frontends in ulsfo is probably the easiest way t...
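For reference, vk's -T and -L mirror the standard VSL cursor options, so their effect can be illustrated with varnishlog, which reads the shared-memory log through the same API (a sketch with placeholder values, assuming the Varnish 4 varnishkafka port keeps the stock option semantics; these are not the values proposed in the gerrit change):

    # -L: upper limit of incomplete transactions kept before the oldest is force-completed
    # -T: seconds allowed between a transaction's Begin and End tags before it is timed out
    varnishlog -n frontend -g request -L 5000 -T 700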
[09:54:46] 10netops, 06Operations: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2593896 (10mark) I just did a reverse traceroute and I'm also not seeing problems at the moment...
[11:08:05] 10netops, 06Operations: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2594075 (10Nurg) No further problems for a while now, so it may have come right. Thanks everyone.
[11:11:31] ema: so how does the zero-byte thing look so far?
[11:12:03] seems eerily familiar, we faced something similar that led to a bugfix before
[11:12:57] also, is it possible that these are the same reqs the v3 frontends were reporting as broken chunked transfers and turning into 503s on miss, but a v4 frontend is instead accepting them and giving them CL:0?
[11:16:16] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2594077 (10SindyM3) Please help me: Several instances of abuse on the Wordpress installation on domain 'wikilovesmonuments.org' (server 'schippers.wikimedia....
[11:30:59] ema: it looks like there's a lot of CL:0 responses
[11:31:17] they don't tend to be cached, though, I don't think. they tend to be miss,miss,miss,miss?
[11:31:43] now this has me questioning whether it's varnish or thumbnailing
[11:35:01] well, except that I can't seem to repro in eqiad
[11:36:56] ok not all are missx4, some are more like: X-Cache-Int: cp1073 hit/2, cp2011 hit/1, cp4005 miss, cp4005 miss
[11:38:05] in the bulk of the cases, you can see that the backend->frontend response had CL:0, so it was the v4 backend (presumably) that already made the mistake.
[11:38:36] however, I've also seen cases where the ulsfo backend response had CL, and it was the frontend that messed it up
[11:39:05] in one normal-looking request for: /wikipedia/commons/1/12/Extensi%C3%B3_del_valenci%C3%A0_al_Pa%C3%ADs_Valenci%C3%A0.svg
[11:40:07] it was a frontend miss, and the be->fe response (all in ulsfo) looked like this in the relevant fields:
[11:40:09] - RespHeader Content-Length: 236662
[11:40:12] - RespHeader X-Cache-Int: cp1072 hit/1, cp2020 miss, cp4005 miss
[11:40:15] - RespHeader Age: 22911
[11:40:44] then you see it go through our final deliver-time processing, mangling various headers
[11:41:10] and then the final several lines of the record go like this:
[11:41:16] - VCL_return deliver
[11:41:16] - Timestamp Process: 1472556928.483524 0.160644 0.000040
[11:41:16] - RespHeader Accept-Ranges: bytes
[11:41:16] - RespUnset Content-Length: 236662
[11:41:16] - RespHeader Content-Length: 0
[11:41:18] - Debug "RES_MODE 2"
[11:41:21] - RespHeader Connection: close
[11:41:23] - Timestamp Resp: 1472556928.483550 0.160670 0.000026
[11:41:26] - ReqAcct 524 0 524 943 0 943
[11:41:28] - End
[11:42:20] I'm gonna guess that probably something similar to that is happening in the backend, in the cases where the backend response is already CL:0 when the frontend first gets it
[11:43:04] the whole post-"deliver" reset of CL->0 seems familiar, I feel like we already debugged this on maps/misc and it was fixed upstream?
[11:45:12] post"
[11:45:36] post-"deliver" CL reset might be related to gzip/gunzip?
would make sense given I caught it on an svg, which is gzippable
[11:47:00] there are jpegs doing the same thing, though
[11:53:37] bblack@cp4005:~$ varnishlog -c -g request -n frontend -q 'RespStatus == 200 and RespUnset ~ "Content-Length" and RespHeader ~ "Content-Length: 0" and ReqHeader !~ "Host: varnishcheck"'
[11:54:20] ema: ^ is the logging I'm looking at. the rate of them looks pretty bad, but I hate to take away our only organic test without a reliable way to repro yet.
[11:55:00] if we can figure it out with ulsfo depooled in gdnsd, do that. otherwise I'd say we roll most of them back to v3 and see if we can still observe it on the last one.
[12:00:18] well that last varnishlog line is to catch the ones where the frontend is re-setting CL. without the RespUnset match in there it will show the more-common ones with CL:0 from the backend.
[12:09:20] bblack: I haven't managed to get any of those from the backend
[12:09:26] as in varnish-be
[12:09:34] have you?
[12:15:39] bblack: also add ReqMethod ~ GET, otherwise that matches OPTIONS requests too
[12:15:44] they're not that many, but still
[12:19:34] yeah I don't see them logging on the backend
[12:19:52] but I do see them logging on the frontend at a pretty decent pace, and they indicate a backend sending CL:0 commonly
[12:21:51] did you do something?
[12:22:13] nope
[12:22:45] nevermind, I screwed up a query filter
[12:24:02] the ones that re-set CL in the frontend itself are more interesting even if they're a smaller fraction of them
[12:24:09] since they're a little purer as to the cause
[12:24:17] right
[12:26:10] but now those seem to have gone away. I was getting quite a decent rate just ~30m ago...
[12:26:51] I'm still seeing the ones from the backend setting CL:0 though
[12:27:35] (also my varnishcheck filter above is not right, it really should be a positive match on upload, or a negative lookahead assertion in a positive match)
[12:28:45] I was wondering whether this could be related to range request handling somehow
[12:31:17] well, the few I catch now are range-not-satisfiable
[12:31:22] but the ones I caught before weren't
[12:31:30] (for frontend-reset ones)
[12:31:54] so the ones you're catching now are 416s that become 200s?
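Folding the refinements from this exchange into one place, a sketch of the two captures (ReqMethod to skip OPTIONS, a positive host match instead of the varnishcheck exclusion; the backend varnishd is assumed to be the unnamed instance on these hosts):

    # frontend re-setting a non-zero CL to 0 at deliver time (the rarer, "purer" cases)
    varnishlog -c -g request -n frontend -q \
      'ReqMethod eq "GET" and ReqHeader:Host ~ "upload" and RespStatus == 200 and RespUnset ~ "Content-Length" and RespHeader ~ "Content-Length: 0"'

    # the more common case, CL:0 already present in what the backend hands up:
    # same query minus the RespUnset clause, run against the backend instance
    varnishlog -c -g request -q \
      'ReqMethod eq "GET" and ReqHeader:Host ~ "upload" and RespStatus == 200 and RespHeader ~ "Content-Length: 0"'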
[12:32:52] well
[12:33:02] something odd there anyways, but probably unrelated
[12:33:19] the one I just caught looks like:
[12:33:20] - ReqMethod GET
[12:33:20] - ReqURL /wikipedia/commons/b/b1/Polaris.ogg
[12:33:32] - ReqHeader Range: bytes: 0-65536
[12:33:41] - VCL_call HIT
[12:33:41] - ReqHeader X-CDIS: hit
[12:33:41] - VCL_return deliver
[12:33:53] - RespProtocol HTTP/1.1
[12:33:53] - RespStatus 200
[12:33:53] - RespReason OK
[12:33:53] - RespHeader X-Object-Meta-Sha1base36: t4t0sa8cegnikeatlf62qvcrs4x5wo1
[12:34:03] - RespHeader Content-Length: 2365995
[12:34:15] - RespHeader Age: 72300
[12:34:30] and then post-deliver, we get:
[12:34:43] - RespHeader Accept-Ranges: bytes
[12:34:43] - Debug "RANGE_FAIL Not Bytes"
[12:34:43] - RespHeader Content-Range: bytes */2365995
[12:34:43] - RespProtocol HTTP/1.1
[12:34:43] - RespStatus 416
[12:34:45] - RespReason Requested Range Not Satisfiable
[12:34:48] - RespReason Requested Range Not Satisfiable
[12:34:50] - RespUnset Content-Length: 2365995
[12:34:53] - RespHeader Content-Length: 0
[12:34:55] - Debug "RES_MODE 2"
[12:34:58] - RespHeader Connection: close
[12:35:00] - Timestamp Resp: 1472560124.317717 0.000170 0.000038
[12:35:03] - ReqAcct 531 0 531 1023 0 1023
[12:35:18] it seems like it's rejecting a request for range 0-64K on a ~2.3MB file that was a cache hit? or is there something malformed about that Range: header?
[12:35:57] shouldn't it be: bytes=0-65536?
[12:36:15] if (strncasecmp(r, "bytes=", 6))
[12:36:15] return ("Not Bytes");
[12:36:27] yeah, so I guess that's a malformed range request
[12:36:57] (it's one we could trivially fix up on recv, though, if that's some common client compat issue)
[12:37:04] yep
[12:39:30] I've sampled 1 minute of requests on cp400[5-7], we get roughly 20 CL:0 out of ~85k requests
[12:40:40] yeah but 85k reqs doesn't take long either, that's still somewhere on the order of 1/s?
[12:42:18] https://gerrit.wikimedia.org/r/#/c/307496/ for the broken range fixup
[12:45:07] bblack: lgtm
[12:49:40] bblack: so yeah, it's happening quite frequently. Should we get a big raw varnishlog capture, depool ulsfo and try to investigate further?
[12:50:21] well
[12:50:31] can we reproduce it at all, spamming similar requests for similar URLs?
[12:50:51] I don't know that we'll figure it out on our own from just a log, although maybe upstream can
[12:53:50] I can repro cache hits on them
[12:54:02] but it's already cached-bad in that case
[12:54:11] e.g.:
[12:54:15] bblack@cp4005:~$ curl -v https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Osaka_Castle_Nishinomaru_Garden_April_2005.JPG/300px-Osaka_Castle_Nishinomaru_Garden_April_2005.JPG --resolve upload.wikimedia.org:443:127.0.0.1
[12:54:32] yes I could repro cache hits as well earlier today
[12:56:58] basically it doesn't seem like, in the other ticket, upstream is good at accepting and/or figuring out two-layer problems on their own
[12:57:26] it would be nice to have a repro or log of the miss->fetch where the BE sends CL and the FE re-sets it to zero, with full raw debug output or whatever
[12:57:45] but I'm not seeming to catch those easily now, was before...
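As a rough illustration of the recv-time fixup idea (a sketch only, not necessarily what the gerrit change above does), VCL that normalizes the malformed header so varnishd's strncasecmp(r, "bytes=", 6) check no longer trips:

    sub vcl_recv {
        # hypothetical normalization: "Range: bytes: 0-65536" -> "Range: bytes=0-65536";
        # anything that doesn't look like a simple bytes range is left alone
        if (req.http.Range ~ "^bytes: +[0-9]*-[0-9]*$") {
            set req.http.Range = regsub(req.http.Range, "^bytes: +", "bytes=");
        }
    }

In practice this would be folded into the cluster's existing recv-time normalization rather than stand on its own.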
[12:59:19] ema: cp4005, ~bblack/clzero.log
[12:59:23] is one I got from my backscroll
[13:00:03] the kind where the FE does:
[13:00:04] - RespUnset Content-Length: 36278
[13:00:05] - RespHeader Content-Length: 0
[13:00:08] yeah
[13:00:50] interesting that the client is FB heh
[13:00:51] so this was a remote-be hit
[13:02:12] yeah, but local-be missed, but set CL
[13:03:05] and judging by: - RespHeader Last-Modified: Fri, 18 Oct 2013 18:40:06 GMT
[13:03:19] it's not like this is a new image, or a new thumb, or a likely recent PURGE heh
[13:05:30] retrying that clzero.log url from curl->cp4005, I now get a miss->hit and good CL
[13:05:35] (hit in cp4005 be)
[13:05:54] and on trying again, a good hit->hit
[13:06:05] so something transient is happening on miss->miss ?
[13:06:23] the age is 55K
[13:06:45] which is in line with the Age sent by cp2011 earlier in clzero.log
[13:07:21] sorry, cp2026
[13:07:23] either way
[13:08:21] if you think we can find a repro without live users, then yeah just gdnsd-depool ulsfo for now
[13:08:56] it's simpler than downgrading most of it
[13:09:02] (or all)
[13:10:02] maybe this is the other half of the chunked-streaming incompat we saw on v3->v4 req
[13:10:29] maybe for v4->v3, it can happen in a more-subtle and borked way, and instead of 503 we get CL:0 when streaming through from be->be->fe somewhere
[13:10:46] but once the object lands in cache, we're not streaming on both ends in the local-be anymore
[13:13:16] I'm trying to capture a few more, so far I'm getting only 416s and 304s
[13:13:33] - RespHeader Content-Range: bytes */801281
[13:13:33] - RespProtocol HTTP/1.1
[13:13:33] - RespStatus 416
[13:13:33] - RespReason Requested Range Not Satisfiable
[13:13:33] - RespReason Requested Range Not Satisfiable
[13:13:36] - RespUnset Content-Length: 801281
[13:13:38] - RespHeader Content-Length: 0
[13:14:15] yeah I've had a hard time getting a new cap as well
[13:14:29] which makes me wonder if all the ones in my backscroll were a spam from a particular source, e.g. FB
[13:14:57] I can still get be-sent CL:0 on miss, just not the unset->reset on the FE
[13:16:55] ok I found a big batch of them in backscroll, let me see if I can grab the whole set
[13:18:18] ok clzero.log in my homedir updated
[13:18:27] those were all from one log one
[13:18:32] *one log run
[13:19:03] perhaps it's an intermittent thing, mirroring the earlier bursts of 503 for v3->v4 reqs failing?
[13:19:32] oh, which BTW happened again
[13:19:39] interestingly, every single example in that log, X-Cache ends with: cp4005 miss, cp4005 miss
[13:19:46] it's never a different BE
[13:20:54] I wonder if similar things happen on the others, and if the be is always 4005 or always self?
[13:24:31] I've captured a bunch on cp400[5-7]:~ema/cl0.log
[13:24:46] user-agent is often facebookexternalhit
[13:26:30] they seem to always go through cp4005
[13:28:51] so a theory is starting to develop in the back of my head, but...
[13:29:21] is there any chance we're running out of sockets at some level here? either because v4 has a new socket leak, or because default-streaming == more socket parallelism. for varnish<->varnish.
[13:29:33] uh
[13:29:53] it could explain 503 bursts, and strange CL:0 behavior could be a replacement for 503 when it's cp4005<->cp4005 because the tcp stack fails a little differently when it's local
[13:30:19] there's a cap in varnish for our varnish->varnish sockets
[13:30:33] and eventually we'd run out in the OS, too, but numbers don't look like that at this moment
[13:31:02] # for pid in `pgrep varnishd`; do ls /proc/$pid/fd|wc -l; done
[13:31:02] 15
[13:31:02] 131
[13:31:02] 13
[13:31:02] 661
[13:31:46] yeah but we're also not presently seeing a burst of 503 or a burst of CL:0 generated in the FE
[13:32:07] mmh I'm getting a few now actually
[13:32:27] yup I see it too on 4005
[13:32:53] capturing them in ~ema/cl0-new.log on cp400[5-7]
[13:32:56] bursty behavior is what makes me think of running into some limit periodically
[13:33:20] CPU usage of varnishd-be is very high on cp4005
[13:33:28] ~140%
[13:33:46] I've seen it higher over the past couple of days, though
[13:33:59] which is why I was looking at it ramping up over time in ganglia
[13:34:10] much lower on cp4013, cp4014 and cp4015
[13:34:37] another thing that could be causing periodic bursts could be a nuking-related bug. even if removing the nuke patch was right, maybe it's not great that we're seeing such big bursts of nukes that do nothing...
[13:35:06] I feel like we're gonna systemtap this one
[13:35:20] we also have a non-default nuke_limit. I think def is 50 and we're at 300. doesn't seem like that would cause any kind of catastrophic scaling issue though.
[13:35:25] probably! :)
[13:35:54] so did we manage to get a frontend-miss repro? I haven't managed so far
[13:36:24] nope
[13:36:33] also, yeah, the 503 bursts are still there for sure
[13:36:42] they're differently-shaped now that we're all-v4 though
[13:36:49] more frequent, flatter, smaller peaks
[13:36:53] OK. We can relatively easily hook into nukelru and try to correlate that with the bursts
[13:37:18] if we think that might be the culprit
[13:37:39] the 503s in the aggregate graph seem to correspond with my earlier capture during a burst, and this one
[13:37:45] mmh
[13:37:51] the most-recent burst that ended, and the one starting up now in the graph
[13:38:01] 503s correlating with CL:0 I mean
[13:38:06] the burst seems over now from my capture
[13:38:15] grafana could be lagging
[13:38:30] and varnishd usage on cp4005 went up to 230%
[13:38:41] is CPU usage gonna go down during the bursts?
[13:38:46] look at the 3h view in https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes for ulsfo+upload+status-code-5
[13:39:09] 503 burst just ended too
[13:40:12] so, yeah, it appears during this event, cpu% drops notably
[13:40:35] on the backend, that is
[13:40:46] only on the backend, yeah
[13:41:07] probably because the backend is failing requests (resulting in 503 or CL:0), and failures are cheaper than transferring a valid result
[13:41:35] so the ones you just caught on 400[67], are they all 4005 be, or all local-be, or mixed?
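Hypothetical spot-checks for the two theories above (socket pressure and LRU nuking); 3128 is assumed to be the varnish-be listen port, and the counter names are the stock Varnish 4 ones:

    # established varnish<->varnish sockets on this host
    ss -tan state established '( dport = :3128 or sport = :3128 )' | wc -l
    # backend-connection trouble and LRU nuking rate as counted by varnish itself
    varnishstat -1 | egrep 'backend_(busy|fail|unhealthy|reuse)|n_lru_nuked'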
[13:41:46] huge CPU usage mismatch between e.g. cp4005 and cp4014
[13:42:30] well the low-numbered ones have had more time to build up running history, in terms of some long-term negative effect on cache storage (malloc or mmapfile)
[13:42:49] true
[13:43:02] vast majority of what I captured is 4005 be
[13:43:31] possible exception captured on cp4007 though, let me check
[13:44:10] some of your capture is 304, not 200
[13:44:16] which is valid for CL:0 switch
[13:45:02] exactly, the one going through cp4007 was a 416
[13:45:19] so maybe just the 4005 backend is the issue for some reason?
[13:45:30] should we try to depool it and see what happens?
[13:45:40] well
[13:45:56] if that's the case, we could break down a few ideas about why and what we'd see after:
[13:46:23] 1. Some bug has struck the file storage on cp4005, which would clear on restarting (implies wiping)
[13:46:40] 2. Some other internal state bug has struck cp4005 varnish-be that would clear on restarting
[13:47:21] 10Traffic, 06Operations: varnishkafka frequently disconnects from kafka servers - https://phabricator.wikimedia.org/T144158#2594406 (10elukey) 05Open>03Resolved
[13:47:37] 3. This is specific to a certain object that vslp-hashes to cp4005. for example, a very large object which is nuked for LRU or PURGEd on a regular basis, which stalls out other requests to storage due to a huge inefficiency dealing with that large object, or something like that...
[13:47:48] and thus impacts other requests
[13:49:26] 4. there's something broken about cp4005 itself that causes failures we don't see elsewhere, at some other level. maybe a broken switch port or whatever.
[13:49:36] I donno what else
[13:49:56] if it's 1 or 2, we'd probably see the problem vanish completely on depool or daemon-restart
[13:50:26] if it's 3, depooling it will move the problem to a different backend eventually, restarting it won't do much in the long run, but might take a while to repro again
[13:50:48] if it's 4, depooling will fix it, restarting it won't, and we'll be able to repro if we re-pool
[13:51:47] so yeah, I guess depool first and see what happens in the wake of that
[13:51:59] alright
[13:51:59] if the problem doesn't eventually move to another backend, we know something
[13:53:43] varnish-be on cp4005 depooled
[13:54:01] and re: CPU usage, you're right. cp4006 also has a very high one
[14:00:04] bblack: cp4006.ulsfo.wmnet:~ema/667.log looks interesting
[14:00:16] - RespStatus 667
[14:00:26] oh
[14:00:30] nevermind
[14:00:33] it's an OPTIONS
[14:00:45] leave me alone in my madness
[14:06:57] 10Traffic, 10MediaWiki-extensions-CentralNotice, 06Operations: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#2591355 (10Pcoombe) CentralNotice is reliant on JavaScript, and doesn't work on browsers where MediaWiki only offers [[https://www.mediawiki.org/wiki...
[14:06:59] if depooling 4005 does fix it for some reasonable period (a couple of hours?), the next step is probably to repool without restarting it and observe the failure again, then restart the daemon.
[14:07:55] bblack: in the meantime, should we ban 200s with method GET and CL:0 from the frontends?
[14:08:47] 10Traffic, 10MediaWiki-extensions-CentralNotice, 06Operations: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#2594482 (10BBlack) Ah, that probably puts a fork in this for notifying truly-outdated clients this way, then. The only case we could capture with CN...
[14:12:29] ema: well, the previous ban didn't work btw
[14:12:58] 1472549025.188861 0 - obj.status == 200 && req.http.Content-Length == 200
[14:13:12] oh yeah, after that one I got it right :)
[14:13:27] but a ban is just a one-shot bandaid for anything already-cached, it's not going to stop the bursts, they're not cache hits
[14:13:48] I haven't seen an FE cache hit with the problem
[14:14:06] have we observed CL:0 based on a local-be hit + front-miss?
[14:14:13] nope
[14:14:25] so banning won't do anything, even temporarily
[14:14:36] users got CL:0 on cache hit though according to T144257
[14:14:38] T144257: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257
[14:15:08] I don't see any indication it was a hit
[14:15:44] (frontend-hit)
[14:15:49] where?
[14:16:01] https://phabricator.wikimedia.org/T144257#2593813
[14:16:14] I personally tried the same request, got CL:0
[14:16:33] with an FE hit?
[14:16:39] yep
[14:17:30] I donno then
[14:18:13] yet another case where the ability to query/explore the cached dataset directly would be a nice debugging aid
[14:20:09] captured a few fe-hits with CL:0 on cp4015:~ema/cl-0-fe-hit.log
[14:21:19] yeah they exist, just looked too :)
[14:21:47] so, sure, periodically banning CL:0 objects from storage makes sense to limit the long-term fallout
[14:21:57] probably should hit the backends too (and first)
[14:22:22] is it possible to do that on obj.*?
[14:22:35] I think something like:
[14:22:36] ban req.status == 200 && req.http.content-length == 0 && req.http.method ~ GET
[14:22:37] (which is way more efficient than a ban that references resp.http)
[14:22:49] req.http.content-length? :)
[14:23:00] yeah, s/req/resp/ where appropriate :)
[14:23:09] I don't think you can ban on resp though, can you?
[14:23:17] it must be obj.http.Content-Length?
[14:23:29] assuming we can ban on that
[14:23:52] but anyways, the point remains: if we can make it reference only obj.*, it will be much more efficient.
[14:24:07] varnish can complete those by walking the storage, instead of checking every request against them until all storage has recycled.
[14:25:28] what do you mean with only obj.*? Not accessing obj.http.*?
[14:25:44] I meant obj.* to include obj.http.*
[14:25:57] and yes, obj.http.Content-Length should work
[14:26:09] I would just ban on that alone and no other parameter
[14:26:26] so we don't care if we ban all OPTIONS I guess :)
[14:26:30] there shouldn't be legit CL:0 objects we care about regardless, and it will make the ban process quickly
[14:26:37] OPTIONS don't cache, they're synthetic
[14:26:53] ah!
[14:27:58] so, ban on just obj.http.Content-Length == 0, on all the cp4 backends then all the cp4 frontends, and it should reset the problem with the cached ones quickly.
[14:28:15] but we'll probably build more over time until we solve this and need to redo that periodically
[14:28:26] this morning I've banned on all frontends with obj.status == 200 && obj.http.content-length == 0, for the record
[14:28:38] ok
[14:28:54] oh obj.status heh
[14:28:59] good idea on that one though :)
[14:29:16] so we don't pointlessly clear cached redirects with CL:0
[14:29:24] yep
[14:29:25] I donno if upload has many though
[14:29:36] 404s should have a length, since they display some kind of error page
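To illustrate the efficiency point being made here (a sketch; one varnishadm per instance, adding -n frontend for the frontends): a ban whose tests touch only obj.* can be completed by the ban lurker walking storage in the background, while anything referencing req.* has to sit on the ban list and be checked against incoming requests until all older storage has recycled:

    # lurker-friendly: only obj.* tests, so storage can be swept in the background
    varnishadm ban 'obj.status == 200 && obj.http.Content-Length == 0'
    # not lurker-friendly: the req.url test forces request-time evaluation instead
    varnishadm ban 'req.url ~ "^/wikipedia/" && obj.http.Content-Length == 0'
    # watch the list shrink as the lurker completes the obj-only bans
    varnishadm ban.list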
[14:36:32] bblack: can you please double-check that I'm not doing anything silly? :)
[14:36:35] sudo salt -b 1 --out=txt -C 'G@site:ulsfo and G@cluster:cache_upload' cmd.run "varnishadm ban 'obj.status == 200 && obj.http.content-length == 0'"
[14:37:33] that looks right to me, modulo the fact that I often get quoting for salt->shell->varnishadm wrong the first time around and then have to redo it :)
[14:37:40] but I think that quoting works, FWIW
[14:38:00] well really it's shell->salt->shell->varnishadm heh
[14:38:31] :)
[14:42:44] the 'help' documentation for ban.list says: List the active bans.
[14:42:47] thank you
[14:43:09] debug.health seems to be gone in v4, is another thing I noticed :(
[14:45:08] I'm gonna ban on the frontends too
[14:46:34] ok
[14:52:14] 10Domains, 10Traffic, 06Operations: Guapopedia - https://phabricator.wikimedia.org/T144276#2594645 (10Joaquinito01)
[14:53:14] Guapopedia ?
[14:54:50] 10Domains, 10Traffic, 06Operations: Guapopedia - https://phabricator.wikimedia.org/T144276#2594645 (10BBlack) https://meta.wikimedia.org/wiki/Proposals_for_new_projects ?
[14:59:29] so we're still pretty clueless as to the real cause of the CL:0, right?
[15:00:22] I think the fact that it's correlated 503+CL:0, and it seems to come and go like it does, points to some kind of resource exhaustion, or a periodic perf regression triggering a bug
[15:00:35] as general starting point hints, anyways
[15:01:15] something is periodically harming network traffic, or periodically we run out of sockets, or periodically some internal process like nuking objects out of storage takes too long and triggers races/fallout, etc...
[15:02:09] what would be really enlightening would be to glean the real reason for the reset of CL:N->CL:0 post-deliver on FE miss, in those cases
[15:02:23] if we knew the code path for that, it would tell us a lot about what's going on in general
[15:05:23] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2594683 (10Dzahn) @SindyM3 I think this should probably have a new ticket. Do you know who is admin of schippers.wikimedia.nl ?
[15:10:11] 304s and 416s might perhaps be involved, in those cases CL is set to 0
[15:10:24] we did run into some similar bug already right?
[15:12:13] well we're not seeing Range commonly in this case
[15:12:36] 304 is a possibility, if we're hitting some logic meant for the 304 case that's being misapplied to a 200.
[15:13:13] in the clzero logs I see failures that lack 304-triggering req headers though, e.g. IMS or Etag-related
[15:13:41] (which kinda rules out that it should be a 304 but just has the wrong output status. more likely it should be a 200 and 304-logic is misapplied to it)
[15:14:13] and yes, we did have a very similar bug early on during cache_misc I think. it was one of the ones that got solved one way or another with upstream patches, I think?
[15:14:48] it's also entirely possible that this bug only exists because of mixed v3+v4 and would go away if the whole stack was v4 through the layers...
[15:15:31] it's also possible this is a bug that crosses request boundaries
[15:15:51] the bug triggers on one magical request with special properties, and corrupts cache state in a way that affects other related requests.
[15:16:11] this one was empty-responses with proper CL: https://github.com/varnishcache/varnish-cache/issues/1956
[15:17:45] yeah, so, on the experimental front, we could try setting grace=0 on all things
[15:17:56] no repro since 4005 backend depool?
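If the grace=0 experiment were tried via runtime parameters rather than a VCL change, one possible sketch (assuming grace isn't also set explicitly on beresp in VCL, which would override the default, and that the backend varnishd is the unnamed instance):

    # only affects objects fetched after the change; existing cache contents
    # keep the grace they were stored with
    varnishadm -n frontend param.set default_grace 0
    varnishadm param.set default_grace 0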
[15:19:46] I've just captured a 200 with CL:0 but without CL rewrite on cp4014.ulsfo.wmnet:~ema/cl-0-2.log
[15:20:09] X-Cache-Int: cp1063 hit/2, cp2002 miss, cp4007 hit/3, cp4014 hit/2
[15:20:33] well it's a cache object then
[15:20:47] was that log running since the bans? surely we would've seen it entering
[15:21:17] captured a couple of minutes ago
[15:21:45] right, I meant I wonder if we would've caught it.. the same CL:0 back when it was FE-miss instead of FE hit/2
[15:23:14] heh I wasn't running varnishlog on cp4007
[15:24:30] but you were on 4014 since the ban?
[15:24:48] because it's also an interesting datapoint if the req->resp was fine when it first entered the cache, then later failed just on hit/2
[15:29:50] bblack: I wasn't on 4014 the whole time since the ban, no
[15:30:44] no more 503 spikes, so far
[15:34:54] hmmm
[15:38:22] HTTrack 3.0x in the meantime keeps doing stupid stuff
[15:39:17] (currently spamming my varnishlog capture on cp4007)
[15:39:32] lots of "RANGE_FAIL high smaller than low"
[15:40:13] yeah I guess ignore that UA
[15:40:32] we could just block it if it's so commonly-broken.
[15:41:16] 10Traffic, 06Operations: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853#2594754 (10BBlack) In the most-recent discussions about this, I've been liking the idea of breaking this down into frontend and backend caching. Frontend is heavy-traffic and high-complexity (in VCL/vmod terms)....
[15:59:01] bblack: "There might be interop downsides anywhere they disagree on HTTP protocol subtleties"
[15:59:08] as if varnish 3 and 4 always agreed :)
[16:02:13] I'm not seeing much going on now except for the CL rewrites due to 416s
[16:24:57] 10Domains, 10Traffic, 06Operations: Guapopedia - https://phabricator.wikimedia.org/T144276#2594923 (10Aklapper) 05Open>03stalled a:05Joaquinito01>03None Hi @Joaquinito01, thanks for taking the time to report this! Unfortunately this report lacks some information. If you have time and can still reprod...
[17:15:25] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2595125 (10Multichill) >>! In T118388#2594683, @Dzahn wrote: > @SindyM3 I think this should probably have a new ticket. Do you know who is admin of schippers....
[17:36:45] 07HTTPS, 10Traffic, 10Monitoring, 06Operations: adjust ssl certificate monitoring to differentiate between standard and LE certificates. - https://phabricator.wikimedia.org/T144293#2595236 (10RobH)
[17:37:40] 07HTTPS, 10Traffic, 10Monitoring, 06Operations: adjust ssl certificate monitoring to differentiate between standard and LE certificates. - https://phabricator.wikimedia.org/T144293#2595249 (10RobH)
[18:11:22] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2595369 (10Dzahn) ok, cool. thanks @Multichill
[20:08:23] ema: depooled ulsfo for upload only (not other clusters). It doesn't look much worse than it did before, but it still does recur with 4005's backend out of the picture, and I can't monitor and debug it the rest of the afternoon/evening, so basically no point suffering the minor fallout the whole time.
[20:08:37] ema: if you want to depool when you get here in the morning to debug further, feel free
[20:08:44] ema: s/depool/repool/ above :)
[22:20:47] 10netops, 06Operations: Connection problems (from NZ to ULSFO) - https://phabricator.wikimedia.org/T144263#2596223 (10Peachey88) 05Open>03Resolved >>! In T144263#2594075, @Nurg wrote: > No further problems for a while now, so it may have come right. Thanks everyone. Marking as closed, reopen or refile it...