[07:09:36] 10netops, 10Analytics-Kanban, 10EventBus, 10Operations, 10Patch-For-Review: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10elukey) 05Open→03Resolved To keep archives happy: ` term schema { from { destination-address { /* sc...
[10:51:35] going for lunch, please do not enable puppet on cache hosts meanwhile!
[10:53:38] would they eat your lunch?
[11:26:59] 10netops, 10Analytics, 10Operations, 10hardware-requests, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (10elukey) Summary of what would be needed: * kafka-jumbo1001 (A1) -> 10G card + relocation to a 10G rack * kafka-jumbo1002 (A2) -> 10G...
[12:12:40] mark: yes! :)
[12:15:03] 10Traffic, 10Operations, 10monitoring: RIPE Atlas data in Prometheus - https://phabricator.wikimedia.org/T221964 (10CDanis)
[12:31:43] re-enabling puppet, we're looking good!
[12:42:58] meanwhile, mailbox lag has been growing on upload@ulsfo hosts, they probably didn't like the additional load due to cp4021 being depooled
[12:43:47] cp4021 is ready though now (with ATS), let's pool it
[12:46:19] :)
[12:47:38] pcc is my new best friend: https://phabricator.wikimedia.org/P8446
[12:54:29] ok so upload@ulsfo has been under the weather for a few hours now https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=backend&from=now-6h&to=now
[12:55:21] some varnish-bes there have been running for > 0.5 weeks due to the fact that removing cp4021 from the group of varnishes changed the crontabs for varnish-be-restrt
[12:55:24] *restart
[12:58:42] meanwhile ATS looks good https://grafana.wikimedia.org/d/wI0nURqiz/ats-cluster-view?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&from=now-3h&to=now
[13:03:22] mmh wait
[13:03:33] most of the 503s I see on upload backends are due to
[13:03:34] -- FetchError straight insufficient bytes
[13:04:13] bblack: that rings a bell doesn't it?
[13:05:21] those are 404 responses from codfw that get turned into 503
[13:10:53] I think we've seen this on varnish requests with Accept-Encoding: gzip, the origin responding with Content-Type: gzip and uncompressed body
[13:11:09] but in this case the response is Content-Type: text/html; charset=UTF-8
[13:22:54] yeah but we fixed that right?
[13:23:05] yes that one we fixed
[13:23:06] anyways, are non-ulsfo backends seeing any of this?
[13:23:24] it does not seem so, I suspect some bot in ulsfo
[13:23:46] the 404s are random junk URLs or something?
[13:24:04] (are they thumbs maybe? I seem to remember complications about true-404s on thumb URLs)
[13:24:22] they look like normal traffic, but the UA is interesting
[13:24:27] node-fetch/1.0
[13:25:59] (not exclusively, but mostly)
[13:26:28] they are all thumbs
[13:26:57] at least, looking at the sampled-1000 data, I see a bunch of node-fetch/1.0 UAs hitting int-remote 404s exclusively on thumbs
[13:27:01] all from 1 IP
[13:27:13] checking on turnilo
[13:27:15] AWS
[13:27:57] but regardless of what the dumb bot is doing, why are the 404 responses from swift/thumbor seeming to have bad CL?
[13:27:57] yeah, confirmed loads of requests from that UA
[13:28:15] do they?
[13:28:17] (well could be AWS, or Amazon itself, we've seen them fetch tons of stuff to cache for alexa)
[13:28:32] well that's what varnish claims anyways, whether it's true or not I donno
[13:28:50] but varnish is saying basically "the 404 response from the applayer said CL: N, but gave less than N bytes data before closing"
[13:29:16] it's possible that's actually true
[13:29:16] ah right, that's what straight insufficient bytes means :)
[13:29:56] pick one of the URLs the bot hit and try it manually against swift?
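The "straight insufficient bytes" fetch error described above comes down to a simple check: the origin declared a Content-Length of N but closed the connection after delivering fewer than N body bytes. A minimal Python sketch of that arithmetic (function name hypothetical, not Varnish's actual code):

```python
def body_shortfall(declared_cl: int, body: bytes) -> int:
    """Return how many bytes short the body is of the declared
    Content-Length (0 means the response arrived complete)."""
    return max(declared_cl - len(body), 0)

# The broken swift 404s declare CL:69 but deliver only "Not Found"
# (9 bytes), leaving 60 bytes unaccounted for -- matching curl's
# "transfer closed with 60 bytes remaining to read" seen later.
print(body_shortfall(69, b"Not Found"))  # 60
```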
[13:32:05] I got a 404 with a CL:0 when I tried one against ms-fe.svc.codfw.wmnet
[13:32:58] same on eqiad
[13:33:18] I got CL: 69 but the body is just "Not Found"
[13:33:36] heh
[13:33:37] so it seems that varnish is saying the truth
[13:33:46] odd
[13:34:00] godog: curl -v https://swift-ro.discovery.wmnet/wikipedia/en/thumb/7/7c/Center%7C200px%7CMaxfli_logo/200px-Center%7C200px%7CMaxfli_logo -o x
[13:34:08] oh right, my test URLs came from the set that actually did a 404 to the user, not the 503s
[13:34:27] godog: this returns CL:69, but smaller content ("Not Found")
[13:35:04] godog: it seems that the 503s in upload@ulsfo are due to stuff like that
[13:35:20] yeah same
[13:35:30] at least on some thumbnails it does that
[13:35:43] on some other thumbnails it does a clean 404 + CL:0
[13:35:54] even curl shows a connection error
[13:36:02] curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
[13:36:20] in my example I got
[13:36:21] curl: (18) transfer closed with 60 bytes remaining to read
[13:36:52] which is 60 - len("Not Found"), so that adds up
[13:37:01] err, 69 I mean
[13:37:50] curl -4 -v 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/37/I_have_House_and_Land_in_Kent.ogg/200px-I_have_House_and_Land_in_Kent.ogg' --resolve upload.wikimedia.org:443:10.2.1.27
[13:38:00] ^ this one gave me a 500 from swift
[13:38:09] (it was taken from another example from that UA/IP)
[13:38:40] same 500 (with a delay too) from eqiad or codfw
[13:38:41] odd
[13:38:56] this is the bot on turnilo: https://bit.ly/2XOTTRv
[13:39:10] although why anyone would ask for a 200px thumbnailing of an ogg file in the first place, I donno
[13:39:47] a lot of my 5xx examples seem to be thumbnail requests on audio files
[13:40:50] poor ATS gets its debut ruined by this :)
[13:40:52] some others sound like images from their title, but have no .jpg or other image extension
[13:41:12] I don't see any examples I can clearly find, of a real image with a real image file extension causing broken behavior
[13:42:30] here's an interesting data point too:
[13:42:41] from the 5xx logs from that UA/IP, I have the example:
[13:42:46] /wikipedia/en/thumb/b/b4/ProAc_logo/200px-ProAc_logo
[13:43:01] if I fetch that as-is via curl from swift, I get a pause and a 500 error
[13:43:17] err sorry I said that wrong, that was a previous test!
[13:43:28] if I fetch that as-is via curl from swift, I get a 404 with CL:69 and "Not Found" and a bad stream close
[13:43:34] right
[13:43:54] but if I guess at this change to the URL:
[13:43:56] /wikipedia/en/thumb/b/b4/ProAc_logo.jpg/200px-ProAc_logo.jpg
[13:44:02] I get a 404+CL:0 that's successful
[13:44:08] (well, not 5xx-causing anyways)
[13:44:24] with actual empty content?
[13:44:30] yes
[13:44:45] this reminds me of gilles' patch
[13:44:49] but maybe the jpg version doesn't exist. any random change to the title pathname seems to produce the clean CL:0 404
[13:44:57] https://gerrit.wikimedia.org/r/#/c/495643/
[13:45:21] he's returning 400 if the thumbnail url does not end with jpg and friends
[13:45:54] so maybe this is a known issue?
[13:45:58] /wikipedia/en/thumb/b/b4/asdfqwerty.jpg/200px-asdfqwerty.jpg -> clean 404+CL:0
[13:46:21] /wikipedia/en/thumb/b/b4/asdfqwerty/200px-asdfqwerty -> broken 404+CL:69+"Not Found"
[13:46:44] (I'm assuming asdfqwerty doesn't actually exist in either case, at least in that thumbing path bin)
[13:47:29] many of the 503s I see on cp4025 are indeed requests for thumbs w/o file extension
[13:48:09] the ones in the 5xx.json greps also include a bunch of thumbnails for music files
[13:48:16] but yeah, both cases are clearly "wrong"
[13:48:31] that's a shit-ton of "please validate my input for me" new VCL though :/
[13:48:50] shouldn't the applayer be handling all that and just 404-ing anything that's not "real"?
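The pattern established above (extension-less thumb paths hitting the broken 404 path, extension-bearing ones getting a clean 404) is what gilles' patch guards against by rejecting thumbnail URLs that don't end in a known extension. A rough sketch of that kind of pre-validation; the extension list and function name here are illustrative, not the actual patch's code:

```python
# Hypothetical allow-list; the real patch (gerrit change 495643)
# defines its own set of accepted extensions.
THUMB_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp')

def thumb_url_looks_valid(path: str) -> bool:
    """Cheap sanity check: reject thumbnail requests that lack an image
    extension, so e.g. /thumb/b/b4/ProAc_logo/200px-ProAc_logo can be
    answered with a 400 before it reaches the broken 404 path."""
    return path.lower().endswith(THUMB_EXTENSIONS)

print(thumb_url_looks_valid('/wikipedia/en/thumb/b/b4/ProAc_logo.jpg/200px-ProAc_logo.jpg'))  # True
print(thumb_url_looks_valid('/wikipedia/en/thumb/b/b4/ProAc_logo/200px-ProAc_logo'))          # False
```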
[13:49:05] (which it seems to be trying to do, just kinda failing at it with the CL:69 thing)
[13:49:06] yes, and definitely return the right CL while at it
[13:49:14] :)
[13:57:49] fascinating, thanks ema bblack
[13:57:56] for investigating that is
[14:02:55] godog: you're welcome! List of some urls being affected here https://phabricator.wikimedia.org/P8447
[14:09:08] yeah looks like mostly invalid requests, where we fail at 404'ing correctly from swift
[14:09:19] my money is on the rewrite.py swift middleware
[14:11:10] right, see also gilles' patch linked above
[14:11:24] I'm guessing he already identified most of the edge cases, which that patch works around
[14:11:29] but it would be nicer if they weren't solved in VCL
[14:12:29] ah, thanks I missed the patch, indeed it'd be better if swift/thumbor DTRT here
[14:14:46] ema: in the meantime, maybe we can do a specific fixup here in VCL for the problematic case so that it doesn't cause mailbox lag?
[14:15:18] ema: e.g. a patch in the backend header reception for upload that says "if status==404 && CL==69 { set CL= }"
[14:15:38] I'm guessing 10 or 11
[14:15:58] oh there's no \r or \n heh, so 9
[14:17:20] bblack: hoping that we don't get legit 404s with CL:69 :)
[14:18:31] probably not :)
[14:19:53] or just unset the CL and let it be close-delimited?
[14:20:04] I donno if that works with http/1.1 or whatever, I didn't think that through, but maybe
[14:20:35] but I guess we have persistent conns with swift, so close-delimited wouldn't work
[14:20:36] perhaps the frontend can do that in _backend_response, if 404, unset CL
[14:21:04] yeah but the error + switch to 503 happens in the backend I think?
[14:21:26] ah yes of course
[14:22:03] why only the "edge-most" backend though
[14:22:06] just unsetting CL only works if it's close-delimited, which implies no persistent conns (or at least, that the backend immediately closes in the special case of these 404s)
[14:22:31] because once the edge-most backend has handled this situation (by turning it into a 503) it's not a problem for the rest of the stack
[14:22:59] no I mean why doesn't the backend in codfw get confused by the CL mismatch
[14:23:27] oh
[14:23:36] yeah I guess that is a little odd
[14:24:01] oh, but if it happened simultaneously in both we'd not see it in our webrequest stuff I think...
[14:24:16] you have to figure it's probably streaming, esp since there is an explicit CL and this is upload
[14:24:34] so the 404+CL:69 comes through all the layers before anything realizes the lack-of-content problem
[14:24:35] beresp status from codfw is 404 according to varnishlog -g request on the ulsfo backend
[14:24:49] initially anyways
[14:25:11] I think this is all just fallout of the broken-ness though, combined with streaming and persistent conns
[14:25:28] and I suspect swift actually is breaking the connection to the backend-most after its bad 404
[14:25:54] (if it weren't, then varnish would just stall and timeout waiting on more bytes on the connection, instead of complaining of a disconnect with insufficient bytes)
[14:26:46] is there a way to make curl pause and hold the connection open after fetching its url?
[14:27:12] maybe not relevant though
[14:27:28] but anyways, all the varnish<->varnish connections are persistent
[14:28:23] so the flow would be that the request rides varnish-fe -> varnish-be -> varnish-be over persistent conns. varnish-be may use an existing persistent conn to swift to make the request. Gets 404+CL:69 which immediately streams back up through the other layers.
[14:29:25] then swift sends 9 bytes and closes, and the backend-most notes insufficient bytes but can't actually convert anything to a 503 at this point. I guess all it can do is maybe break the persistent conn to the next varnish and repeat the pattern?
[14:29:46] I'm not sure how or why it eventually does get converted to a 503. some layer must not actually be streaming-through the headers.
[14:30:33] or maybe it's just a race condition on whether it notices the insufficient bytes before it has actually forwarded the 404 status line? seems unlikely though.
[14:33:53] 10Traffic, 10Operations, 10monitoring: RIPE Atlas data in Prometheus - https://phabricator.wikimedia.org/T221964 (10faidon)
[14:42:13] the bug is probably somewhere in: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/swift/files/SwiftMedia/wmf/rewrite.py#81
[14:42:57] somewhere in there, it's mismatching a CL=69 from the original response with a new error body content or vice-versa, in that 404-handling of two types there
[14:43:23] return swob.HTTPException(status=status, body=error.msg, headers=error.hdrs.items())
[14:43:42] maybe
[14:44:32] or maybe 404 doesn't cause an exception there
[14:44:53] there's just below that a headers_whitelist, maybe the CL=69 gets whitelist-copied in at the end even though the status body is different
[14:45:06] bblack: re: VCL workaround, should we disable streaming at the backend-most varnish and unset CL so that varnish-be computes CL for 404s with CL:69, or should we just set CL:9?
[14:45:23] can we disable streaming after receiving the headers?
[14:45:43] oh right, of course we can, which explains why my meandering attempt to explain the issue above failed. streaming doesn't happen until the body.
[14:45:46] we definitely disable streaming if !beresp.http.Content-Length
[14:46:19] yeah you could try if status==404 && CL==69, unset CL and disable streaming, that might be somewhat-saner.
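The workaround floated here ("if status==404 && CL==69, unset CL and disable streaming") would look roughly like the following in backend response handling. This is an untested sketch of the idea from the discussion, not deployed VCL, and the heuristic turned out not to be viable later in the conversation (legitimate 404s with CL:69 exist):

```
sub vcl_backend_response {
    # Suspected-broken swift 404s: declared CL:69 but only "Not Found"
    # (9 bytes) arrives on the wire. Drop the bogus CL and disable
    # streaming so varnish buffers the body and computes the real
    # length itself.
    if (beresp.status == 404 && beresp.http.Content-Length == "69") {
        unset beresp.http.Content-Length;
        set beresp.do_stream = false;
    }
}
```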
[14:46:48] but, that's all predicated on the idea that either our swift connections aren't persistent at all, or they're persistent but swift disconnects on these 404s.
[14:47:05] if the connection stays alive, it's just going to wait on close-delimit, since it wasn't chunked and there's no CL
[14:47:16] (so they'd become timeouts down there)
[14:47:41] I don't remember if they're usually persistent, but I would guess they would be
[14:55:44] ok so, another tidbit:
[14:56:08] if you connect to thumbor directly and try one of these invalid URLs, it returns a 404 with CL:69, and an actual 69 byte body, which reads:
[14:56:19] 404: Not Found404: Not Found
[14:56:40] ha!
[14:56:41] so yeah, swift-rewrite in this case is replacing the body with just "Not Found", but leaving the original status+CL
[14:57:12] somewhere near that block I linked above I'm sure, but I can't quite follow it all
[14:58:03] to answer the "persistent connections" part above, yes we are reusing connections in general
[14:58:15] it may be that 404 is not an exception, and swob.Response() gets called which changes the body to "Not Found", and then we whitelist-copy in the original CL:69 afterwards
[15:01:05] and the backend-most varnish is closing the connection when the CL:69 thing happens
[15:01:08] -- FetchError Resource temporarily unavailable
[15:01:10] -- FetchError straight insufficient bytes
[15:01:13] -- BackendClose 391 vcl-594195fa-7c2d-41b4-a729-07bdf44e6996.be_ms_fe_thumbs_svc_eqiad_wmnet
[15:01:22] but, it responds with status 404
[15:03:15] and 'Connection: close', so the other caches on the way to the client also close their conns
[15:04:04] hmm no, I think urllib2 considers 404 to be an HTTPError
[15:19:49] ema: godog: completely blind and untested (so maybe could use another look), but I suspect this will fix this and related issues:
[15:19:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/506694
[15:20:00] or something very much along those lines anyways
[15:21:13] bblack: my suspicion is that error.msg is the simple "Not Found" string, while error.hdrs['Content-Length'] is 69
[15:21:56] bblack: so another attempt at blind patching would be error.hdrs['Content-Length'] = len(error.msg) just before line 94
[15:23:02] yeah assuming error.hdrs can be modified
[15:23:09] I guess so!
[15:23:58] we could also try to leave the original body in place instead, but I'm not sure how to do that, the docs on urllib2.HTTPError and swob.HTTPException are kinda sparse and rely on digging through superclasses or whatever
[15:24:22] I'm not 100% sure this is even the right spot to fix, but I think it is
[15:26:08] seems a more appropriate place than the varnish backend VCL :P
[15:26:10] your idea about a one-liner above L94 seems the most succinct and likely to succeed though
[15:27:09] I wonder if "error" itself (urllib2.HTTPError) has a body accessor itself, which is separate from "msg", in which case it can just be a body=error.body fix
[15:27:12] but I have no idea
[15:27:41] I think "error.msg" isn't meant to be a body, it's meant for e.g. logging
[15:28:26] I think the error should show up in beta too, to test the change that is
[15:28:38] check out also rewrite_integration_test.py in the same directory, has a bunch of test cases
[15:31:08] yeah good thinking, hack it in a beta cherry-pick and try one of the above
[15:31:15] and maybe add integration testing :)
[15:31:23] anyways, abandoning mine, it was just for reference
[15:32:09] 10netops, 10Operations, 10fundraising-tech-ops: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (10cwdent) @ayounsi after talking to @jgreen I'm going to redo the DNS using the wildcard cert to also have the failover cname. Is there a public IP available for the cod...
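The one-liner proposed at [15:21:56] (forcing Content-Length to match the substituted error body before the response is built) can be sketched in isolation like this. The swob/WSGI plumbing is omitted and the helper name is hypothetical; the names only mirror the chat, so treat this as an illustration of the fix rather than the actual rewrite.py change:

```python
def fix_error_headers(hdrs: dict, msg: str) -> dict:
    """When the rewrite middleware substitutes its own short body
    (e.g. "Not Found") for the origin's 404, the origin's
    Content-Length (69 in the incident above) must be replaced too,
    or varnish sees "straight insufficient bytes"."""
    fixed = dict(hdrs)  # assumes the header mapping can be copied/mutated
    fixed['Content-Length'] = str(len(msg))
    return fixed

hdrs = fix_error_headers({'Content-Length': '69'}, 'Not Found')
print(hdrs['Content-Length'])  # '9'
```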
[16:06:48] there's plenty of legit 404s with CL:69 unfortunately -- eg: https://swift-ro.discovery.wmnet/wikipedia/commons/8/80/Pa so the VCL approach of setting CL:9 does not seem viable
[16:06:58] aaand I've gotta go :)
[16:20:43] 10netops, 10Operations, 10fundraising-tech-ops: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (10ayounsi) https://github.com/wikimedia/operations-dns/blob/master/templates/152.80.208.in-addr.arpa#L24 `208.80.152.235/28` is free.
[17:28:49] 10Traffic, 10Operations, 10Core Platform Team Backlog (Next), 10Services (next): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (10mobrovac)
[17:28:58] 10Traffic, 10Operations, 10Core Platform Team Backlog (Next), 10Services (next): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (10mobrovac)
[17:29:09] 10Traffic, 10Operations, 10Core Platform Team Backlog (Designing), 10MW-1.33-notes (1.33.0-wmf.17; 2019-02-12), and 6 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac)
[17:33:35] 10Traffic, 10Operations, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Package libvmod-uuid for Debian - https://phabricator.wikimedia.org/T221977 (10mobrovac) p:05Triage→03Normal
[17:33:55] 10Traffic, 10Operations, 10Core Platform Team Backlog (Next), 10Services (next): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (10mobrovac)
[17:33:59] 10Traffic, 10Operations, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Package libvmod-uuid for Debian - https://phabricator.wikimedia.org/T221977 (10mobrovac)
[18:11:05] 10netops, 10Operations, 10Operations-Software-Development, 10User-crusnov: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10crusnov) a:03crusnov
[21:17:23] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev)
[22:57:09] 10netops, 10Operations: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (10ayounsi) > After checking the information. > We have created a PR 1433009 and engineering will analyze and will update us on the findings.