[10:35:18] Traffic, MediaWiki-Docker, Operations, serviceops, User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (JMeybohm) I was able to reproduce as well now. Seems like we are missing the "content...
[10:35:47] Hi o/ - I could use your wisdom and help on this
[10:51:22] Traffic, MediaWiki-Docker, Operations, serviceops, User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (JMeybohm) p:Low→High Also rising priority as I guess this will affect more an...
[12:21:17] Traffic, Operations, Technical-blog-posts: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (ema) >>! In T270074#6696782, @srodlund wrote: > I looked at the doc and was able to copy edit it! If you are able to go through a...
[14:12:40] vgutierrez: you maybe got a minute to talk about T270270 ?
[14:12:40] T270270: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270
[14:13:45] that smells more like ema's turf..
[14:14:18] I figured he's probably not around (from his mail to ops)
[14:14:54] jayme: I'm half around :)
[14:15:04] half ema is better than 1 vgutierrez in this scenario
[14:15:07] but I can assist him
[14:16:18] hehe, okay. I'll take 1x 0.5 ema then and see how far I get with that :)
[14:17:09] jayme: does the origin server return Content-Length?
[14:17:51] As I don't see the issue when talking to .discovery.wmnet I think yes
[14:18:06] ema can just be like "Transfer-Encoding: chunked"
[14:18:45] oh, you got the same nick color as ema, mark - that was a bit confusing :D
[14:22:06] jayme: I do get Content-Length for https://docker-registry.wikimedia.org/v2/releng/node10-test-browser/manifests/latest on both cache hits and misses
[14:22:24] however, I can reproduce passing 'Accept: application/vnd.docker.distribution.manifest.v2+json'
[14:23:02] (which is my reading of what https://phabricator.wikimedia.org/P13565 does)
[14:23:53] That's correct, docker uses that header
[14:24:47] Almost anytime I guess by now. There are a couple of weird Accept headers allowed with the registry AFAIK
[14:27:08] Could it be that something weird happens in cache when the same URL returns different responses depending on the accept header?
[14:27:31] If we don't take the header into account when calculating the cache key for example
[14:29:23] the response does not have "Vary: Accept", hence once cached we'd return cache hits on the object regardless of the value of Accept
[14:30:24] Okay. That's probably bad as well but does not explain why content-length is missing I guess
[14:30:29] right
[14:31:42] phab search just revealed my obliviousness https://phabricator.wikimedia.org/T256762 *sigh
[14:32:36] https://phabricator.wikimedia.org/T242200 more precisely
[14:36:55] ok so ATS is doing the right thing
[14:37:26] varnish, instead, on cache miss/pass streams the response
[14:37:45] the question is why does it do so only if Accept is among the request headers
[14:38:36] well not just if Accept is among the headers, only if Accept is "application/vnd.docker.distribution.manifest.v2+json"
[14:40:11] There are some special cases on the registry hosts as well it seems...
[14:40:47] like two different nginx server blocks, one stating it's only to be reached via varnish
[14:42:44] ok so if Accept is application/vnd.docker.distribution.manifest.v2+json the origin responds with Content-Type: application/vnd.docker.distribution.manifest.v2+json
[14:43:00] if Accept is "potato" the origin responds with application/vnd.docker.distribution.manifest.v1+prettyjws
[14:43:15] and in VCL we have:
[14:43:20] if (beresp.http.content-type ~ "json|text|html|script|xml|icon|ms-fontobject|ms-opentype|x-font|sla"
[14:43:24] && (!beresp.http.Content-Length || std.integer(beresp.http.Content-Length, 0) >= 860)) {
[14:43:27] set beresp.do_gzip = true;
[14:43:29] }
[14:44:34] so varnish is gzipping the response and streaming it (hence no Content-Length)
[14:46:00] the question I guess is: why does docker fall on its face if the response does not have Content-Length
[14:46:06] ah, okay.
[14:46:23] idk that, but it seems to be a recent addition
[14:46:25] but we'll leave it to another day, we can add an exception to that VCL snippet so that the response is not gzipped for docker-registry, or something like that
[14:47:29] would it make sense to maybe gzip on the docker hosts (there is an nginx in front of the actual registry anyways)?
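For illustration, the VCL condition quoted at 14:43 can be mirrored as a small Python predicate. This is a rough sketch, not the production VCL: `should_gzip`, `parse_int` and the lowercase-keyed header dict are hypothetical names, and `parse_int` only approximates `std.integer(s, fallback)`.

```python
import re

# Same alternation as the regex in the quoted VCL snippet.
COMPRESSIBLE = re.compile(
    r"json|text|html|script|xml|icon|ms-fontobject|ms-opentype|x-font|sla"
)
MIN_GZIP_BYTES = 860  # size threshold from the quoted VCL


def parse_int(value, fallback):
    """Approximate std.integer(s, fallback): return fallback if unparsable."""
    try:
        return int(value.strip())
    except (ValueError, AttributeError):
        return fallback


def should_gzip(headers):
    """Return True if the quoted condition would set beresp.do_gzip.

    `headers` is assumed to be a dict of backend response headers with
    lowercase keys (an assumption of this sketch).
    """
    if not COMPRESSIBLE.search(headers.get("content-type", "")):
        return False
    content_length = headers.get("content-length")
    if not content_length:
        # No Content-Length from the backend: the condition still matches.
        return True
    return parse_int(content_length, 0) >= MIN_GZIP_BYTES
```

Under these assumptions a `...manifest.v2+json` response of unknown or large size qualifies for gzip, while `...manifest.v1+prettyjws` does not match the regex at all, which fits the observation that Content-Length only went missing for the v2 Accept header.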
[14:48:37] Traffic, netops, Operations, User-jbond: varnihs filtering: should we automaticly update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (jbond) p:Triage→Medium
[14:49:08] jayme: that's a whole different can of worms, we explicitly unset Accept-Encoding at the ATS backend layer due to issues such as those described in T125938
[14:49:08] T125938: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938
[14:50:01] oh, I see
[14:50:06] we do have it somewhere in the mid-term ideas to re-evaluate that decision but there's work to do
[14:51:40] In that case it would be nice if we could just not apply that particular rule to the docker-repo to work around this "improvement"
[14:52:37] jayme: I think more in general we should only return gzip responses to clients that have gzip in Accept-Encoding
[14:53:15] Yeah, but I'm not sure if docker maybe sends it tbh
[14:56:22] Traffic, netops, Operations, User-jbond: varnihs filtering: should we automaticly update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (Volans) AWS allow to subscribe to the modification of the list fwiw, see https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html#subscri...
[14:57:52] Traffic, netops, Operations, User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (Volans)
[14:58:34] jayme: it doesn't seem so, at least judging from the headers I see in varnishlog when doing: docker pull docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1
[15:01:57] ema: okay
[15:02:48] let me write a vtc test and see what breaks
[15:03:37] ema: did you see somewhere a reference that ats/varnish uses TCP port 81 to connect to the registry?
[15:04:56] jayme: no, ATS uses https://docker-registry.discovery.wmnet
[15:05:11] and Varnish does not connect to the origins at all, it connects to ATS only
[15:05:27] hmm...maybe that's deprecated and from the days prior to ats then
[15:06:10] is there any connection established to :81 ?
[15:08:04] eheh, yeah. From alert
[15:22:34] Added a patch to send "Vary: Accept" back to the clients. Not urgent I guess as it's like that for quite some time now. Old :81 stuff I will not touch before the holidays but clean up next year
[15:29:33] jayme: nice
[15:31:39] meanwhile I think my reasoning about Accept-Encoding above is incorrect: varnish does honor the client's Accept-Encoding and decompress stuff before sending it if needed
[15:32:55] so instead of checking accept-encoding before setting beresp.do_gzip, which should only affect the compression before storing in cache, let's try to disable streaming for docker
[16:04:41] Fine for me as well :)
[16:06:15] jayme: https://gerrit.wikimedia.org/r/c/operations/puppet/+/650156
[16:13:19] ema: Thanks!
[16:19:03] jayme: yw! I forced a puppet run on cp3062 (the host my IP is chashed to) and I now consistently get Content-Length
[16:19:52] great!
[16:22:07] running puppet on all other cache_text nodes
[16:22:09] thanks again. Now we can go and build up a front against docker for being so stubborn about content-length existing in the response :D
[16:24:04] I suppose the docker registry does not support streaming, hence no reason to bother supporting it in the client too, right? :P
[16:25:16] Maybe...but it was accepted since docker 20.10 it seems. Hense no reason to break it :)
[16:25:21] *c
[16:31:55] Traffic, MediaWiki-Docker, Operations, serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (JMeybohm) Open→Resolved a:JMeybohm For the record: We where sending "Content-Type:...
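The "Vary: Accept" point raised at 14:29 and patched at 15:22 can be sketched with a toy cache. This is only an illustration of the general HTTP caching idea, not how Varnish or ATS implement it; `TinyCache` and its methods are hypothetical names, and header dicts are assumed to use lowercase keys.

```python
class TinyCache:
    """Toy cache whose key is the URL plus the request headers named in Vary."""

    def __init__(self):
        self._vary = {}     # url -> tuple of request header names to vary on
        self._objects = {}  # (url, variant header values) -> cached body

    def _key(self, url, request_headers):
        names = self._vary.get(url, ())
        return (url, tuple(request_headers.get(name, "") for name in names))

    def lookup(self, url, request_headers):
        """Return the cached body for this request, or None on a miss."""
        return self._objects.get(self._key(url, request_headers))

    def store(self, url, request_headers, response_headers, body):
        """Cache a response, remembering which request headers it varies on."""
        vary = response_headers.get("vary", "")
        self._vary[url] = tuple(
            name.strip().lower() for name in vary.split(",") if name.strip()
        )
        self._objects[self._key(url, request_headers)] = body
```

Without a `vary` response header, a lookup with `Accept: potato` returns whatever variant was cached first; with `vary: accept`, the same lookup is a miss and only a request with the matching Accept value hits.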
[16:32:00] Traffic, MediaWiki-Docker, Operations, serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (ema) This should now be fixed by using https://gerrit.wikimedia.org/r/c/operations/puppet/+/650...
[16:32:52] jayme: ha! we need a mutex on adding phab comments :)
[16:33:48] hehe :) I thought when you do the work I could at least do the "writing" ;-)
[16:34:42] I'm unable to reproduce ..
[16:35:53] Nice. I need to run. ttyl
[16:37:38] Traffic, Operations, Patch-For-Review: Docker registry needs cache to vary on Accept header value - https://phabricator.wikimedia.org/T242200 (JMeybohm) a:JMeybohm
[16:40:04] ema: can i bug you to check this before you go away for christmas. would be nice to get this out the door before the break https://gerrit.wikimedia.org/r/c/operations/puppet/+/650171
[16:40:06] Traffic, MediaWiki-Docker, Operations, serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (kostajh) Resolved→Open >>! In T270270#6698851, @ema wrote: > This should now be fixed b...
[16:41:40] Traffic, MediaWiki-Docker, Operations, serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (JMeybohm) Maybe Mac sends a different user-agent? That would be fun...
[16:50:20] jbond42: lgtm
[16:50:38] cool lets give it a go :)
[16:54:31] ema: i have merged and ran puppet on all ats servers and https://config-master.wikimedia.org/ still seems to work is there anything else i should do to test ? and thx
[16:55:10] jbond42: nope, looks good
[16:55:44] awesome thanks
[16:59:45] HTTPS, Traffic, Operations, serviceops: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (jbond) Open→Resolved a:jbond Sorry for the delay however this has been configured now
[16:59:49] HTTPS, Traffic, Operations, codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (jbond)
[17:02:32] HTTPS, Traffic, Operations, codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (Dzahn) >>! In T108580#6488253, @BBlack wrote: > $ grep 'replacement: http:' hieradata/common/profile/trafficserver/backend.yaml > replacement: http://puppetmaster1001....
[17:04:11] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) also see T108580
[17:04:48] HTTPS, Traffic, Operations, serviceops: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (Dzahn) @jbond @ema So puppetmaster1001 can also be checked off on T210411 ?
[17:27:56] Traffic, MediaWiki-Docker, Operations, serviceops, User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (ema) >>! In T270270#6698866, @JMeybohm wrote: > Maybe Mac sends a different user-agen...
[20:31:54] Traffic, MediaWiki-Docker, Operations, serviceops, User-zeljkofilipin: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (ema) The current theory is that the problem boils down to the following HEAD request...
[21:09:07] Traffic, MediaWiki-Docker, Operations, serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (ema) >>! In T270270#6699691, @ema wrote: > That being said, I suspect that our VCL trying to do...
[21:10:33] Traffic, MediaWiki-Docker, Operations, serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (kostajh) >>! In T270270#6699794, @ema wrote: >>>! In T270270#6699761, @gerritbot wrote: >> Chan...
[21:16:31] Traffic, MediaWiki-Docker, Operations, serviceops, and 2 others: docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header` - https://phabricator.wikimedia.org/T270270 (ema) >>! In T270270#6699806, @gerritbot wrote: > Change 650191 had a related patch set uploaded...
[21:48:24] Traffic, Operations, Technical-blog-posts: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (srodlund) @ema I published this. Will you look it over and let me know if you see anything that needs changing or fixing before I...
[21:54:40] Traffic, Operations, Technical-blog-posts: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (Aklapper) The bottom says that `This post is part 2 of a 3 part series.` (Plus I wonder if `million` and `billion` should really...
[22:06:20] Traffic, Operations, Technical-blog-posts: 3rd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T270074 (srodlund) Ah! Thank you for catching that. I fixed both of these.