[00:24:50] Traffic, Operations, Wikidata, Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Smalyshev) @Addshore btw do I understand right that constraints can not be fetched per-revision? In this case, do...
[07:11:58] !log disabling puppet in acme-chief clients to merge I437b91c177d97b863a4356ca30b388561f5b9d3c safely
[07:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:40] Traffic, netops, Operations: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 (elukey) p:Triage→High
[08:35:10] Traffic, Operations, Wikidata, Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Addshore) >>! In T217897#5066900, @Smalyshev wrote: >> WDQS does know what the latest version of the entity that...
[09:18:01] Traffic, netops, Operations, Patch-For-Review: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 (ema)
[10:41:35] mmh interesting, it looks like the 503s generated by varnish when fetching from ATS in codfw are all gzip related
[10:41:40] -- FetchError Invalid Gzip data: incorrect header check
[10:42:15] hmmm weird, can we capture some traffic samples?
[10:42:47] I've got the varnishlogs under cp2002.codfw.wmnet:~ema/cp2002-503.log
[10:42:56] they're all 404s from ATS
[10:43:02] I was thinking of some .pcap
[10:43:03] * vgutierrez checking
[10:43:05] see (BerespStatus)
[10:48:19] so the actual response body is "Not Found", uncompressed (CL: 9)
[10:48:21] pretty weird.. the content length is 9 bytes and the gzip header alone is 10 bytes
[10:48:24] yeah
[10:49:06] varnish backends set the CT to "text/html; charset=UTF-8", ATS to "gzip"
[10:49:29] compare the following:
[10:49:31] curl -v -H "Accept-Encoding: gzip" -H "Host: upload.wikimedia.org" cp2002.codfw.wmnet:3128/wikipedia/commons/thumb/b/b7/Fridtjof_Nansen_1880.jpg -o /dev/null
[10:49:41] vs
[10:49:43] curl -v -H "Accept-Encoding: gzip" -H "Host: upload.wikimedia.org" cp2009.codfw.wmnet:3128/wikipedia/commons/thumb/b/b7/Fridtjof_Nansen_1880.jpg -o /dev/null
[10:49:50] the former is varnish-be, the latter ats-be
[10:51:19] why have we not been seeing this all the time? :)
[10:51:37] maybe the update to ATS 8.0.3 is related?
[10:53:26] https://raw.githubusercontent.com/apache/trafficserver/8.0.x/CHANGELOG-8.0.3 doesn't show anything obviously related though
[10:55:34] uh, works fine in eqiad
[10:55:43] compare:
[10:55:55] curl -v -H "X-Forwarded-Proto: https" -H "Accept-Encoding: gzip" -H "Host: upload.wikimedia.org" cp1076.eqiad.wmnet:80/wikipedia/commons/thumb/b/b7/Fridtjof_Nansen_1880.jpg -o /dev/null
[10:55:59] vs.
[10:56:05] curl -v -H "X-Forwarded-Proto: https" -H "Accept-Encoding: gzip" -H "Host: upload.wikimedia.org" cp2002.codfw.wmnet:80/wikipedia/commons/thumb/b/b7/Fridtjof_Nansen_1880.jpg -o /dev/null
[10:57:00] uh
[10:57:03] https://github.com/apache/trafficserver/issues/2849
[10:59:39] a few lines above, what I said about CT is incorrect. Both varnish-be and ats-be set Content-Type to "text/html; charset=UTF-8"; the difference is that ATS also sets Content-*Encoding* to gzip
[10:59:43] it's still pretty weird, it's reporting a 404 gzip-encoded with content-length 0
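To make the failure mode above concrete: the object is a plain-text 9-byte "Not Found" body labelled Content-Encoding: gzip, while the fixed gzip header alone is 10 bytes, so any gzip-aware consumer bails out on the very first check. A minimal sketch of that, using Python's zlib rather than varnish itself:

    import zlib

    body = b"Not Found"                 # the 9-byte uncompressed 404 body (CL: 9)
    assert len(body) == 9               # shorter than gzip's fixed 10-byte header

    try:
        # ask for a gzip-wrapped stream (wbits = 16 + MAX_WBITS), which is what a
        # consumer has to do when the response claims Content-Encoding: gzip
        zlib.decompress(body, 16 + zlib.MAX_WBITS)
    except zlib.error as exc:
        print(exc)                      # "... incorrect header check" -- the same
                                        # failure varnish logs as a FetchError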
[11:00:51] BTW... this reminds me of the content-length gzip optimization that we have in place in varnish
[11:01:15] dunno if you applied something similar in ATS
[11:01:54] here's the header diff, for the record: https://phabricator.wikimedia.org/P8310
[11:02:45] I'm talking about https://gerrit.wikimedia.org/r/c/operations/puppet/+/419228
[11:03:46] vgutierrez: nope, we're not doing that in ATS land!
[11:03:53] nice catch
[11:04:15] it's still puzzling how ATS in eqiad would behave differently than in codfw though
[11:05:54] oh no, wait a moment, ATS in eqiad behaves the same, it's varnish that behaves differently
[11:06:15] errr
[11:06:28] https://www.irccloud.com/pastebin/5c97X0Pb/
[11:06:29] I just got this
[11:06:33] a 503 in eqiad
[11:06:52] but it's flapping between 404 and 503
[11:07:11] uh indeed
[11:07:15] when I tried I got a 404
[11:07:27] wtf
[11:08:10] yeah
[11:08:13] I got a 404 before
[11:12:36] so varnish reports "FetchError Invalid Gzip data: incorrect header check" in varnishlog even when returning 404 to the client
[11:13:27] 'cause ATS is reporting Content-Encoding: gzip to varnish, right?
[11:16:26] it also messes with curl
[11:16:46] right, so it seems that varnish tries and fails to gzip the content, which is fair. What puzzles me is that I'd always expect a 503 (as in https://phabricator.wikimedia.org/P8312) and not the occasional 404 (as in https://phabricator.wikimedia.org/P8311)
[11:21:50] /o\ accidental opening of pandora's box
[11:25:15] every single time
[11:25:33] I'm not awake and aware enough to dig far yet, but I vaguely remember some related things
[11:25:40] so 404 reports X-Cache: cp1072 miss, cp1076 miss || X-Cache-Status: miss, and 503 --> X-Cache: cp1076 int || X-Cache-Status: int-front
[11:26:08] we did some things with gzip magic in varnish, to make sure varnish-be could cope with the applayer (because of varnish's and/or applayer's issues at the boundary), and the FEs may have been relying on the BEs to clean it up for them.
[11:26:18] so with varnish-fe -> ATS, we may need to revisit it
[11:26:57] vgutierrez: yeah the different X-Cache is due to the fact that the 503 is generated internally by varnish (hence "int")
[11:28:26] yep, but it doesn't go back to 1072 to attempt to fetch it.. do we cache 404s on the fe for a small amount of time?
[11:28:54] ema: first off, do we even have the "gzip these mime types" thing in ATS, or is it completely different?
[11:29:10] common_backend_response in varnish-land has:
[11:29:11] if (beresp.http.content-type ~ "json|text|html|script|xml|icon|ms-fontobject|ms-opentype|x-font|sla"
[11:29:15] && (!beresp.http.Content-Length || std.integer(beresp.http.Content-Length, 0) >= 860)) {
[11:29:18] set beresp.do_gzip = true;
[11:29:22] which tells varnish to gzip those if they're not already gzipped
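Spelled out, the VCL condition quoted above boils down to: compressible Content-Type, and either no Content-Length header at all or a body of at least 860 bytes. A rough Python rendering of just that test, as a readability aid rather than the actual implementation:

    import re

    # content-type filter copied from the common_backend_response VCL quoted above
    COMPRESSIBLE_CT = re.compile(
        r"json|text|html|script|xml|icon|ms-fontobject|ms-opentype|x-font|sla")

    def wants_do_gzip(content_type, content_length=None):
        """content_length=None stands for a response without a Content-Length
        header, which the VCL treats as gzip-eligible regardless of size."""
        if not COMPRESSIBLE_CT.search(content_type or ""):
            return False
        return content_length is None or content_length >= 860

    print(wants_do_gzip("text/html; charset=UTF-8", 9))   # False: under the 860-byte floor
    print(wants_do_gzip("application/json", None))        # True: no Content-Length header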
[11:29:32] vgutierrez: it does actually fetch from the backend, see the extended varnish logs at https://phabricator.wikimedia.org/P8312
[11:30:28] bblack: yeah vgutierrez spotted that earlier! No, we don't have it in ATS, and we should
[11:30:56] the other magic has long commentary, in wikimedia-backend VCL
[11:31:40] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/vcl/wikimedia-backend.vcl.erb#204
[11:31:52] so the source of the issue is actually at the swift level, it seems (returning CE:gzip but the content is not)
[11:32:14] which is basically "blah blah commentary": when fetching from varnish->applayer, turn off AE:gzip, so the applayer's dumb broken gzip stuff doesn't happen and varnish can do it all.
[11:32:20] /o\
[11:32:40] so, maybe we want to do that in ATS-be as well (turn off AE:gzip on the applayer-facing backend connection)
[11:32:51] it looks exactly like that
[11:33:27] sidenote: sorry if I expected varnish to behave deterministically
[11:33:50] that would be too easy ;P
[11:36:49] what's the state of our ATS's gzip currently (as in, does it try to gzip ungzipped things at all?)
[11:37:59] bblack: it does not (there's the gzip plugin for that, if we want to)
[11:38:49] https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/plugins/compress.en.html, it looks like it's been renamed to compress in 8.x
[11:38:56] yeah I just found that heh
[11:39:07] it's ok for the upload case at least
[11:39:20] the varnish-fe will still compress on that content-type filter, it will just shift some CPU load over to varnish-fe
[11:39:49] heh the plugin even has a setting "remove-accept-encoding" to strip AE:gzip towards the origin :)
[11:40:21] they also must have seen things!
[11:40:42] also related:
[11:40:44] https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#proxy-config-http-normalize-ae
[11:41:05] I think varnish was doing this (perhaps in core rather than VCL)
[11:41:10] but only for gzip
[11:41:35] but again varnish-fe is probably still doing it for us too, so all of this is not-critical yet
[11:41:44] except the "kill AE:gzip to the applayer" part
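For reference, a rough sketch of how the two pieces mentioned above (the compress plugin and normalize_ae) could be wired up on the ATS side, going by the linked docs; the file paths, content-type list and normalize_ae value are illustrative assumptions, not a tested configuration:

    # plugin.config -- load the compress plugin (the 8.x rename of "gzip") globally
    compress.so /etc/trafficserver/compress.config

    # compress.config -- strip Accept-Encoding towards the origin (the
    # "remove-accept-encoding" knob mentioned above) and only compress a few types
    remove-accept-encoding true
    compressible-content-type text/*
    compressible-content-type *json*

    # records.config -- normalize the client's Accept-Encoding header; the exact
    # value to use should be checked against the linked doc
    CONFIG proxy.config.http.normalize_ae INT 1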
[11:46:23] bblack: so.. with all this mess we didn't perform the wikiba.se test or deploy.. I think that now it would be wise to wait till Monday
[11:46:30] just to keep the deploy gods happy
[11:46:58] yeah
[11:47:09] also I've found a slightly worrisome thingie in pcc
[11:47:13] let me show you
[11:48:00] from: https://puppet-compiler.wmflabs.org/compiler1002/15419/cp1008.wikimedia.org/
[11:48:15] +++ Service[nginx]
[11:48:15] @@
[11:48:15] - before => [u'Service[varnish]', u'Service[varnish-frontend]']
[11:48:15] + before => [u'Service[varnish]', u'Service[varnish-frontend]', u'Service[varnish]', u'Service[varnish-frontend]']
[11:49:10] I guess that duplication is triggered by https://gerrit.wikimedia.org/g/operations/puppet/+/refs/changes/25/499825/8/modules/profile/manifests/cache/ssl/wikibase.pp#62
[11:49:32] so we have "Service['nginx'] -> Service<| tag == 'varnish_instance' |>" in both wikibase.pp and unified.pp
[11:49:50] maybe it would be wise to move it to a common place?
[11:49:55] or could it be safely ignored?
[11:57:44] it's just a duplicate dependency, it should be fine I think
[12:00:52] note my extra cautious mode :)
[12:04:13] vgutierrez: you really don't want to win that tshirt, do you
[12:04:41] bblack, vgutierrez: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500011/
[12:06:49] ema: LGTM
[13:09:13] Zayo's emergency maintenance window is over, we might want to repool ulsfo in a bit if things look good
[13:09:47] re: T219591
[13:09:47] T219591: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591
[13:11:03] https://gerrit.wikimedia.org/r/#/c/operations/dns/+/500031/
[13:11:55] nice to watch ATS deal with ulsfo's traffic though!
[13:14:11] well we just had another ulsfo nginx alert heh
[13:14:30] the last thing I see from zayo is from 2h ago, claiming they cleared it, but also mentioning more maintenance tonight?
[13:14:34] whenever "tonight" is
[13:16:19] I see there's a new notification: RESCHEDULE NOTIFICATION***Wikimedia Foundation Inc***ZAYO TTN-0003155831 Emergency***
[13:16:41] Maintenance Window: 00:01 - 05:00 Pacific (30-Mar-2019)
[13:24:38] yeah, doesn't look like it's fixed yet
[15:03:03] reading scrollback, seems like we got some flaps, any outage?
[15:03:39] Traffic, Operations: Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (BBlack) There's some complexities here that I've been stewing on for a while, mostly noted in the original description, but I like this general direction. Most of...
[15:03:59] XioNoX: ulsfo is depooled still, but yeah we're wondering when/if the link will be trustable again :)
[15:06:59] ok! will look at it
[15:07:20] they claimed it was fixed but then they also scheduled more maintenance for later
[15:12:58] "this is not an outage, this is a planned maintenance for in -1h"
[15:13:00] :)
[15:14:27] https://twitter.com/honest_update
[15:20:55] bblack: yeah seems like they're waiting for some equipment and will do the work in the night. We can either repool over the weekend/Monday when it's done, or I can increase the cost of that link so the backup is preferred over the weekend and repool today. I'd vote for #2 but no strong opinion
[15:28:03] XioNoX: yeah #2 seems reasonable all things considered, and flip the cost back Monday if all's well
[15:28:15] cool!
[15:28:47] ema: https://mobile.twitter.com/RedTeamPT/status/1110843396657238016
[15:48:52] I wish we'd thought of that first! What a brilliant way to fix bugs whose PoC uses curl :)
[15:49:59] bblack: alright, confirmed codfw-ulsfo traffic goes through eqord, want to repool or should I? I see a pending repool CR on the task
[15:50:46] XioNoX: go for it, I think ema uploaded it
[15:50:54] ok!
[15:52:41] Traffic, netops, Operations, Patch-For-Review: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 (ayounsi) a:ayounsi
[15:56:21] alright, all set!
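For the record, assuming the codfw <-> ulsfo transport link runs OSPF on Junos, the cost bump discussed at [15:20:55] might look roughly like the following; the interface name and metric value are made up, this is a sketch and not the change that was actually applied:

    # hypothetical: make the backup path via eqord win by raising the OSPF
    # metric on the flapping link, to be reverted on Monday if all is well
    set protocols ospf area 0.0.0.0 interface xe-0/1/3.0 metric 2000
    set protocols ospf3 area 0.0.0.0 interface xe-0/1/3.0 metric 2000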
[18:08:57] bblack: https://www.bortzmeyer.org/hackathon-ietf-104.html "DNS Extended Error reporting at the IETF hackathon"
[19:05:17] that's interesting too: https://labs.ripe.net/Members/florian_streibelt/bgp-communities-a-weapon-for-the-internet-part-2 We already follow their recommendations, or will soon
[20:43:10] Traffic, Cloud-VPS, DNS, Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (TheDJ) FYI, I have configured [abc].tiles.wmflabs.org webhosts to redirect to http://tiles.wmflabs.org during {T204506} The...
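The underlying limitation in that last task is that a wildcard in a certificate covers exactly one DNS label (RFC 6125): *.wmflabs.org matches tiles.wmflabs.org but not a.tiles.wmflabs.org. A minimal sketch of that matching rule, illustrative rather than any TLS library's actual implementation:

    def wildcard_covers(pattern, hostname):
        """RFC 6125-style check: '*' stands for exactly one left-most DNS label."""
        p_labels, h_labels = pattern.lower().split("."), hostname.lower().split(".")
        if len(p_labels) != len(h_labels):
            return False
        return all(p == "*" or p == h for p, h in zip(p_labels, h_labels))

    print(wildcard_covers("*.wmflabs.org", "tiles.wmflabs.org"))    # True
    print(wildcard_covers("*.wmflabs.org", "a.tiles.wmflabs.org"))  # False: one label too deep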