[00:05:22] so re: ulsfo/esams, there's an outstanding question I'm working on on whether to place the servers local to the site (e.g. on the bastions) or in eqiad/codfw and pull over the network [00:11:34] well [00:12:08] that sounds complicated and I don't know enough, but: [00:12:41] 1) It's probably a requirement that we can see global aggregate stats from some datasource, perhaps both eqiad and codfw [00:13:09] 2) It's probably a requirement that a link outage between e.g. eqiad<->esams doesn't mean losing stats in the interim, either. [00:13:28] 3) also, Asia is Coming [00:14:35] making tons of assumptions about how all related things work, I would suspect something like this would work: [00:15:17] 1. for codfw/eqiad hosts, poll stats directly into both from both, so they serve as redundant sources of global data? (but then what happens when they're split for a while? does it have some way to-resync both views later?) [00:15:40] 2. for the remote DCs, poll stats from local servers (e.g. on the bastions), and have it forward that data to eqiad+codfw as well? [00:15:56] (and hopefully it can replay the forwarding when it finally reconnects) [00:22:26] yeah I had sth like that in mind, sort-of a meld of 1 and 2, there's per-site servers that poll only from hosts in the site and that gives us the local detail. For global view there's another prometheus instance / datasource "global" that in turn polls data from all the per-site prometheus and serves aggregates [00:24:04] that moves the "can't poll data due to network unreachable" problem to the global instance only, which would live in eqiad and codfw, if the global instance can't poll data there will be an interruption in the global aggregates though [00:27:31] anyways I'm going to write all of this down in wikitech and/or phabricator [00:29:26] 10Traffic, 06Operations: 503 errors for users connecting to esams - https://phabricator.wikimedia.org/T149865#2767016 (10fgiunchedi) Update: it was due to a suspected problem with eqiad<->esams wave link, @bblack has failed over to the MPLS eqiad<->knams link and things seem to have stabilized for now. [02:04:58] godog: yeah I don't think it's such a big deal if we have temporary loss in the aggregates. but it would be nice to have the holes fill themselves back in on reconnect. [02:53:27] 07HTTPS, 06Operations, 10Wikimedia-Stream, 13Patch-For-Review: stream.wikimedia.org speaks http (not https) on port 443 - https://phabricator.wikimedia.org/T102313#2767183 (10jeremyb) [02:53:29] 07HTTPS, 10Traffic, 06Operations, 06Performance-Team, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2767182 (10jeremyb) [04:13:01] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2767310 (10AndyRussG) [04:33:21] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2767335 (10aaron) The first approach might work using Varnish xkey support. I'm not how far along we ar... 
[05:03:56] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2767345 (10aaron) Another idea is to add a cache-busting parameter to the URLs handed out, like the cac... [06:40:40] 10Traffic, 06Operations: 503 errors for users connecting to esams - https://phabricator.wikimedia.org/T149865#2767401 (10ema) p:05Triage>03High [07:49:47] cp2019's varnish-be is depooled for some reason [07:52:45] <_joe_> ema: uhm I think it could be due to a failure to write to etcd while repooling [07:52:51] <_joe_> we had pybal go rogue [07:53:06] <_joe_> I am about to write a ticket on that [07:53:16] _joe_: thanks, I'll repool it given that it looks sane [07:54:29] <_joe_> ema: or a race condition when moritz restarted etcd [07:54:39] <_joe_> do you know when it was depooled? [07:54:59] nope [07:55:14] <_joe_> we might find out from ganglia [07:56:35] <_joe_> you should check all pools on config-master.wikimedia.org [07:56:37] cp2023 running v4, no obvious fuckups so far [07:57:20] <_joe_> heh, nope [07:57:26] <_joe_> I'll check with confctl [08:00:11] well now I have a good candidate for the next host to upgrade to v4, cp2019 has almost no objects in the backend so I might as well upgrade it immediately :) [08:00:54] conf1* were rebooted on 2016-10-27 [08:01:09] 10:20, 10:26, 10:30 [08:01:19] (for 1001, 1002, 1003) [08:04:38] <_joe_> moritzm: we have a series of different failures happening, sith [08:04:52] <_joe_> ema: can you confirm cp3009 is rightfully depooled from all services? [08:05:21] <_joe_> ema: did you just depool cp2019? [08:05:28] _joe_: I did, see SAL [08:05:32] checking cp3009 [08:05:35] <_joe_> cool [08:06:20] _joe_: cp3009 is properly depooled [08:06:29] <_joe_> ok [08:06:33] <_joe_> sorry, just checking [08:07:41] <_joe_> also, nescio is depooled from pdn_recursor (esams) [08:14:18] _joe_: hmm, nescio was also rebooted about a week ago, it's not entirely impossible I forgot to repool it, but might also have been repooled for other reasons more recently [08:18:53] vk seems to happily produce events from cp2019 [08:19:26] elukey: and cp2023 as well I guess [08:20:22] ah and I forgot that now there are three vk running on each host [08:23:26] so I will check the kafka topics eventlogging-client-side and webrequest-text [08:23:32] just to be sure :) [08:23:53] but for the moment they seem ok [08:28:06] /etc/cron.d/varnish-backend-restart on cp3040 failed: [08:28:10] cp3040: Nov 03 08:20:40 cp3040 varnishd[7540]: Error: (-spersistent): fallocate() for file /srv/sda3/varnish.main1 failed: No space left on device [08:28:36] restarted by hand [08:29:24] <_joe_> argh [08:29:34] <_joe_> what happened there? [08:29:34] we didn't get an email from cron about that though, which is strange [08:31:20] looking into the logs: [08:31:29] Nov 03 08:20:38 cp3040 varnishd[7503]: WARNING: (-spersistent) file size does not fit in reported free space, continuing anyways! [08:32:03] then a bunch of -> Error: (-spersistent): fallocate() for file /srv/sda3/varnish.main1 failed: No space left on device [08:34:24] 361GB varnish.main2 ? 
[08:39:21] 10Traffic, 06Operations: varnish-be not restarting correctly because of disk space issues - https://phabricator.wikimedia.org/T149881#2767575 (10ema) [08:58:24] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2767607 (10Naveenpf) >>! In T144508#2764038, @CRoslof wrote: >>>! In T144508#2740901, @Naveenpf wrote: >> @Aklapper Can you please change title t... [09:37:12] ok so the upgrade is going well from the 503 point of view (as in all quiet on that front) [09:37:27] the hitrate is not recovering as it should though [09:38:15] bblack, ema: openssl 1.1.0b-1+wmf2 uploaded to carbon [09:40:00] first impression is that objects with ReqURL ~ /static/ are cached properly, those with /wiki/ aren't? [09:41:04] 4 hosts out of 8 have been upgraded so far, I'll stop for a coffee, which is needed at this point [09:48:40] 07HTTPS, 10Traffic, 10DBA, 06Operations, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#2767699 (10jcrespo) [10:16:05] moritzm: awesome, thanks :) [10:16:41] bblack: hi! :) [10:17:37] trying to figure out why /wiki/ objects don't seem to be cached by v4, I've disabled the 4-hit-wonder code on cp2023 to no avail [10:19:47] in particular, both backend and frontend always miss on those objects [10:20:11] I've tried removing the 4-hit-wonder code from the frontend because of that reason (it would never hit if the backend always misses) [10:20:34] looking on 2023 [10:22:07] /w/load.php, /static/ and such get cached properly [10:22:51] my first thought was, why are so many 301 TLS redirects spamming by my screen [10:22:52] <_joe_> ema: I'd look at response headers from the backend [10:23:07] they should be 1/100th or less the non-redirect ones [10:23:26] <_joe_> ema: /w/index.php gets cached? [10:23:56] let me see [10:24:39] _joe_: nope (but http://meta.wikimedia.org/w/index.php does) [10:24:59] ema: I think you should perhaps depool them for now [10:25:41] well I donno, give it a few more mins I guess [10:26:29] cp2023 frontend? [10:27:48] bblack: ? [10:27:53] nevermind that last question [10:28:34] ok, I think it should be safe to leave them pooled for a little longer while we try to figure out what's happening. They don't seem to return any garbage (but they miss) [10:29:05] - BerespHeader X-Cache-Int: cp2004 hit/4344 [10:29:15] that's on a wiki article [10:29:19] it was a hit on the backend, lots of hits [10:29:27] cp2004 is v3 [10:29:27] oh 2004 is probably v3 still? [10:29:29] ok [10:29:57] 2001, 2004, 2010 and 2013 are still v3 [10:30:12] 2016, 2007, 2019 and 2023 are v4 [10:31:52] - TTL RFC 1209600 10 -1 1478168856 1478168856 1478168855 0 1209600 [10:31:55] - VCL_call BACKEND_RESPONSE [10:31:57] - TTL VCL 86400 10 0 1478168856 [10:32:00] - TTL VCL 86400 3600 0 1478168856 [10:32:02] - TTL VCL 0 3600 0 1478168856 [10:32:16] that's on v4-fe's reception of a backend response from a v4-be, for a /wiki/ that looks normal [10:32:31] you can see it starts with the RFC TTL 1209600 from Cache-Control [10:32:41] then gets chopped down to 86400 by our 1d TTL cap [10:32:45] then gets chopped down to zero at the end [10:33:07] I think the second field is grace? [10:34:50] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2767815 (10Gilles) >>! 
In T66214#2762226, @bearND wrote: > In addition to that I'd like to know what the relationship between this task and the Thumbor pr... [10:35:57] and that's seems to be happening during vcl_backend_response somewhere [10:36:27] bblack: interesting. It might be useful to focus on varnish-be, they also always seem to miss and we can rule out any varnish<->varnish banana [10:37:23] playing on 2023 [10:37:26] can this have anything to do with the 'Cacheable object with Set-Cookie' exception? [10:37:42] ok [10:38:01] I'll take 2019 [10:39:53] ok the reset to zero I saw, that was from the 4-hit-wonder code [10:40:08] (which makes sense, if the BE was also v4, it will never reach 4-hit status) [10:40:41] right [10:41:11] oh, one thing I just noticed [10:41:16] text is still on deprecated_persistent :P [10:41:23] yup :) [10:41:27] but I doubt that's the root of this problem [10:41:40] well [10:41:55] ok, yeah, still it's unlikely the issue here [10:45:01] I could almost believe something like "persistent is so borked it never returns cache objects for hits, and FE now refuses to cache in turn due to 4-hit-wonder" [10:45:17] but disabling 4-hit-wonder has no effect on the malloc frontend still refusing to have cache hits [10:45:25] yep [10:46:03] ok wait, I'm getting *some* frontend hits on /wiki/ paths, at least now with 4-hit-wonder gone [10:46:15] really? I didn't get any [10:46:25] that's interesting [10:46:32] bleh they're 301s [10:46:44] heh :) [10:46:53] or they're 200s on special things, like: /wiki/Special:CentralAutoLogin/checkLoggedIn?type=script&wikiid=eswiki&proto=https [10:50:07] ok I'm gonna go pour round 2 of espresso first [10:58:53] 07HTTPS, 10Traffic, 06Operations, 06WMF-Communications, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2767875 (10Florian) [11:02:42] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2767879 (10Gilles) In the current examples, I think it's unfortunate that height-constraining isn't considered. Not as a feature that would be available i... [11:04:54] ema: on the backend, it's easy to believe a persistent explanation. You can see on the -b side of the backend that it believes these objects have real TTLs, and selects storage persistent main1 [11:05:11] yet, on the -c side of the backend, every lookup on such objects is a miss [11:05:31] yeah I've seen that. Also, in any case /static/ objects do get cached fine [11:05:43] that's why I doubt persistent is the culprit here [11:05:57] well [11:06:16] mostly the only reason I doubt persistent is that a 4-hit-wonder-less frontend still doesn't cache things either [11:06:41] (or cache them from v3 backends, for that matter) [11:06:41] btw, for easy repros we can use pinkunicorn [11:06:44] curl -H 'Host: en.wikipedia.org' -v https://pinkunicorn.wikimedia.org/wiki/Main_Page [11:07:46] bblack: should I depool the v4 hosts while we investigate further? [11:08:09] 07HTTPS, 10Traffic, 06Operations, 06WMF-Communications, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2767895 (10Florian) Update: There was a new user reporting such a problem (from the US coast guard, and he was reportin... [11:10:14] ema: if we have a repro on cp1008, then yeah. is there an example /static/ that hits on cp1008, too? 
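A quick aside for readers following the debugging above: the "1d TTL cap" and the "4-hit-wonder" code being toggled here are both small frontend vcl_backend_response rules. Below is a minimal sketch of their general shape only; the dummy backend and the exact hit-count regex are assumptions for illustration, not the production puppet-managed VCL (the "cpNNNN hit/<count>" X-Cache-Int format does appear verbatim in the varnishlog output quoted above).

```vcl
vcl 4.0;

backend dummy { .host = "127.0.0.1"; }   # placeholder so the sketch loads

sub vcl_backend_response {
    # The "1d TTL cap": the origin sends s-maxage=1209600 (14 days), Varnish
    # parses that into beresp.ttl, and we chop it to one day. This is the
    # 1209600 -> 86400 step visible in the "TTL RFC" / "TTL VCL" lines above.
    if (beresp.ttl > 1d) {
        set beresp.ttl = 1d;
    }

    # The "4-hit-wonder" admission policy (frontend instances only): admit an
    # object into this cache only once the backend layer has already served
    # it as a hit at least 4 times, per the hit counter in X-Cache-Int
    # ("cpNNNN hit/<count>"). Everything else gets its TTL forced to zero --
    # the final "chopped down to zero" step in the TTL lines. If the backend
    # layer always misses, the counter never reaches 4 and the frontend never
    # admits the object, which is why this was temporarily disabled while
    # debugging.
    if (beresp.http.X-Cache-Int !~ " hit/([4-9]|[0-9]{2,})$") {
        set beresp.ttl = 0s;
    }
}
```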
[11:10:19] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#2767897 (10Gilles) FYI I've sort of implement a solution for this on Vagrant a while ago for the current thumbnail URI scheme, by replacing... [11:11:54] bblack: yeah, forgive the ugly URL: [11:11:56] https://pinkunicorn.wikimedia.org/w/load.php?debug=false&lang=en&modules=ext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.sectionAnchor%7Cmediawiki.skinning.interface%7Cskins.vector.styles%7Cwikibase.client.init&only=styles&skin=vector [11:12:04] this one hits [11:12:14] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#2767898 (10Gilles) The option that makes that happen is $supportsSha1URLs on the FileRepo [11:14:27] 07HTTPS, 10Traffic, 06Operations, 06WMF-Communications, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2767900 (10BBlack) Keep in mind we've had the GlobalSign incident ~ 2 weeks ago, which precipitated us switching Global... [11:15:41] 07HTTPS, 10Traffic, 06Operations, 06WMF-Communications, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2767902 (10BBlack) [As an aside, we'd **love** to have more-direct contact on this from someone technical inside these... [11:17:37] ema: I still have puppet disabled on cp2023 btw. should I re-enable or? [11:17:44] despite one Raft Internal Error, v4 codfw-text depooled [11:17:58] bblack: as you wish, it's depooled now [11:19:13] cp1008 with its backend-to-self is much easier to debug regardless [11:19:19] exactly [11:20:02] that /w/load.php URL pasted above is currently a backend hit, not yet a frontend hit (because 4-hit wonder) [11:22:04] objects that do get cached have Cache-control: public, max-age=300, s-maxage=300 [11:22:11] /wiki/ stuff has Cache-control: s-maxage=1209600, must-revalidate, max-age=0 [11:22:49] (just dumping here the main differences I'm noticing from visualdiffing varnishlog output) [11:23:01] right [11:23:16] but we can see in the TTL lines it's taking the TTL from s-maxage where applicable, as expected [11:23:28] assuming the TTL lines can be trusted [11:24:33] hitrate going back up nicely after the depools [11:28:05] that pasted load.php URL is a 404 for me [11:28:18] with Host: en.wikipedia.org? [11:29:32] curl -v -H 'Host: en.wikipedia.org' https://pinkunicorn.wikimedia.org/static/images/poweredby_mediawiki_88x31.png [11:29:37] ^ that's simpler and works [11:29:44] without, but regardless [11:32:49] - TTL RFC 1209600 10 -1 1478172745 1478172745 1478172744 0 1209600 [11:33:01] - TTL RFC 31536000 10 -1 1478172639 1478172639 1478172638 1509708638 31536000 [11:33:11] the first is /wiki/Main_page, the latter is the working /static/ [11:33:52] https://www.varnish-cache.org/docs/4.1/reference/vsl.html - search for "TTL set on object" [11:34:01] TTL - Grace - Keep - Reference time for TTL - Age (incl Age: header value) - Date header - Expires header - Max-Age from Cache-Control header [11:34:13] the second-to-last number is supposedly Expires [11:34:25] why does the working one have Expires in the future, and the borked one has Expires:0 ? 
[11:34:56] ah, the /static/ one does set expires [11:34:57] - BerespHeader Expires: Fri, 03 Nov 2017 11:30:38 GMT [11:35:04] the other doesn't [11:35:04] whereas most of our /wiki/ wouldn't [11:38:54] it's chunked, streamed, and pre-gzipped by the applayer [11:39:07] I wonder if that's influencing something about object storage? [11:39:23] (that's probably new, we probably weren't streaming before, or using HTTP/1.1 in general before for chunked) [11:39:50] right a big difference is that we're using http/1.1 now [11:40:00] oh wait I'm wrong, for whatever reason [11:40:14] it is different for nginx->varnish, but apparently not varnish-be->app [11:40:27] (which is also HTTP/1.1 and chunked, looking at cp1065) [11:40:47] 411 RxHeader b Content-Encoding: gzip [11:40:55] 411 RxHeader b Transfer-Encoding: chunked [11:40:55] 411 RxHeader b Content-Type: text/html; charset=UTF-8 [11:40:55] 411 Fetch_Body b 3(chunked) cls 0 mklen 1 [11:41:13] tx/rx protocol was 1.1 as well [11:41:25] (that's cp1065 fetch from mw of a /wiki/foo) [11:49:34] ema: I think alex has puppet disabled everywhere for some pupetmaster stuff [11:49:47] oh wait, over now I think [11:49:59] cp1008 is still disabled, though [11:51:02] seems re-enabled now [11:52:37] cp1008:/tmp/x.diff has a varnishlog diff betweeen /wiki/Main_Page and /w/load.php [11:52:56] CC, ETag, Expires are some of the differences [11:53:51] Vary: Accept-Encoding,Cookie,Authorization vs Vary: Accept-Encoding [11:53:55] yeah [11:54:08] Vary:Cookie has some potential for complex interactions [11:54:24] can I stop puppet and manually edit+reload frontend vcl? [11:54:29] sure [11:54:56] I'm off for lunch now, ttyl [11:58:13] ok [12:01:35] yeah it's cookie-related [12:02:15] as in: if the backend response has Vary:Cookie, something about our crazy cookie save/restore stuff in text-common breaks the object stored into the cache such that won't much a future lookup going through the same code [12:04:51] all I have to do to "fix" frontend caching of Main_page on cp1008 is disable the 4-hit-wonder code and then disable all the code in text-common's text_common_misspass_restore_cookie and evaluate_cookie [12:05:10] (which would break the site, so it's not a solution, but it fixes the curl requests that had no cookies to begin with, so it definitely narrows the problem) [12:06:40] something has changed about exactly where/when/how it looks at Vary-specified headers for lookup and storage purposes [12:07:34] or alternatively, something's wrong with how we're setting X-Orig-Cookie and/or Cookie that has subtle changed (e.g. setting to the value of an unset header results in an empty header, which has a Vary meaning, or something) [12:10:39] I think... that the Vary slot is being set from bereq headers rather than req headers, that might be the change [12:10:48] (for storing, not for lookup) [12:11:06] so for a Vary:Cookie response like /wiki/Main_page... 
[12:11:34] the initial storage lookup uses req.http.Cookie, which we've filtered out and unset (so it's looking in the Vary slot for "no such header Cookie:") [12:12:15] but on miss->fetch, once the response comes back and needs to be inserted in the cache, it's using bereq.http.Cookie to set the storage Vary slot, which is the original Cookie: value (we restored it to ensure the backend applayer sees it) [12:13:06] and on top of that, even though we're sending no cookies at all with curl, I think we still end up with a distinction there between "no cookie header at all" and "empty cookie header set by the restore code", due to some other subtle change in VCL semantics [12:25:00] well I donno about all of that above [12:25:12] but it's something inter-related with all of those things. I can't quite pin it down yet. [12:25:53] by guarding the set of X-Orig-Cookie with an if(req.http.Cookie) (avoiding setting it from a boolean-false value), it lets no-request-cookies-at-all requests use the cache effectively [12:26:20] but requests with random cookies still fail to use cache objects from no-cookie requests like they should, and don't cache against themselves either [12:27:02] but they do use a no-cookie cache entry for lookup heh [12:27:21] so if I get a cache entry stored from a no-cookie request, and request with a random (not-session-related) cookie will hit that object [12:27:36] but it doesn't create such an object, if all of my requests have random non-cookie-related headers [12:27:57] I think that does imply that the storage vary slot is being set from bereq.http.Cookie [12:28:11] (whereas in v3, it was set from req.http.Cookie) [12:29:21] correction 3 lines up: but it doesn't create such an object, if all of my requests have random non-**session**-related headers [12:33:19] rewinding to what the intended behavior is: [12:34:06] restore_cookie should be restoring a Cookie: header for the actual backend request, but we don't intend for it to affect the local varnish in any way (including Vary-slotting for the fetched object) [12:34:34] we were doing the restoration from vcl_pass+vcl_miss in v3. we moved that to vcl_backend_fetch for v4 [12:34:46] which I think makes sense [12:35:44] we might have to re-remove the cookie (replacing with Token=1 as appropriate, as in evalute_cookie) shortly afterwards in vcl_backend_response? [12:40:00] yeah [12:40:58] that fixes a lot of things. seeing if I can remove the new cookie-guard code as well, or if both fixups are necc [12:42:19] yeah the set-cookie-guarding stuff was a red herring. it was just working around part of the problem for the edge-case of requests without any cookies [12:43:46] *** TL;DR: [12:44:41] 1. V4 behavior change is: when storing a new object into cache, it uses bereq.http.Foo to set the Vary slot, whereas v3 used req.http.Foo. Both still do the initial cache lookup using req.http.Foo. [12:45:33] 2. Our code restores cookies necessarily in vcl_backend_fetch, because the backend needs to see them (and this makes the original cookie be used for Vary, which we don't want happening, and wasn't happening in v3) [12:46:07] 3. However, apparently it's only later in vcl_backend_response (or later) that bereq.http.Cookie actually gets used for Vary, which is after the applayer request is all done with [12:46:55] 4. 
So, we just need an additional small block of code in vcl_backend_response that re-strips bereq.http.Cookie in the same fashion as evalute_cookie (but doesn't have to bother saving the value in to X-Orig-Cookie or ever restoring it again) [12:50:01] reconfirmed the above from a clean slate [13:07:25] ema: https://gerrit.wikimedia.org/r/#/c/319561/ [13:28:42] that's awesome :) [13:32:00] ema: assuming that works out ok, we'll probably need to deploy the VCL update to the converted nodes, wipe their storage (restart frontends, stop->wipe->start persist backends), and then pool them slowly to let caching start to fill in (maybe pool all the backends in first?) [13:32:12] well really, we should fix the persist thing too [13:32:23] it will just get harder the more nodes we convert before fixing it [14:01:53] FYI - twice now I've had to fixup icinga alerts for a dead varnish-backend [14:02:30] it's the cron varnish-backend-restarts on v3 text nodes. they try to restart "quickly", and sometimes w/ persistent it takes a sec for the filesystem to catch up to reality and know there's enough free space for fallocate() [14:02:55] so sometimes that fails startup enough times in a row that systemd gives up, but by the time icinga notices, it's fixable with "service varnish start" and then "pool" [14:03:11] (or the next puppet run would've fixed the instance, but not the pooling) [14:03:24] I figure it's low-incidence for now, and will fix itself after v4 conversion anyways [14:06:12] looks like cp3040 was hit with it too while i wasn't looking, and puppet-fixed itself but remained backend-depooled [14:06:21] (fixed now) [14:06:40] can periodically check for them with: confctl select cluster=cache_text,service='varnish-be.*' get|grep no [14:07:00] (and mentally filter out the known ones: the current depooled v4 nodes in codfw, and cp1052 because it has network issues) [14:33:57] 07HTTPS, 10Traffic, 06Operations, 06WMF-Communications, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2768462 (10Florian) >>! In T128182#2767902, @BBlack wrote: > [As an aside, we'd **love** to have more-direct contact on... [14:53:09] 10Traffic, 06Operations, 13Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2768492 (10BBlack) Update: We're now preferring chapoly to other symmetric algorithms outright in our strongest cipher suites at the top of the list, without prefe... [15:28:39] 10netops, 06Labs, 06Operations, 05Prometheus-metrics-monitoring: Firewall rules production/labs for prometheus-node-exporter - https://phabricator.wikimedia.org/T149253#2768566 (10fgiunchedi) [15:57:53] bblack: yeah that would be T149881 [15:57:54] T149881: varnish-be not restarting correctly because of disk space issues - https://phabricator.wikimedia.org/T149881 [15:58:05] it happened earlier on today [16:01:07] I guess we could sleep a little before /usr/sbin/service varnish start [16:07:41] (or tell systemd to keep trying a little longer?) 
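To make the TL;DR above concrete, here is a minimal sketch of the whole save/restore/re-strip flow it describes, using the subroutine names mentioned in the discussion (evaluate_cookie, text_common_misspass_restore_cookie). The session-cookie regex, the "Token=1" bucketing and the dummy backend are illustrative assumptions; this is a sketch of the shape of the fix, not the contents of the actual change in https://gerrit.wikimedia.org/r/#/c/319561/.

```vcl
vcl 4.0;

backend dummy { .host = "127.0.0.1"; }   # placeholder so the sketch loads

# Client side: normalize Cookie before lookup so cached objects only vary on
# "session/token cookie present or not", and stash the original value.
sub evaluate_cookie {
    if (req.http.Cookie) {
        set req.http.X-Orig-Cookie = req.http.Cookie;
        if (req.http.Cookie ~ "([sS]ession|Token)=") {
            set req.http.Cookie = "Token=1";   # logged-in bucket
        } else {
            unset req.http.Cookie;             # anonymous bucket
        }
    }
}

sub vcl_recv {
    call evaluate_cookie;
}

# Backend side: the applayer still needs the real cookies, so restore them
# onto the bereq (in v3 this happened on req.* in vcl_miss/vcl_pass).
sub text_common_misspass_restore_cookie {
    if (bereq.http.X-Orig-Cookie) {
        set bereq.http.Cookie = bereq.http.X-Orig-Cookie;
        unset bereq.http.X-Orig-Cookie;
    }
}

sub vcl_backend_fetch {
    call text_common_misspass_restore_cookie;
}

sub vcl_backend_response {
    # v4 builds the stored object's Vary slot from bereq.http.*, not from
    # req.http.* as v3 did, so re-strip the restored Cookie here (the applayer
    # request is already done with it) to keep the Vary slot consistent with
    # what evaluate_cookie left on the req at lookup time. No need to save it
    # into X-Orig-Cookie again; nothing downstream wants it back.
    if (bereq.http.Cookie ~ "([sS]ession|Token)=") {
        set bereq.http.Cookie = "Token=1";
    } else {
        unset bereq.http.Cookie;
    }
}
```

With something of this shape in place, the same curl checks against pinkunicorn used earlier in the log (watching X-Cache on /wiki/Main_Page go from miss to hit/N) are enough to confirm that both the backend and the frontend admit Vary: Cookie objects again.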
[16:09:11] at any rate, we'll get rid of persistence soon so perhaps a simple sleep is enough [16:12:03] 10Traffic, 06Operations, 13Patch-For-Review: varnish-be not restarting correctly because of disk space issues - https://phabricator.wikimedia.org/T149881#2768711 (10ema) p:05Triage>03Normal [16:12:48] bblack: proposed workaround -> https://gerrit.wikimedia.org/r/319596 [16:29:53] 10netops, 06Operations: Investigate why disabling an uplink port did not deprioritize VRRP on cr2-eqiad - https://phabricator.wikimedia.org/T119759#2768782 (10mark) It's actually working as designed. Our current configuration looks like: track { interface ae4.1004 {... [16:44:53] s/persistent/file/ https://gerrit.wikimedia.org/r/#/c/319609/ [16:45:18] wanted to run pcc on it but jenkinsapi doesn't like me today [17:00:07] 10netops, 06Labs, 06Operations, 05Prometheus-metrics-monitoring: Firewall rules production/labs for prometheus-node-exporter - https://phabricator.wikimedia.org/T149253#2768875 (10akosiaris) 05Open>03Resolved a:03akosiaris After a couple of rounds, finally done. Tested with telnet on a few hosts and... [17:03:09] 10netops, 06Operations, 10ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2768895 (10faidon) The 2<->5 link was due to a faulty cable. Chris has replaced that and the stack is fully formed now, albeit with not much redundancy (still waiting... [17:30:47] back [17:38:39] bblack: I'm not sure whether we were planning to do something smarter for text then a simple s/persistent/file/ or not [17:39:29] in any case we've got quite some ugliness to get rid of from the puppet code once all nodes are on v4 :) [17:42:37] :) [17:42:57] I think for text we'll assume we don't need size-splitting, unless evidence indicates otherwise [17:43:08] probably the distribution of object sizes is considerably less-crazy [17:43:55] we probably should tune up jemalloc at least as far as we did for upload on all of them, since upload has the largest sizes, though [17:46:59] prepping some commits for mem stuff and storage stuff [17:48:35] cool [17:56:34] another validation error report: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Safari [17:57:23] bblack: I did already prep the s/persistent/file/ part :) https://gerrit.wikimedia.org/r/#/c/319609/ [17:57:41] oh! [17:58:46] maybe we can amend that one to remove the conditional on varnish_version4 in base.pp like you did on https://gerrit.wikimedia.org/r/#/c/319626/1/modules/role/manifests/cache/base.pp [18:03:32] MaxSem: responded, thanks [18:03:43] ema: eh it can wait [18:05:44] lol both our changes are ugly [18:05:52] there is no role::cache::2layer anymore :P [18:06:10] gh [18:07:12] confirmed by pcc that finally managed to build the change and found no differences for cp3033 and cp1008 [18:08:51] so that should be role::cache::base::varnish_version4 I guess [18:09:21] 10Traffic, 06Operations, 13Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2769186 (10BBlack) Had a chance to dig through our NavTiming performance metrics. There's some slight hints of improvement here and there, but probably just wishfu... 
[18:09:50] yeah [18:09:58] ok, amending [18:13:27] clearly the range between -2 and +2 is not enough for a meaningful CR [18:13:29] lol is needed [18:14:29] right, pcc agrees this time around https://puppet-compiler.wmflabs.org/4538/ [18:17:36] running puppet on the depooled v4 text nodes [18:19:53] I'm gonna wipe cp2016's storage and test it a bit, if all is good I'll repool it [18:20:56] ok [18:21:32] I need to get back on the overall mem sizing thing sometime too [18:21:41] varnish-frontend restart and for the backend: [18:21:45] sudo systemctl stop varnish && sudo rm -f /srv/sd*/varnish*; sync; sleep 3; sudo systemctl start varnish [18:21:52] that should be enough I think [18:21:59] yeah [18:23:18] < X-Cache: cp2004 hit/4, cp2016 hit/2 [18:23:19] nice [18:23:26] (for /wiki/Main_Page) [18:25:11] and who forgot the chmod? I did! :) [18:26:20] repooling cp2016, looks sane [18:27:38] yeah frontend hitrate > 70% already [18:29:52] cp2001 hit/4, cp2016 hit/1 GET http://en.m.wikipedia.org/wiki/Slovenia HTTP/1.1 [18:29:55] \o/ [18:31:55] I'll wipe cp2007 cp2019 and cp2023 as well, we can repool them later on if nothing weird comes up with 2016 [18:34:41] wiped [18:43:49] 10Traffic, 06Operations, 13Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2769467 (10BBlack) The more I dig and study the relevant information that's out there, and especially in light of the invisible impact of the chapoly change on our... [18:47:17] staring at f_nob and b_nob is quite interesting, we get ~1k new objects into the backend every 10s [18:47:33] while the frontend gets ~2k new ones [18:48:36] makes sense [18:49:07] be is a chashed fraction of all possible objects, and fe sees a statistical sampling of 100% of all possible objects [18:52:36] are the other ones able to be puppeted (even if they stay depooled)? [18:52:51] (there's 4x in codfw that are puppet-disabled still) [18:53:03] the disabled ones are v3 [18:53:27] (and pooled) [18:53:54] why are they disabled? [18:54:01] oh because they're already varnish_version4 in puppet [18:54:05] because of my usual optimisim [18:54:07] got it :) [18:55:37] feel free to repool the other v4s if you've got time later on :) [18:55:43] is it repooled? [18:55:52] yes 2016 is pooled [18:55:55] I might repool the others and then finish conversion [18:56:01] what kind of timing were you using before? [18:56:54] excessively long while trying to figure out what was going on with the hitrate [18:56:59] ok :) [18:57:57] anyways, yeah, I'll finish up codfw text v4 conversion [18:58:00] and if something looks awful, we'll just depool codfw in dns [18:58:06] +1 [19:00:25] if everything looks pretty, I'll carry on tomorrow with the next DC (ulsfo after routing it to codfw)? 
[19:01:30] maybe re-route first, the convert, but yeah [19:01:45] I seem to recall v3 fetch from v4 was not as bad as v4 fetch from v3 [19:02:06] oh maybe that's what you said :) [19:02:08] right yeah that's what I meant :) [19:02:15] anyways, enjoy :) [19:02:19] o/ [20:20:15] while the cookie problem was real, monitoring hitrate recoveries on codfw is tricky to begin with [20:20:36] it gets so little traffic that the hitrate naturally drops off notably during its overnight lull period [20:20:56] and even during its peak-ish times, there's so little traffic that it takes an inordinately long time for it to refill its long-tail objects [22:40:09] ema: a couple of tricky things come to my mind, that we should verify are working right on codfw before moving further: [22:41:11] 1. We should validate that placing beresp.uncacheable (was: hfp) in a Vary slot at all still works as it did before. A good test of this would be whether logged-in pageview of an article causes miss (well, pass) of the same article for anon temporarily [22:43:44] 2. We should verify that we're seeing proper caching of URLs in direct-backends (which codfw backends currently are) that are mangled in misspass_mangle(), on the plausible theory that modifying bereq.url and/or bereq.http.host there (there being vcl_backend_fetch) could affect the hashing of the object into storage. [22:44:14] neither of those would've seemed like things to worry about before, but after the unexpected change in the effects of beresp.http.* on Vary storage slots.... [22:44:30] should be careful [22:44:59] in both cases it would mostly manifest as missed opportunities for cache hits, and at least in codfw possibly not very big ones to notice statistically
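For context on the two proposed checks, a small sketch of the v4 constructs involved. The URL/Host rewrite rule, the hit-for-pass condition and its 601s lifetime below are made-up placeholders, not the real misspass_mangle or text VCL rules; the point is only that both the beresp.uncacheable marker and any bereq mangling now run on the backend side of the v4 state machine, the same place the Vary-slot surprise above came from, which is why both deserve verification.

```vcl
vcl 4.0;

backend dummy { .host = "127.0.0.1"; }   # placeholder so the sketch loads

sub vcl_backend_fetch {
    # misspass_mangle-style rewrite (path and hostname are hypothetical): in
    # v3 this ran against req.* in vcl_miss/vcl_pass, after vcl_hash had
    # already fixed the object's hash key; in v4 it mutates bereq.* instead,
    # so check 2 above is about confirming such objects still get stored and
    # looked up as expected on direct-to-applayer backends.
    if (bereq.url ~ "^/example/") {
        set bereq.url = regsub(bereq.url, "^/example/", "/w/");
        set bereq.http.Host = "appservers.example.internal";
    }
}

sub vcl_backend_response {
    # v4 replacement for v3's "return (hit_for_pass)": mark the response
    # uncacheable but keep a marker object around for a while so later
    # requests pass instead of queueing behind each other. The marker is a
    # stored object like any other, so how it lands in a Vary slot is exactly
    # what check 1 above (logged-in vs. anon pageview of the same article)
    # is meant to probe.
    if (beresp.http.Set-Cookie || beresp.http.Cache-Control ~ "private") {
        set beresp.ttl = 601s;
        set beresp.uncacheable = true;
        return (deliver);
    }
}
```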