[01:53:35] 10netops, 06Operations, 05Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#2688855 (10Dzahn)
[01:54:47] 10netops, 06Operations, 05Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#2688855 (10Dzahn) It was still in Icinga because puppet was disabled on einsteinium temp. It's gone now. done. resolving. T149395 is for datacenter follow-up work.
[01:55:02] 10netops, 06Operations, 05Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#2750936 (10Dzahn) 05Open>03Resolved
[09:09:47] puppet agent disabled on all v4 varnish hosts to test https://gerrit.wikimedia.org/r/#/c/318314/
[09:12:55] testing the change on cp1045 (misc) cp3003 (maps) and cp4014 (upload)
[09:25:59] varnishmedia and varnishxcache are doing the right thing
[09:26:52] I'm running all scripts depending on varnishapi.py by hand, waiting a few secs to get > 10k matching reqs, checking if the output looks sane
[09:27:15] varnishxcps is also ok
[09:37:57] varnishstatsd is fine
[09:40:06] super
[09:40:38] varnishreqstats seems broken :P
[09:41:26] varnishreqstats --interval 10 --verbose <- this gives all-zeroes on hosts with the latest varnishapi.py
[09:44:04] to be more precise, backend.total and client.total report realistic values, all other metrics are 0 (eg: 0 client.status.2xx)
[09:50:51] and GE doesn't seem like a valid request method, no
[10:01:20] heh
[10:01:21] # Remove trailing NULL byte
[10:01:21] value = cbd['data'][:-1]
[10:13:20] sorry didn't follow
[10:13:27] have you found the issue? :(
[10:13:39] yeah, a dumb one
[10:14:15] elukey: https://gerrit.wikimedia.org/r/318514
[10:15:48] CC: volans ^
[10:16:19] Re: ema
[10:17:03] ah nice
[10:18:08] b'\0' would probably do the same, but I like the 0x :D
[10:18:45] ema: LSTM
[10:18:50] volans: great, thanks
[10:19:27] yw
[10:19:41] 0x to stress that it's a friggin' evil NULL byte and not just a pretty and harmless character you might want to invite for dinner
[10:25:07] OK that fixes varnishreqstats
[10:25:13] re-enabling puppet
[10:27:29] rotfl
[11:23:33] Oct 28 11:23:18 cp3036 gmond[87685]: [PYTHON] Can't call the metric handler function for [frontend.MAIN.n_lru_moved] in the python module [varnish].
[11:23:36] Oct 28 11:23:18 cp3036 gmond[87685]: Traceback (most recent call last):
[11:23:39] Oct 28 11:23:18 cp3036 gmond[87685]: File "/usr/lib/ganglia/python_modules/varnish.py", line 80, in get_value
[11:23:42] ^ related?
[11:24:25] hmm no apparently that's been going on a while
[11:25:53] perms
[13:12:46] bblack: oh yeah that's the usual gmond issue with perms
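The varnishreqstats breakage above comes down to the two pasted lines that blindly slice the last byte off the log payload. A minimal Python sketch of the before/after behaviour; the names are illustrative and this is not the actual varnishreqstats or varnishapi.py code (the real fix is the Gerrit change linked at 10:14):

    # Old approach: assume the VSL payload always carries a trailing NUL and
    # drop the last byte unconditionally. With a varnishapi.py that already
    # strips the terminator, "GET" becomes "GE" and every per-method /
    # per-status counter silently stays at zero.
    def parse_value_old(cbd):
        return cbd['data'][:-1]

    # Safer: strip trailing NUL bytes only if they are present, so the value
    # is correct whether or not the bindings include the terminator.
    def parse_value_new(cbd):
        return cbd['data'].rstrip('\x00')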
[13:14:59] bblack: what's the current experiment on cp3046?
[13:16:11] ema: it's running normal config with puppet enabled now, just newer nginx packaging (1.11.6-ish + openssl-1.1 + workaround to avoid the client SSL errors)
[13:16:27] it happens to be the endpoint I've been pointing at the past day or so from my laptop
[13:17:01] I'm asking because the nginx-full version installed is 1.11.3-1+wmf2
[13:21:29] oh you're asking about a different server
[13:21:36] the one I'm on is cp3036, not 3046
[13:21:56] btw, the regression update which Ubuntu and Debian pushed yesterday for nginx doesn't affect our minimised patch, it was only related to a cornercase in the debconf check
[13:22:23] ema: all the rest should be in the normal state (puppet enabled, and 1.11.3-based packages, and showing 8 outstanding package upgrades we don't want because they're the initial broken 1.11.4+openssl-1.1 packages)
[13:22:56] bblack: yeah I've just noticed it's another host. cp1047 and cp3038 are running other nginx versions (1.11.4-1+wmf3 and 1.11.4-1+wmf5 respectively)
[13:23:24] ugh
[13:23:36] 1047 is "special", that probably got messed up because I reimaged it
[13:23:46] I'll look at 3038
[13:25:10] 3038 is fixed now
[13:28:00] I can copy 1.11.3-1+wmf2 from another host to 1047 and downgrade it
[13:28:26] just did
[13:28:32] ah :)
[13:28:45] it's still got one anomaly, but it's not a functional problem
[13:29:04] apt-cache policy openssl says it wants to upgrade to the borked old openssl CLI packages from debian-experimental
[13:29:32] I had to add experimental there post-reimaging (reimaging varnish_version4 hosts means you have to do some of those steps after it comes up)
[13:29:45] but they all have experimental from that, so I don't see why it sees this differently
[13:30:10] oh it doesn't, they all have that :(
[13:30:17] lovely!
[13:30:48] ii openssl 1.1.0-1+wmf2 amd64 Secure Sockets Layer toolkit - cryptographic utility
[13:31:01] I guess we know the CLI works out ok for the cache hosts' case
[13:32:40] moritzm: I thought you deleted those old 1.1.0 packages?
[13:33:16] mhh, I thought I had?
[13:35:09] ah, I remember what the problem was, when I looked into removing those, "reprepro remove" only allowed a selection of code name (like jessie-wikimedia) and source package
[13:35:50] ok
[13:36:06] well the other unfortunate complication is all our varnish4 hosts have the experimental repo turned on :)
[13:37:05] we could remove the experimental repo and only add it if we need to install a new varnish while we're still in a mixed v3/v4 state
[13:37:06] with openssl 1.1 in jessie-wikimedia/backports we can drop that, or is there anything else?
[13:37:22] ah, varnish4
[13:37:50] I figured out how to remove it from crabon
[13:37:52] *carbon
[13:38:07] # reprepro listfilter jessie-wikimedia '$Version (== 1.1.0-1+wmf2)'
[13:38:07] jessie-wikimedia|experimental|amd64: libssl-dev 1.1.0-1+wmf2
[13:38:08] ...
[13:38:17] (luckily nothing else has that exact same version number)
[13:38:22] # reprepro removefilter jessie-wikimedia '$Version (== 1.1.0-1+wmf2)'
[13:42:41] 10Traffic, 06Operations, 10ops-esams: cp3021 failed disk sdb - https://phabricator.wikimedia.org/T148983#2738539 (10mark) This was me, experimenting with ATA Secure Erase. :) I did log it in SAL.
[13:43:04] which version of openssl do we want to have installed?
Almost all varnish4 hosts have 1.1.0-1+wmf2, a few have 1.0.2j-1~wmf1, most of the v3 have 1.0.2j-1~wmf1 with the exception of cp1067 (1.1.0-1+wmf2)
[13:43:17] I'm in the process of fixing that
[13:43:25] they should have 1.0.2j-1
[13:43:29] ok
[13:43:52] it's just the CLI component though, so it's not hugely impactful
[13:44:08] sure. Let me know if I can help
[13:44:09] it affects a few things that use the CLI, like the ocsp updater and cert chain building scripts, etc
[13:44:12] but doesn't seem to have hurt them :)
[13:44:28] E: There are problems and -y was used without --force-yes
[13:44:34] #AptAwesomeness
[13:44:42] wtf does -y mean if it doesn't mean "force yes" :P
[13:45:22] "polite yes"
[13:45:27] :P
[13:47:13] Reports that say that something hasn't happened are always interesting to me, because as we know, there are Yes Yesses; there are things we know we know are Yes. We also know there are Yes Nos; that is to say we know there are some things we do not know are Yesses. But there are also No Nos – the ones we don't know we don't know are Nos. And if one looks throughout the history of our country an
[13:47:20] d other free countries, it is the latter category that tend to be the difficult ones.
[13:48:43] ( ^ https://en.wikipedia.org/wiki/There_are_known_knowns )
[13:49:29] ok all the caches' openssl packages are now at 1.0.2j-1~wmf1
[14:09:51] bblack: OK to merge https://gerrit.wikimedia.org/r/#/c/318320/ and https://gerrit.wikimedia.org/r/#/c/318519/? pcc output looks sane
[14:11:54] +1
[14:11:59] \o/
[14:11:59] maybe add a negative test too though?
[14:12:07] yes
[14:12:10] (that X-Carrier isn't set for some other random IP)
[15:01:37] today I discovered: the most acceptable syntax highlighting for VCL files is C, for VTC tests is perl
[15:27:33] FYI yesterday I got varnish stats to prometheus in codfw for maps and misc, stub dashboard https://grafana-admin.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats
[15:28:11] please improve it though, there's likely more interesting stats to have there
[15:31:03] godog: nice!
[15:31:31] nice! probably the cache type could be a good candidate for a template variable ;)
[15:33:21] yeah I haven't made up my mind yet if it makes sense on a single dashboard "overall" and then a separate dashboard to get a drilldown with more details
[15:33:30] anyways I guess we'll have to play with it
[15:35:44] there's a grafana bug open to be able to click one of the graphs or labels and redirect you to the drilldown
[18:47:09] bblack, ema: we have an issue with some tiles generated wrong on the last version of kartotherian
[18:47:37] bblack, ema: a fix is coming up, but we have polluted caches
[18:49:23] I'm not up to date at all on how to invalidate our varnish caches (yes, I know it is not a good idea to do that, but maps is probably one place where we still can)
[18:49:52] yeah we'd rather not just invalidate the whole cache
[18:50:01] is there not some other key we can go by?
[18:50:11] actually, only the zoom > 9 is polluted
[18:50:15] (or even a server Date: range or something?)
[18:50:48] in terms of URL https://maps.wikimedia.org/osm-intl/, with zoom in [9..18]
[18:51:00] it doesn't matter as much on maps because it doesn't have much traffic, but it's better to find better ways now than later :)
[18:51:13] yeah, definitely!
[18:51:41] is there some natural key we could've used with xkey to identify a bad run?
[18:52:07] (maybe some identifier/stamp for each run of tile updates kartotherian does or something?)
[18:52:17] I donno, just thinking out loud
[18:52:26] * gehel is thinking as well
[18:52:40] in this case, zoom level could be used
[18:52:51] well yeah, but even then we didn't break all the tiles at that zoom level
[18:53:05] (and even if we did in theory, they weren't all fetched since the breakage and thus in cache)
[18:53:26] yep, that's still too large.
[18:54:11] in this case, the issue is on tile rendering, which is done only when the tile is actually requested, so no way to link this to a particular tilerator run
[18:54:29] what caused the issue?
[18:54:56] (and how are they broken?)
[18:55:01] The issue is linked to the index calculation (the transformation from x/y coordinates to a linear index)
[18:55:31] so at least theoretically, we should be able to identify which ranges of that index are affected
[18:56:12] yeah but what *caused* the issue?
[18:56:29] and are they broken so bad that it's easier to identify the breakage? (like, the tiles are zero bytes)
[18:57:50] * gehel is looking for the example
[18:58:40] basically the tiles render the wrong location, so they still look like normal tiles, nothing technically obvious
[19:05:53] sorry it took time. Example of the issue: https://maps.wikimedia.org/#16/-24.9543/132.6781
[19:07:40] as I understand it, this is related to https://github.com/kartotherian/kartotherian-core/blob/master/lib/core.js#L192
[19:09:04] so bad karto version bump here, now reverted?
[19:09:57] maybe include that in a header
[19:10:15] X-KartoVersion: 0.0.19b1 or whatever
[19:10:22] then we could ban on that
[19:10:25] actually fixed, not reverted. (I have mixed feelings there)
[19:10:54] well either way, there was a version bump somewhere, and could have been a header identifying the version of software generating the output
[19:11:24] probably such a thing is too broad for xkey
[19:11:31] KartoVersion would make a lot of sense. And an upgrade of dependencies should require a new karto version
[19:11:37] but it would work as a very specific ban condition
[19:12:41] I was wondering if we already had that, but it does not seem so (I still don't know much about kartotherian)
[19:14:02] I don't see one in live header output
[19:14:16] nope, it does not seem to be there.
[19:15:01] bblack: the new kartotherian version is deployed.
[19:15:34] bblack: I have some notes from last time we had a similar issue, but I don't entirely trust them
[19:17:03] bblack: can we purge on URL patterns (anything starting with https://maps.wikimedia.org/osm-intl/1)? That is still a lot of tiles, but might make more sense...
[19:17:13] we can ban on URL pattern
[19:17:21] which is a little different than purge
[19:17:35] also, it makes a functional difference whether we ban on request attributes or response attributes
[19:17:46] (response attributes are way better, but usually don't include the request URL)
[19:18:04] how are they better?
[19:18:14] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 10media-storage: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2752759 (10fgiunchedi) p:05Normal>03High There's still significant traffic for 0px thumbs resulting in 500s, it'd...
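For context on the index calculation mentioned above: tile servers commonly flatten x/y tile coordinates at a given zoom level into a single integer. The sketch below shows one typical row-major encoding purely to illustrate the kind of transform being discussed; the actual kartotherian implementation is the core.js code linked at 19:07 and may well differ.

    # Illustrative row-major linearization of tile coordinates at a zoom level.
    # Not kartotherian's actual encoding; see the linked core.js for that.
    def xy_to_index(zoom, x, y):
        return y * (1 << zoom) + x

    def index_to_xy(zoom, index):
        size = 1 << zoom
        return index % size, index // size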
[19:19:43] so a varnish "ban" is a kind of purge that takes effect immediately for ongoing requests, even though it obviously can't immediately invalidate all of the matching objects in storage (as it takes a while to iterate all of the existing cache)
[19:20:10] if it was just 1 URL, it could be a simple purge of a single object, but the reason you use bans is to match many objects
[19:20:43] if the ban's condition refers only to attributes of the [response] object, then its logic can proceed something like this:
[19:21:12] 1) Immediately, implement a rule that applies to all new client fetches from the cache, where if the response from cache matches the condition, we ignore the cache item and treat it like a miss
[19:21:46] 2) Also, in the background, scan all cache storage and evict items matching the condition. When the scan is complete, we can remove the rule above from (1) because no such objects are left that pre-date the ban.
[19:22:08] but if the ban references attributes of requests... there's no way to scan the storage for request attributes, only response attributes
[19:22:20] so for those, it's more like:
[19:23:14] Ok, makes sense (as always)
[19:23:14] 1) Immediately, implement a rule that applies to all new client fetches from the cache, where if the *request* matches the condition and there's a fetchable object from cache which is older than the ban timestamp, we treat it as a miss and get a new one.
[19:23:40] 2) Also, in the background, basically wait until all objects in the cache are newer than the ban (total rollover of storage), then you can remove the rule from (1) above.
[19:23:51] so the rule persists much longer
[19:25:01] so a ban on response attribute obj.http.X-KartoVersion==123 will complete its work as soon as it's done scanning cache storage
[19:25:26] a ban on req.url ~ /abcd will complete its work after all storage has rolled over/expired and nothing is left in cache that's older than the ban
[19:25:37] in this specific case, since we will clear a very significant part of the cache anyway, does it make more sense to just kill all of it?
[19:25:45] (and then if you do those commonly, they pile up as many rules that have to be checked on every request)
[19:26:02] * gehel hopes that it stays an exception!
[19:26:38] varnish doesn't have any kind of runtime "cache-wipe" command or anything. when we really need to wipe caches, we can restart all the instances, but that's also a PITA and a lot of depool churn, etc
[19:26:57] in this case, it's probably simpler to ban on a universal response attribute
[19:28:24] maybe something like ban obj.status == 200
[19:28:49] I've never really tried to think what the most efficient condition is to ban all storage
[19:30:20] do you want to try it or want me to?
[19:32:12] https://wikitech.wikimedia.org/wiki/Varnish#How_to_execute_a_ban_across_a_cluster
[19:32:16] is basically what to do
[19:32:28] but cache_maps and obj.status == 200
[19:32:28] Honestly, it is already well past dinner time here... But I feel bad asking for your support each week on Fridays for maps...
[19:32:45] I can do it, I'm used to it :)
[19:33:37] gehel: we're still on eqiad as the primary site right?
[19:33:49] bblack: yes, that has not changed
[19:34:05] Oh actually that looks much easier than I thought!
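A small sketch of the two ban flavours described above, issued through varnishadm from Python. The obj.status expression is the one actually used in the commands below; the req.url pattern is only a hypothetical example of a request-attribute ban.

    import subprocess

    def ban(expr, frontend=False):
        # Issue a ban on the local varnishd instance via varnishadm.
        cmd = ['varnishadm'] + (['-n', 'frontend'] if frontend else []) + ['ban'] + expr
        subprocess.check_call(cmd)

    # Response-attribute ban: only obj.* fields, so the background scan can
    # walk storage, evict matching objects, and then retire the ban.
    ban(['obj.status', '==', '200'])

    # Request-attribute ban: req.* can't be evaluated against stored objects,
    # so the rule stays active (and is re-checked on matching requests) until
    # everything older than the ban has expired or been evicted.
    ban(['req.url', '~', '^/osm-intl/1'])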
[19:34:39] bblack: if you have not started, I'll try to do that ban
[19:34:54] (under your supervision)
[19:35:00] I already started :)
[19:35:39] 10Traffic, 06Operations, 10RESTBase, 06Services (doing): Restbase redirects with cors not working on Android 4 native browser - https://phabricator.wikimedia.org/T149295#2752824 (10GWicke)
[19:35:46] with slight unimportant modifications, my commands were:
[19:35:51] salt -b 1000 -v -t 10 -C 'G@cluster:cache_maps and G@site:eqiad' cmd.run 'varnishadm ban obj.status == 200'
[19:35:57] salt -b 1000 -v -t 10 -C 'G@cluster:cache_maps and G@site:codfw' cmd.run 'varnishadm ban obj.status == 200'
[19:36:03] salt -b 1000 -v -t 10 -C 'G@cluster:cache_maps and ( G@site:esams or G@site:ulsfo )' cmd.run 'varnishadm ban obj.status == 200'
[19:36:22] salt -b 1000 -v -t 10 -C 'G@cluster:cache_maps' cmd.run 'varnishadm -n frontend ban obj.status == 200'
[19:36:41] now the last one's done
[19:36:50] should be getting fresh data and it should start caching it
[19:37:30] let me just see if I understand that correctly...
[19:39:33] so we ban from the cache closest to the application first, so that we don't re-cache problematic tiles, and we do the frontends last...
[19:40:03] yeah so in general, every site has frontend and backend caching. the frontend is the outer-most (client-facing) edge, and pulls from the backends at the same datacenter
[19:40:24] and ulsfo pulls from codfw, codfw pulls from eqiad, esams also pulls from eqiad
[19:40:35] you have to make sure you ban from the bottom-most layer outwards
[19:41:13] there are a number of ways you could arrange the purge order that meets the requirements
[19:41:25] but that way seems to take the least thought and the fewest commands
[19:41:36] backends: eqiad, codfw, ulsfo+esams
[19:41:38] then all frontends
[19:44:33] bblack: I'll pack some special chocolate for you if you're coming to the all hands! Thanks a lot!
[19:46:49] np!
[19:52:57] 10Traffic, 06Operations, 10RESTBase, 06Services (doing): Restbase redirects with cors not working on Android 4 native browser - https://phabricator.wikimedia.org/T149295#2752861 (10GWicke) See also: {T149444}
[19:54:07] bblack: Pchelolo has been digging deep into CORS / redirect issues over the last few days, and the latest status of that discussion is at https://phabricator.wikimedia.org/T149295
[19:55:54] basically, browsers don't support redirects very well when using XHR and fetch APIs; a work-around we are considering is following redirects on behalf of clients in the Varnish layer, and adding a header indicating the resource location. Your input on this would be much appreciated.
[19:58:15] why do we need to redirect them in the first place?
[19:59:05] one source of redirects is #REDIRECT pages
[19:59:20] we'd like to avoid cache fragmentation on the target page
[20:00:09] there are others, like redirecting image description page requests from a project domain to commons
[20:00:21] ok
[20:03:22] I'm trying to remember the original problem of even normal mediawiki redirect articles
[20:03:42] user follows a link to a redirect title like https://en.wikipedia.org/wiki/Catala
[20:04:24] we served a 200 OK, with a link to the real article Catalan_language, and the contents of that article, and the extra "Redirected from" text
[20:04:56] yeah, and the location is updated with JS
[20:05:03] so basically while we internally call it a "redirect" page, from an HTTP perspective it's a unique URL with its own content that's almost exactly like the target page except for the "redirect from" text, and has a rel=canonical to the real place?
[20:05:23] and we've always had the duplicate content for normal desktop views
[20:05:27] (in the caching sense)
[20:05:58] we have been looking for a better way to do that, but when you follow a top-level redirect there is no clean way to discover where you came from
[20:06:21] so... when fetching via RB, we decided to actually do a 301 redirect and avoid the cache duplication, or something like that?
[20:06:33] things are changing now, with serviceworkers and SPA
[20:06:35] and then there were some hacks we had to build around that with ?redirect=(true|false)
[20:08:00] right ok, I'm looking now
[20:08:09] it was a 302, which we transform into a 200 if redirect=false
[20:08:29] yeah, that's for the REST API
[20:08:37] that part works fine
[20:09:22] GET /api/rest_v1/page/mobile-sections/User%3APchelolo%2FRedirect%20Test?redirect=true
[20:09:28] there are several things we are considering, which are mostly orthogonal
[20:09:37] so this example is fetching the above, and then not liking the 302 in the browser?
[20:09:51] the first is adding a `Content-Location` header, so that clients can figure out whether there was a redirect
[20:10:04] (can't the browser look at rel=canonical?)
[20:10:18] that's assuming it parses the HTML
[20:10:25] also, why is this CORS-related?
[20:10:27] in a SW, you look at a byte stream
[20:10:43] isn't the 302 to another article on the same site?
[20:11:13] the interaction between CORS and redirects is extra problematic, because of an earlier spec bug
[20:11:37] the spec said to just "go to error" in case of pre-flighted CORS redirects
[20:12:00] now fixed, but browsers haven't implemented that yet
[20:12:05] why is it considered a CORS redirect?
[20:12:13] the location doesn't even have a hostname in it
[20:12:21] oh, it's just a redirect, but the access is via CORS
[20:12:24] < HTTP/2 302
[20:12:25] < content-length: 0
[20:12:26] < location: User%3APchelolo
[20:12:40] why is the initial access via CORS?
[20:13:03] it is not in production, but a lot of the prototypes are not hosted on the main domain, so use CORS
[20:13:15] so basically this isn't a production problem :)
[20:13:38] even production ones are likely to run into this at some point when pulling down resources from multiple projects
[20:13:45] it's less common, though
[20:14:18] the bottom line is that redirects are icky, and especially so when the request is via CORS
[20:14:41] so we are looking for a way to have our cake & eat it too
[20:14:49] the only reason we don't handle this problem the way MW does is cache pollution?
[20:15:07] well, mw disables caching altogether, doesn't it?
[20:15:24] I hope not, that would be awful for commonly-linked redirect pages
[20:15:43] yeah it doesn't
[20:17:12] ah, then I guess it explicitly purges each of those
[20:17:58] this must be at least the second time I am trying to figure out how this works
[20:18:58] in RB, we needed to expose content for both the redirect page and the redirect destination
[20:19:03] it probably just doesn't bother purging them, for all I know
[20:19:31] the HTTP redirect works well for that, it just needs a bit of work to make it more usable in a browser context
[20:19:59] well even the MW version of this exposes both
[20:20:14] via rel=canonical and whatever js uses to reset the location bar
[20:20:38] yeah, but the caching is kind of icky
[20:20:41] can js read arbitrary http response headers?
[20:20:56] it can, if you whitelist those headers
[20:21:07] whitelist them with what?
[20:21:20] and it can't read the URL (legacy security issue), but could read something like `content-location`
[20:21:41] the cors expose-headers header
[20:22:10] so in the non-CORS case, it can just read them all?
[20:22:24] Access-Control-Allow-Headers
[20:22:28] right
[20:22:43] yeah, but not the URL
[20:22:47] seems like at the very least, we could clean up MW's version of this problem with that
[20:23:01] but that's neither here nor there
[20:23:17] really the whole model is borked, the "redirected from" thing
[20:23:56] looking at the MW version of the problem again
[20:24:01] supporting this in a web app wouldn't be too hard once we add something like `content-location`
[20:24:33] however, it wouldn't help the current MW frontend, which reloads the entire page & doesn't use ServiceWorkers
[20:25:31] and we'll always have a variant of that model for first view
[20:25:42] why wouldn't we just eliminate the annoying "redirected from", have redirect-articles return a 302 with a reasonable cache lifetime, and then maybe in the main article, have a section that says "Aliases this is redirected from" listing them (if we even bother)
[20:25:57] and of course editing a redirect can always be handled differently, but this is about reading
[20:26:37] yeah, I think that direction makes a lot of sense
[20:26:39] (reasonable cache lifetime for varnish, I mean. still purgeable)
[20:26:54] the non-redirect redirects are just kind of awful, design-wise
[20:27:05] for modern clients we could even preserve the "redirected from" feature client-side
[20:27:22] without compromising caching
[20:27:27] transposing this over to the RB world and apps/SW though, the 3xx still presents your CORS problem either way
[20:27:46] do we ever do cross-project redirect pages anyways? what's the usecase for that?
[20:28:18] we currently redirect to commons for specific image description pages
[20:28:36] can you show me an example?
[20:28:58] basically any description that only lives on commons
[20:29:00] I think it was just a link to commons
[20:29:03] *thought
[20:29:12] no, some images are local per-wiki
[20:29:23] so we check for that first, and as a fall-back check commons
[20:29:28] if that's found, we redirect
[20:29:33] we redirect... what?
[20:29:40] like, what URL returns a redirect?
[20:29:40] in the REST API
[20:29:48] no, I mean desktop
[20:29:54] regular MW html
[20:30:01] no, there is no redirect there
[20:30:09] ok
[20:30:10] it's inlined, and (probably?) purged
[20:30:18] probably not :)
[20:30:23] ok
[20:30:24] hehe
[20:30:54] so, the wiki the description lives on can be determined from the upload path
[20:31:09] https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Greuges_de_Guitard_Isarn.jpg/200px-Greuges_de_Guitard_Isarn.jpg
[20:31:12] ^ description lives on commons
[20:31:43] right, that's what we do for descriptions that aren't local
[20:32:16] each description for each non-local image is a separate REST API fetch? which may then redirect?
[20:33:04] clients don't know where the description lives
[20:33:22] aren't clients given that same image URL I linked above?
[20:33:42] yeah, but that's not a public API
[20:34:03] put an attribute on the links as a new public API?
[20:34:05] we have always told users that the path should be treated as an implementation detail, and not as an API
[20:34:08] subject to change at any time
[20:34:20] ? :)
[20:34:26] s
[20:34:44] sure, it's possible to work around any of those things
[20:34:48] I donno, just thinking there must be a way to send the metadata without redirects
[20:35:12] but, whenever you inline, you
[20:35:19] 'll have to deal with the invalidation issues
[20:35:43] so when that description is created locally, you better have a way to purge all those pages that inlined that hint
[20:35:55] we're not inlining the description itself in this case, just the decoding of the image URL -> source wiki domain
[20:36:07] right, and that can change
[20:36:11] by creating a local description
[20:36:25] local to.. the article?
[20:36:32] the project
[20:36:49] so each project can have its own local description of the same commons image they're linking?
[20:36:50] let's say we start with image:foo.png only available on commons
[20:36:55] we inline that hint everywhere
[20:37:06] now somebody uploads image:foo.png on project X
[20:37:19] all pages in project X with that hint now need to be updated
[20:37:21] that's a distinct image
[20:37:28] it's not a new description of the same image
[20:37:41] I'm not sure about the description-only case
[20:37:54] I don't think it's possible to override that only, without uploading an image
[20:38:07] I don't know how you pull this back to the general case of external wiki deployments
[20:38:43] but at least in the WMF case, it's either a truly-local image (is that even possible outside of RL?), or it's an upload.wikimedia.org image link which explicitly encodes what the source wiki is in the image pathname.
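To make the point above concrete: the upload.wikimedia.org path already names the wiki the original (and hence its description page) lives on. A hedged sketch of the kind of regexp match that comes up a bit further down; the pattern is a simplification for illustration only, not a documented or stable API.

    import re

    # Matches both original and thumb paths, e.g.
    #   /wikipedia/commons/0/04/Greuges_de_Guitard_Isarn.jpg
    #   /wikipedia/commons/thumb/0/04/Greuges_de_Guitard_Isarn.jpg/200px-...
    UPLOAD_PATH_RE = re.compile(
        r'^/(?P<project>[^/]+)/(?P<wiki>[^/]+)/(?:thumb/)?'
        r'[0-9a-f]/[0-9a-f]{2}/(?P<name>[^/]+)'
    )

    def source_wiki(path):
        m = UPLOAD_PATH_RE.match(path)
        # e.g. ('wikipedia', 'commons') or ('wikipedia', 'en'), else None
        return (m.group('project'), m.group('wiki')) if m else None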
[20:38:51] even if you need to upload a separate image, it'll be hard to cleanly update every json blob that ever embedded the hint that this image lives on commons
[20:39:10] but you have to do that anyways, to change the link
[20:39:19] when the link changes, the description source implicitly changes
[20:39:39] the link doesn't necessarily change
[20:39:45] I think it has to
[20:39:58] the thumb URL does, but the image name doesn't necessarily
[20:40:29] the original URL does too, every URL does
[20:40:44] that they share the terminal image name is kind of irrelevant, that's not a valid "key" without knowing the wiki it lives on
[20:41:21] https://upload.wikimedia.org/wikipedia/commons/0/04/Greuges_de_Guitard_Isarn.jpg != https://upload.wikimedia.org/wikipedia/en/0/04/Greuges_de_Guitard_Isarn.jpg
[20:41:46] they could be completely-unrelated to each other for different purposes entirely
[20:41:57] well, in any case right now those thumb URLs aren't enough to figure that out
[20:42:13] it is
[20:42:23] the name of the project and language-subdomain is embedded in the image path
[20:42:26] if you treat them as an api, sure
[20:42:37] it is, effectively, our API. just document it :)
[20:42:41] and then do some regexp match
[20:42:56] it'll also break once we move to hashed thumbs
[20:43:04] since those don't even care about projects
[20:43:05] but if that seems ugly with the regexp parsing of path fields, etc
[20:43:29] thumbs do care about projects, or else an image with the same name on two projects which is completely unrelated would collide
[20:43:47] hashes are on the original content, and don't care about names
[20:43:53] anyways, if it seems ugly, then re-encode it as metadata from the same source
[20:44:19] ok, if you're making the original's hash the key for thumbnails, then unrelated images are still unrelated?
[20:44:37] they could resolve to the same thumb url
[20:44:45] if you re-upload the same original locally, for example
[20:44:50] seems like you still need the project name in there though, because apparently part of the image is its description, which might differ for identical files uploaded to two wikis :)
[20:45:15] it seems insane to have identical thumb paths for two thumbnails which have different descriptions
[20:45:16] the thumb URL is not a link to its description
[20:45:38] so, given a thumb URL that's just a hash without a wiki name, how do you find the description?
[20:45:44] we have separate links for those
[20:45:59] in HTML contexts, at least
[20:46:00] ...
[20:46:05] but not in some others
[20:46:31] anyway, we have moved from "how to deal with redirects" to "how to avoid all redirects"
[20:46:44] which imho is not really useful / realistic
[20:46:50] in the current schema of things that's actually-live, Foo.png can exist in several wikis independently, and even if it happens to be duplicate file contents, they still have unique description metadata
[20:47:29] if the new scheme makes it such that Foo.png on 10 wikis is indistinguishable as a thumb URL, maybe the thumb URLs need wiki names
[20:48:04] and no, this isn't "how to avoid all redirects", it's "how to avoid cross-project redirects", which don't seem very useful
[20:48:04] the thumb issue is just an example of a use case that currently uses redirects cross-project
[20:48:24] the other part of this discussion was about using real http redirects in place of fake content-duplicating ones within a project
[20:48:46] there are other cross-origin use cases on the horizon, such as surfacing pageview data in content pages
[20:48:47] my argument is I don't suspect cross-project redirect-metadata makes sense
[20:49:11] why would that be a cross-project redirect and not just a cross-project fetch?
[20:49:45] more generally, a lot of use of our APIs is from outside our own projects, so often using CORS
[20:50:06] ... and apparently CORS breaks HTTP because redirects aren't usable...
[20:50:12] so any redirect is potentially a redirect in a CORS context
[20:50:40] well, a local redirect (same site), which happens as a result of a CORS fetch, is different from an actual cross-site redirect
[20:50:56] if the former is broken for CORS clients.... well, it's broken...
[20:51:27] hehe
[20:51:47] so now we're re-implementing a new kind of redirect semantics separately from HTTP because HTTP redirects are broken?
[20:51:53] it just seems like an ill-advised path
[20:52:45] it is the best we could come up with that keeps caching sane, and provides good performance
[20:53:08] it should actually outperform a client following the redirect itself
[20:54:13] if we didn't have to deal with specs designed around legacy browser bugs, this would all be nicer and cleaner..
[20:54:37] or we can just take the MW route and return 200+content+redirect-metadata
[20:54:44] and accept the cache pollution
[20:54:54] and accept that we'll need to fix up purging in that case too
[20:55:06] (by, I guess, emitting the canonical article title xkey for the redirect content)
[20:55:16] compared to following redirects, that seems a lot more complex and expensive
[20:55:30] the MW html dataset size is relatively tame, it's not like it doesn't fit in caches today, even with the duplication
[20:55:51] sure, but the purging is not simple or cheap
[20:55:58] sure it is
[20:56:09] you already know the canonical title when you emit the duplicate. put that in the xkey header
[20:56:32] (I imagine that's what we'll do to improve this situation for MW html output with xkey, too)
[20:57:11] maybe you need 2x xkeys emitted in that case, since a change to the original or a change to the redirect metadata could invalidate that object
[20:57:39] well, for one this means that we'll always have to send all purges to all varnishes
[20:57:52] we have to do that anyways
[20:57:53] which seems like a liability
[20:58:03] how else?
[20:58:53] [meeting starting, sorry]
[20:58:56] we might eventually come up with a partitioning scheme
[20:59:32] unlikely. if anything the other way around
[20:59:57] famous last words ;)
[21:00:10] we'll see
[21:00:19] we chash in backends, the frontends don't
[21:00:29] the chashing can change though, it's not something we'd route purges on
[21:00:38] (since the content lives on after routing changes)
[21:01:02] the same problem has been solved successfully in storage systems
[21:01:06] but the storage set size for mw content isn't huge. it's not an issue worth partitioning
[21:01:09] with replication
[21:01:17] this isn't a storage system, it's a cache
[21:01:42] right, but you can apply the same principles to ensure overlap in purges after topology changes
[21:01:51] but in any case, there's not a problem to be solved here. what's "the same problem"?
[21:02:26] the purge traffic going to all nodes? it isn't fundamentally an issue
[21:03:01] maybe
[21:03:15] in any case, I'm curious which issues you see with following some redirects in the Varnish layer
[21:03:44] if that's a general no-go, we can always punt & fragment caches
[21:04:44] I'd like to give it proper consideration, though
[21:50:05] gwicke: re following: it's probably possible, but it's tricky on a couple of levels. we'd have to restart the backend transaction like we'd do for an error, and do some complex VCL hacks around rewriting the request, and I'm not sure if that late in the game we can still rewrite the cache slot to solve the fragmentation problem anyways.
[21:51:09] gwicke: but in general, my long-term hope is our caching layer can get less complex and hacky, and more just a transparent cache layer obeying reasonable standards.... it pains me anytime I see us wanting to add more custom complexity into it.
[21:53:13] not that we'll ever escape such things, it's just an asymptotic goal that it's desirable to aim towards
[21:53:40] we'll always need the capability to do custom hacky things at this layer. we just need to be more judicious in using that power.
[23:06:00] agreed on limiting complexity, but often it's a question of where to put the complexity, and whether we can at least solve similar cases using the same mechanism once we have paid the price
[23:06:59] both the xkey + fragmentation option as well as generic redirect following can be such general solutions, with slightly different trade-offs
[23:08:13] one concern I'd have with redirect following is loops
[23:10:44] I think, as a first step that's orthogonal to the redirect discussion, we'll add a content-location header
[23:11:45] that will resolve the issue for non-CORS users
[23:16:30] and will be useful even if we go with the cache fragmentation option
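What the Content-Location idea above could look like from a client-facing proxy's point of view, in rough Python using the requests library. This is illustrative only, not the RESTBase or Varnish implementation: the point is that when a same-origin redirect is followed server-side, the client can still discover the final resource location from an exposed header instead of having to follow the redirect itself.

    import requests

    def fetch_following_redirects(url):
        # Follow same-origin redirects on behalf of the client.
        resp = requests.get(url, allow_redirects=True)
        extra = {}
        if resp.history:  # at least one redirect was followed upstream
            # Tell the client where the content actually came from ...
            extra['Content-Location'] = resp.url
            # ... and make that header readable from cross-origin JS.
            extra['Access-Control-Expose-Headers'] = 'Content-Location'
        return resp.status_code, extra, resp.content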