[00:03:43] <wikibugs_>	 10Traffic, 10Varnish, 06Discovery, 06Operations, 06Maps (Tilerator): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#1558879 (10Pnorman) I'd rather see `max-age` significantly reduced and `stale-while-revalidate` set to the current `max-age` value. This avoids the need to...
[11:06:40] <wikibugs_>	 10netops, 06Operations: Audit and cleanup border-in ACL on core routers - https://phabricator.wikimedia.org/T160055#3087244 (10mark)
[12:44:47] <moritzm>	 ema, bblack: https://phabricator.wikimedia.org/T154934#3087501 ok to also upgrade cp1008 next?
[12:58:01] <gehel>	 ema: We still have some wrong tiles on maps after T159631. If I read the docs correctly, I should be able to invalidate with https://phabricator.wikimedia.org/P5035
[12:58:01] <stashbot>	 T159631: Tasmania is covered with water at z10+ - https://phabricator.wikimedia.org/T159631
[12:58:23] <gehel>	 ema: could you have a look and confirm that this will probably not break things in horrible ways?
[13:02:30] <elukey>	 gehel: qq about line 3 - would it be better to do esams and ulsfo separated as you are doing for eqiad and codfw in line 1/2?
[13:02:40] <elukey>	 (not sure if needed, trying to review :)
[13:03:09] <gehel>	 our doc propose to do both at the same time : https://wikitech.wikimedia.org/wiki/Varnish#Execute_a_ban_on_a_cluster
[13:03:22] <gehel>	 elukey: thanks for looking!
[13:03:30] <elukey>	 well you have -b 1
[13:03:38] <elukey>	 doesn't matter sorry :)
[13:04:09] <gehel>	 no problem! you're taking the time to review, I'm not going to complain!
[13:04:58] <elukey>	 gehel: line 3 -  'req.url ' \" ?
[13:05:43] <gehel>	 typo, fixed. Thanks! Good eyes!
[13:06:32] <elukey>	 lgtm! Not sure about the regex though, better wait a bit for master ema :)
[13:07:26] <elukey>	 maybee it could be good to add req.http.host if possible
[13:08:06] <elukey>	 even if /osm-intl is probably not belonging to anything else :)
[13:08:13] <gehel>	 for the regex, I'm trying to match tiles in Tasmania, like https://maps.wikimedia.org/osm-intl/10/928/642.png (osm-intl/{zoom}/{y}/{x}.png)
[13:08:24] <elukey>	 ah ok nice
[13:08:52] <gehel>	 the cache-maps cluster only serves maps, so we should probably be good
[13:09:05] <elukey>	 true :)
[13:09:11] <gehel>	 and actually, I should probably ban not only png but all formats...
[13:09:48] <elukey>	 the regex seems sane
[13:10:12] <elukey>	 maybe you could target a single Varnish maps backend first
[13:10:16] <elukey>	 say in ulsfo
[13:10:21] <elukey>	 double check that everything is good
[13:10:24] <elukey>	 and then proceed
[13:10:35] * gehel does not like touching varnish... it looks scary
[13:10:45] <gehel>	 elukey: sounds like a good plan...
[13:11:11] <elukey>	 worst that can happen is that you'll execute locally sudo -i depool
[13:11:34] <elukey>	 but it would be good to see metrics for one host about the effect of the purge
[13:12:31] <gehel>	 we still did a full cache wipe not too long ago and even with a cold cache we were able to not have issues. We have a bit more traffic now, but not all that much
[13:12:52] <gehel>	 and at zoom level 10, I'm only invalidating a few 100's tiles
[13:14:49] <elukey>	 oh yes yes I am throwing out all the paranoid things that I can think of
[13:15:09] <elukey>	 anyhow, if it is not urgent ema will be probably back in a few
[13:15:26] <gehel>	 good! That's the goal of asking for a review! And I'm making sure that all the checkboxes are checked...
[13:15:46] <gehel>	 it isn't that urgent
[13:33:57] <ema>	 moritzm: ok to upgrade cp1008
[13:36:26] <moritzm>	 k
[13:37:47] <ema>	 gehel: seems reasonable, have you tried that out on a single host first?
[13:38:09] <gehel>	 ema: not yet
[13:38:12] <ema>	 oh I see that good old elukey suggested that already :)
[13:39:47] <gehel>	 ema: other question, we seem to have a TTL limit of 1 day on maps cache. That's to total limit? It is not backend 1 day + frontend 1 day ?
[13:40:39] <gehel>	 I'm asking because we still have some wrong tiles in cache that should have been expired already tuesday evening...
[13:43:20] <gehel>	 ok, I'll start this ban, we'll see...
[13:45:22] <gehel>	 damn, invalid backslash sequence...
[13:45:26] * gehel needs new glasses
[13:50:07] <gehel>	 ok, found it
[13:59:03] <gehel>	 ok, there is something wrong (probably with my regex)
[13:59:21] <elukey>	 gehel: things not banned?
[13:59:22] <gehel>	 https://maps.wikimedia.org/osm-intl/10/929/644.png still return "Age: 74730"
[14:00:13] <gehel>	 I did test on cp4020 and I did see X-Cache changing from a hit to a miss for cp4020...
[14:00:41] <elukey>	 gehel: so you tried manually on one host, it worked, then you did on the whole maps but nothing?
[14:01:10] <gehel>	 I would not say nothing, but at least the URL above does not seem to have been banned
[14:02:57] <ema>	 gehel: on cp4020 I only see the ban on the frontend
[14:03:23] <ema>	 journalctl -u varnish --since=today|grep ban
[14:03:32] <ema>	 journalctl -u varnish-frontend --since=today|grep ban
[14:03:35] <gehel>	 my bad, mixed up the front and back...
[14:03:55] <gehel>	 starting again...
[14:04:00] * gehel is learning today...
[14:05:51] <ema>	 echo ban.list | varnishadm -n frontend
[14:05:56] <ema>	 to see the current bans ^
[14:06:39] <elukey>	 see? The master explained :)
[14:08:19] <gehel>	 ok, this time it seems to have worked. Thanks for the help!
[14:08:46] <gehel>	 Now I need to understand why we were still generating wrong tiles 24h ago (or why they were cached for longer than 24h)
[14:09:37] <ema>	 gehel: do you have an example we can look at?
[14:10:06] <gehel>	 at lower zoom level probably
[14:10:35] <gehel>	 https://maps.wikimedia.org/osm-intl/11/1859/1285.png
[14:10:59] <gehel>	 Age: 75397
[14:11:52] <gehel>	 the tile currently generated by kartotherian has land instead of water 
[14:13:09] <gehel>	 kartotherian *should* have been rendering that tile correctly since ~1am UTC on March 7
[14:13:29] <gehel>	 or at least since 7am UTC March 7
[14:14:36] <gehel>	 kartotherian codfw is still rendering the tiles wrong (with water), but we should not have any traffic served from there, right?
[14:17:58] <gehel>	 kartotherian is not setting any cache-control headers, except ETag (which is not consistent between the nodes, we should most probably drop it)
[14:18:39] <_joe_>	 lol
[14:18:54] <_joe_>	 we don't set headers, but when we set them, they're wrong :P
[14:18:54] <gehel>	 and a last modified at epoch(0), mostly useless (and wrong) as well
[14:19:25] <gehel>	 where would be the fun if everything just followed the expected path?
[14:19:30] <gehel>	 :/
[14:19:48] <_joe_>	 gehel: well the application needs to be fixed
[14:19:57] <_joe_>	 pretty simple!
[14:20:25] <gehel>	 agreed, but that still does not explain why my wrong tiles are still in cache...
[14:41:33] <_joe_>	 oh sure, you weren't expecting me to make a constructive contribution, did you?
[14:50:41] <gehel>	 _joe_: you had me worried for a minute :)
[17:02:52] <wikibugs_>	 10netops, 06Operations, 10ops-codfw: codfw: oresrdb2002 switch port configuration - https://phabricator.wikimedia.org/T160087#3088288 (10Papaul)
[17:15:31] <elukey>	 bblack: Hi! I'd need some help on cache:misc if you have time
[17:15:43] <elukey>	 otherwise I'll follow up with ema tomorrow :)
[17:16:28] <bblack>	 elukey: depends what kind of help?
[17:17:03] <bblack>	 if this is the "can we source hash clients for wdqs" thing that I skimmed the other day, the answer is no, we're probably not going to support that
[17:17:17] <bblack>	 (at the cache layer)
[17:17:45] <bblack>	 (at this point in time in the broad sense.  meaning maybe we will a yea ror two from now when we've significantly changed lots of things, but who knows!)
[17:18:08] <bblack>	 right now, it's just not a reasonable thing, it causes too many edge cases for too many other efforts
[17:21:51] <elukey>	 bblack: no idea about wdqs but it should be simpler :)
[17:22:04] <bblack>	 ok!
[17:22:19] <elukey>	 so I removed the piwik backend probes with https://gerrit.wikimedia.org/r/#/c/342007/
[17:22:34] <elukey>	 from cache misc but I still see them on bohrium, and tcpdumps confirms..
[17:22:43] <elukey>	 (from hosts in which puppet already ran)
[17:23:02] <bblack>	 hmmm
[17:23:13] <elukey>	 so it might be needed a restart of the backends? I hoped that it was only a VCL restart like when we added them..
[17:23:17] <bblack>	 well let's dissect this a bit...
[17:23:36] <bblack>	 first, they should only be coming from a small handful of cache backends.  like, 8 of them? 4 in eqiad 4 in codfw?
[17:23:46] <bblack>	 I don't think we probe applayer from the edge pops right?
[17:24:01] <bblack>	 but I could be wrong
[17:24:32] <bblack>	 ah ok, we do probe them from the edge sites too (pointlessly), I can see that now
[17:24:36] <elukey>	 this was what I thought as well, but then I saw probes from uslfo!
[17:24:40] <bblack>	 still, 16 hosts doing what should be a lightweight probe query is the problem?
[17:25:00] <bblack>	 e587660c-1f1a-4720-83a7-031466db51a5.be_bohrium_eqiad_wmnet probe      Healthy (no probe)
[17:25:09] <bblack>	 ^ that's cp4001 in ulsfo, not currently probing
[17:25:39] <bblack>	 but really, if the probes are failing requests are failing too.  it seems a little backwards to say "stop looking at failure rates, because it's failing and we want to use the failing thing anyways"
[17:26:31] <elukey>	 my theory is that it exacerbate the issue, but I could be wrong, and since the probe was originally meant to give us more info about the only backend I thought to try
[17:27:05] <bblack>	 well there's lots of things to look at here
[17:27:19] <bblack>	 but maybe the first point is: is the probe request URI actually lightweight? (it should be)
[17:27:36] <bblack>	 (and if so, it shouldn't be the cause or exacerbater)
[17:27:36] <elukey>	 re: cp4001
[17:27:37] <elukey>	 17:27:05.062795 IP cp4001.ulsfo.wmnet.8475 > bohrium.eqiad.wmnet.http: Flags [P.], seq 1:74, ack 1, win 58, options [nop,nop,TS val 672769348 ecr 3010942158], length 73: HTTP: GET /piwik.php HTTP/1.1
[17:27:54] <bblack>	 are you sure that's not a real request?
[17:28:19] <elukey>	 well I see the IPs in the bohrium apache logs, I can check for cp4001 to be sure
[17:28:50] <bblack>	 "the IPs?"
[17:29:20] <bblack>	 I get that that tcpdump output means cp4001 requested /piwik.php, but was it a probe, or was it forwarded user traffic?
[17:29:50] <bblack>	 all cache_misc globally (checked with salt) at least claim in varnishadm to know of know probe for bohrium, as expected.
[17:30:00] <bblack>	 if an old VCL was stuck still probing, we'd see it there
[17:30:54] <elukey>	 the IPs of the cp misc hosts in the apache logs as requesters
[17:31:02] <elukey>	 and I can still see them
[17:31:24] <elukey>	 they are the only ones using "http://bohrium.eqiad.wmnet/piwik.php" and not the piwik domain
[17:32:07] <bblack>	 hmmm
[17:32:13] <bblack>	 ok
[17:32:22] <bblack>	 (also, duh, cp4001 should never route requests directly anyways)
[17:32:43] <elukey>	 I think it only probes the backend, could it be possible? 
[17:32:59] <bblack>	 yeah the probe exists just because the backends are defined there pointlessly
[17:33:09] <bblack>	 but you deconfigured the probe, and varnish claims it reloaded VCL and is now running no probe...
[17:34:28] <bblack>	 so, clearly, varnish must have an issue where it keeps probing afterwards
[17:34:55] <elukey>	 bblack: would it be possible to try a backend restart in misc to see if it goes away?
[17:35:09] <bblack>	 well we can, but I still really question whether we should be removing probes here
[17:35:23] <bblack>	 it was meant to be an improvement, and an example we were going to use elsewhere too to get better health state
[17:35:49] <bblack>	 why is the probe URL causing heavy load? it should be something light just to confirm basic connectivity?
[17:35:56] <bblack>	 and the req rate can't be that high from 16 varnishes can it?
[17:36:39] <_joe_>	 bblack: it's piwik, it's a kind of magic(TM)
[17:36:44] <bblack>	 it just seems strange to me that overload from probes is what's taking a service down
[17:37:01] <bblack>	 and removing the probes somehow makes life better for real clients?
[17:37:04] <elukey>	 I think it is a bit more subtle, but I could be wrong (this is why I want to test)
[17:37:32] <elukey>	 there are no 50X on bohrium, and after inspecting failed probes via varnishlog it seems that they timeout
[17:37:50] <bblack>	 e587660c-1f1a-4720-83a7-031466db51a5.be_bohrium_eqiad_wmnet probe      Healthy (no probe)
[17:37:54] <bblack>	 err wrong paste
[17:37:58] <bblack>	 https://www.varnish-cache.org/trac/ticket/1061
[17:38:30] <bblack>	 elukey: yeah, but why are probes timing out? that's still a valid "failure" from any perspective upstream of bohrium
[17:39:53] <elukey>	 bblack: didn't find a good explanation from the logs itself (apache on bohrium and varnishlog) but I suspect that during high load the bohrium apache threads gets to the point in which they queue probes for more than necessary, causing Varnish to timeout..
[17:40:10] <elukey>	 and blocking traffic up to the point in which bohrium passes the probes
[17:40:23] <elukey>	 I saw a lot of them coming from all the hosts (16) at constant rate
[17:40:35] <bblack>	 ok you have to break that down a bit for me, since I have no idea what half of this does
[17:40:44] <bblack>	 "during high load" = during high real-client load?
[17:41:25] <bblack>	 in any case, I'm looking at the VCL staleness thing
[17:41:37] <bblack>	 it has cold VCLs from boot time listed, and 1x warm/busy and the active one with no probe
[17:41:44] <elukey>	 sure I am going to try to explain :)
[17:42:07] <elukey>	 I consider high load when apache workers are growing up too much  - https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=bohrium.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1489081251&v=250&m=ap_busy_workers&vl=threads&ti=Busy%20Threads&z=large
[17:42:15] <bblack>	 available  auto/busy          8 root:12c25f88-4e7b-4b3a-98ed-52a629200347
[17:42:17] <elukey>	 it is a ganeti host with 8 virtual cores
[17:42:18] <bblack>	 active     auto/warm         34 e587660c-1f1a-4720-83a7-031466db51a5
[17:43:12] <elukey>	 and the set up is apache with prefork and mod_php.. 
[17:43:31] <elukey>	 I have scheduled tasks for next quarter to refactor puppet/apache/etc..
[17:48:46] <_joe_>	 elukey: it's LOLPHP then
[17:49:22] <elukey>	 the service definitely needs some wikilove :)
[17:49:35] <_joe_>	 you'll get plenty for me
[17:49:39] <_joe_>	 *from
[17:50:17] <elukey>	 Did I miss the <sarcasm> tags by any chance? :D
[17:50:47] <bblack>	 well anyways, I think the most-probable scenario for the lingering probes is that "busy" old VCL
[17:51:10] <bblack>	 backend.list is only showing the no-probe stuff for the "current" VCL, but the previous VCL is still busy too, and it holds the old probes
[17:51:31] <bblack>	 because there's some lingering long-term client connections there that haven't closed, for some cache_misc service
[17:52:47] <bblack>	 I tried discarding the busy one, but it still shows "discarded/busy".  did the cp4001 probe stop?
[17:57:47] <elukey>	 checking
[17:58:27] <elukey>	 seems to still hit bohrium
[17:58:52] <bblack>	 cp4001 does, right?
[17:58:58] <bblack>	 in which case, discarding busy does nothing
[17:58:58] <elukey>	 yep yep
[17:59:29] <elukey>	 I can re-check tomorrow morning if things have changed or not
[17:59:36] <elukey>	 I mean, it is not really super urgent :)
[18:00:00] <bblack>	 let me see how many there are that are like that
[18:01:55] <bblack>	 yeah apparently the auto/busy thing is quite common in cache_misc, probably because we have some apps that use very persistent conns
[18:02:33] <bblack>	 cp1051 is particularly bad:
[18:02:35] <bblack>	 root@cp1051:~# varnishadm vcl.list
[18:02:35] <bblack>	 available  auto/cold          0 boot
[18:02:35] <bblack>	 available  auto/cold          0 0953b788-8f2b-4714-954b-c3496fc98ec7
[18:02:35] <bblack>	 available  auto/cold          0 root:e3f812cd-dc6b-4eb3-824f-95ef6ef53d81
[18:02:38] <bblack>	 available  auto/busy          4 root:1c270f75-90cf-4329-b1ec-8d8261a4795a
[18:02:41] <bblack>	 available  auto/busy          2 e24e881f-99a7-41ad-aa8d-cfb0c0b9b073
[18:02:44] <bblack>	 available  auto/busy          8 038e7cde-017c-46e4-ae7e-4ca719d985bc
[18:02:46] <bblack>	 available  auto/cold          0 root:90ea43e9-9a14-4098-bd63-5bf2d7e02ce1
[18:02:49] <bblack>	 available  auto/busy         30 root:5017bd1d-e2e4-4a45-9dfa-65cb1a0e4f87
[18:02:52] <bblack>	 active     auto/warm        180 befd352f-e914-442b-992c-5ebcdb5c6ed8
[18:02:55] <bblack>	 it has 4x old VCLs still busy with old conns
[18:03:07] <bblack>	 I'll try restarting cp4001
[18:04:41] <bblack>	 elukey: cp4001 probes gone now?
[18:04:45] <elukey>	 checking
[18:05:11] <elukey>	 can't see them anymore :)
[18:06:36] <bblack>	 ok
[18:06:48] <bblack>	 I kicked off a salt to slowly restart all the others with busy threads
[18:06:58] <bblack>	 I think it will take something on the order of ~10-15 mins
[18:07:34] <elukey>	 thanks a lot, it is perfect
[18:07:56] <elukey>	 so I am going to report back about the result of my experiment
[18:08:53] <elukey>	 (brb in a bit, commuting)
[18:12:18] <wikibugs_>	 10netops, 06Operations, 10ops-codfw: codfw: oresrdb2002 switch port configuration - https://phabricator.wikimedia.org/T160087#3088549 (10RobH) 05Open>03Resolved switch port updated and committed, resolving task  ```  robh@asw-c-codfw# show | compare  [edit interfaces interface-range vlan-private1-c-codfw...
[18:20:01] <bblack>	 elukey: they're all done
[18:26:31] <moritzm>	 cp1008 upgraded to 4.9.13
[18:35:51] <elukey>	 bblack: confirmed, probes gone! thanks!
[18:45:33] <bblack>	 the whole thing still seems off to me, fwiw, but that's neither here nor there
[18:45:48] <bblack>	 probes shouldn't be more traffic/load than users, or the reason users are failing
[18:59:04] <elukey>	 bblack: you are completely right, in fact this is an experiment to see if there is any variation in the piwik 503s. The backend is definitely not working fine and needs to be refactored and improved, it currently have a lot of limitations (one above all, apache preforking model with mod_php)
[19:05:56] <gehel>	 bblack, ema : my understanding is that we have a cap on TTL at 1 day for cache-maps (https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/cache/maps.pp#L48) is that correct ?
[19:06:57] <gehel>	 I see tiles with wrong content on maps, which should have been fixed much earlier than 24h ago (https://maps.wikimedia.org/osm-intl/11/1860/1298.png)
[19:07:56] <gehel>	 I can't find an explanation...
[19:10:44] <bblack>	 gehel: technically, the applayer sets the TTL
[19:10:51] <bblack>	 varnish caps it, but only per-layer, not globally
[19:11:24] <gehel>	 ok, so with 3 layers of varnish, we might get 72h of cache?
[19:12:05] <bblack>	 it doesn't really work like that usually, but that's the idea for overall cap.  it varies by location and situation how many layers there are
[19:12:19] <bblack>	 for ulsfo and maps I think it would be 4 layers presently
[19:12:28] <gehel>	 and no, the app layer is not setting any caching header at this point, which needs to be fixed
[19:12:45] <bblack>	 heh
[19:13:08] <gehel>	 Ok, then that explains what I'm seeing. And the fix is obviously to expose proper caching headers...
[19:13:21] <bblack>	 gehel: can you get a dump of headers of your bad request?
[19:13:31] <gehel>	 sure
[19:13:38] <bblack>	 because everyone can fetch that url differently, but the headers of your actual bad result are telling
[19:13:48] <bblack>	 (it may be a different image for me)
[19:14:33] <gehel>	 which headers are you interested in?
[19:15:07] <gehel>	 https://phabricator.wikimedia.org/P5038
[19:15:12] <bblack>	 X-Cache and Age mostly
[19:15:27] <gehel>	 < Age: 21365
[19:15:27] <gehel>	 < X-Cache: cp1047 miss, cp3003 miss, cp3003 hit/71
[19:15:41] <bblack>	 so it was fetched from maps applayer less than half a day ago
[19:15:54] <bblack>	 sounds like ttl caps are not the issue
[19:16:24] <bblack>	 (~6h ago)
[19:16:39] <gehel>	 the part that I dont understand is that the maps issue was fixed early Tuesday morning UTC
[19:16:48] <bblack>	 are you sure? :)
[19:17:12] <bblack>	 try requesting directly from all of the karto frontends
[19:17:14] <gehel>	 no I'm not, that why I'm wondering soo much
[19:17:39] <gehel>	 I did, as far as I can see, that same tile is looking good on all the eqiad nodes
[19:17:40] <bblack>	 (varnishes requests would roundrobin them all, could be just 1/N still giving old data?)
[19:17:57] <gehel>	 of course, I don't have a time machine to go see how it looked 6 hours ago
[19:18:03] <bblack>	 right
[19:18:14] <bblack>	 but I think earlier today you had someone ban it and it still came back bad?
[19:19:13] <gehel>	 I banned the a specific zoom level (10) where there was quite a few bad tiles. They did no re-apear
[19:19:53] <gehel>	 the tile linked above is not part of that ban. As this is a fairly minor issue, I wanted to dig a bit more first..
[19:21:27] <gehel>	 the issue is still present on codfw, but I see no reason for any traffic to end up on the codfw maps servers...
[19:22:25] <bblack>	 right, the applayer route is to eqiad
[19:22:46] <bblack>	 how do I tell good and bad apart?
[19:23:05] <bblack>	 for https://maps.wikimedia.org/osm-intl/11/1860/1298.png
[19:23:52] <gehel>	 if on the left you see a road in the middle of water (blue), it is bad, if you see a road on land (grey / yellow) it is good
[19:27:12] <bblack>	 I get different results from all 3 presently (the cached tile on varnish, codfw karto, and eqiad carto)
[19:28:30] <bblack>	 so when you banned zoom level 10, that was for this same reason right? that the bad ones should've expired by now but hadn't?
[19:28:57] <gehel>	 yep, exactly
[19:29:06] <bblack>	 did you ever figure out why?
[19:29:12] <bblack>	 (and did it work?)
[19:29:36] <gehel>	 it worked (or at least I dont see any bad tiles at zoom level 10)
[19:29:57] <bblack>	 ok
[19:30:03] <bblack>	 but going back to the time of th eban in backscroll
[19:30:04] <bblack>	 13:59 < gehel> ok, there is something wrong (probably with my regex)
[19:30:04] <bblack>	 13:59 < elukey> gehel: things not banned?
[19:30:05] <bblack>	 13:59 < gehel> https://maps.wikimedia.org/osm-intl/10/929/644.png still return "Age: 74730"
[19:30:21] <bblack>	 that was because bad ban execution or whatever, which was fixed after
[19:30:28] <gehel>	 but I don't know why we still had tiles at that time. Each bad tile that I checked against the applayer was rendered correctly at the applayer
[19:30:39] <bblack>	 but still, at the time you were saying "should've been fixed >24h ago", yet the bad cache object had <1d Age header
[19:31:18] <bblack>	 Age is cache-transitive, and since maps isn't providing an initial age it's invented at the backend-most varnish
[19:31:22] <gehel>	 the issue quoted was me purging only the frontend. Which I corrected afterward
[19:31:32] <bblack>	 so it does accureately represent "this object was fetched from kartotherian X seconds ago"
[19:32:10] <bblack>	 my point is even though the ban wasn't right, the evidence at the time was that the bad object was <24h old, which seems to not jive with this all being fixed for good at karto level >1d ago
[19:32:34] <gehel>	 exactly...
[19:33:47] <gehel>	 the most probable is that things were not fixed at kartotherian level when I thing they were, but I can't find any evidence of that. Which makes me wonder if we have another transient problem at the app level that hides every time I'm looking at it
[19:34:19] <gehel>	 but I dont find direct evidence of that, except those bad tiles in varnish...
[19:34:35] <bblack>	 I'd say the age values are evidence
[19:37:04] <gehel>	 yes, but it does not correlate with any other observation...
[19:37:42] <bblack>	 :)
[19:38:01] <gehel>	 I went over all of tasmania this morning at zoom level 11 on each of the server, I did not see any bad tile...
[19:38:02] <bblack>	 it doesn't have to if varnish reqs are more common than your test reqs and its intermittent
[19:38:27] <gehel>	 yeah, all point to intermitent ... I don't like that...
[19:38:49] * gehel will try to find a way to dig more into this...
[19:38:50] <bblack>	 or, it could point to basic http issues
[19:38:58] <bblack>	 you mentioned earlier than etags were handled wrong?
[19:39:19] <gehel>	 yes, etags are not consistent between nodes
[19:39:28] <bblack>	 also:
[19:39:30] <bblack>	 < Last-Modified: Thu, 01 Jan 1970 00:00:00 GMT
[19:39:37] <bblack>	 ^ all tiles seem to return that
[19:39:39] <gehel>	 yes, also noted :(
[19:39:50] <bblack>	 that could be the entirety of the problem here
[19:40:06] <bblack>	 can you strip that at the front edge of karto, since it's completely wrong information? (the LM header)
[19:40:24] <bblack>	 or change its IMS behavior?
[19:40:48] <gehel>	 T108435
[19:40:48] <stashbot>	 T108435: Add proper expiry headers to kartotherian's responses - https://phabricator.wikimedia.org/T108435
[19:41:17] <bblack>	 curl -vH 'Host: maps.wikimedia.org' -H 'If-Modified-Since: Thu, 01 Jan 1970 00:00:00 GMT' http://maps1004.eqiad.wmnet:6533/osm-intl/11/1860/1298.png
[19:41:24] <bblack>	 < HTTP/1.1 304 Not Modified
[19:41:38] <bblack>	 so this is a pretty valid explanation of what's happening:
[19:41:39] <gehel>	 I need to understand how those node services manage headers and fix them
[19:41:54] <bblack>	 doesn't it have an apache or something in front, or is it direct-to-nodejs?
[19:42:04] <gehel>	 direct to node
[19:42:07] <bblack>	 anyways, on to explanation:
[19:42:28] <bblack>	 1) maps1004 sends "bad" tile to varnish, with Last-Modified: Thu, 01 Jan 1970 00:00:00 GMT
[19:42:48] <bblack>	 2) maps1004 data fixed
[19:43:24] <bblack>	 3) bad tile expires from varnish cache, varnish refreshes with "If-Modified-Since: Thu, 01 Jan 1970 00:00:00 GMT", but karto says "304 not modified"
[19:43:43] <bblack>	 4) Thus, varnish refreshes the age on the existing object (which gives us our low age) and keeps seving bad content
[19:44:25] <bblack>	 when things expire on TTL in varnish, the object is (usually, optimization) still in storage under a longer "keep" TTL, and then used for IMS conditions like these, which when the object didn't actually change saves the BW of re-transfering an object we already have the contents of
[19:44:27] <gehel>	 Ow, that sound very plausible! I don't know why, but I expected varnish to not do conditional get
[19:44:38] <bblack>	 varnish is all about conditional get, it's a big win
[19:44:48] <gehel>	 makes a lot of sense!
[19:45:03] <bblack>	 if your server lies about the fact that the object hasn't changed since 1970 forever, even though it's constnantly changing, that will screw things up
[19:45:28] <gehel>	 Time for dinner here... have to run.
[19:45:43] <gehel>	 Thanks a lot for all the thinking! Enlightning as always!
[19:47:25] <bblack>	 ideally LM should actually be correct (when tile was updated), and then we get correct behavior and good efficiency
[19:47:41] <bblack>	 but lacking the ability to do that, a quick hack would be to just suppress the output LM header altogether
[20:25:03] <justinl_>	 Hey all, I have what may be a dumb MW/Varnish question...
[20:25:42] <justinl_>	 I may have been doing something wrong for years now. I have four load-balanced servers for my wikis, each running Varnish on 80 with Apache as its backend on 8080.
[20:26:31] <justinl_>	 In my LocalSettings.php for my wikis, however, I have the IP addresses of all four web servers so that purges from one would go to all of the servers. Is that right or should each server only send purges to its own Varnish?
[20:29:23] <bblack>	 justinl_: assuming it's the same content on all four servers (as opposed to, say, a sharded setup where each of the 4 services separate wikis), then the purges should be going to all, yes.
[20:31:01] <bblack>	 the idea being that whlie the edit of /wiki/Foo may have happened on server1, over the past [cache TTL] /wiki/Foo may have been read by different users on all 4x of your varnish frontends, and therefore needs purging from them all to update everywhere.
[20:33:55] <justinl_>	 bblack: Ok, thanks. It is the same content, shared amongst the servers via NFS. This thought just struck me last night as I'm considering architectural requirements for migrating my environment to AWS.
[20:34:54] <justinl_>	 If I move to a couple of dedicated Varnish servers in front of possibly load-balanced EC2 web servers, for sure I'd want all of the Varnish servers in the EC2 serers' $wgSquidServers lists.
[20:35:14] <bblack>	 justinl_: yes! :)
[20:35:47] <justinl_>	 That's another catch, though, is that if I were to have ELB between Varnish and web servers, the web servers would still need to know about the Varnish IPs for the purges.
[20:40:12] <bblack>	 yeah, possibly
[20:40:29] <bblack>	 you could set up something to advertise them to a registry or something like that, though
[20:41:30] <justinl_>	 That's the catch with trying to figure out autoscaling of Apache and maybe Varnish for a MediaWiki environment is automated management of the IPs. I use SaltStack for configuration management, so I'd probably have to set up some reactors to update LocalSettings.php when hosts are dynamically added or removed.
[20:41:30] <bblack>	 AWS probably even has a KV-store you could use for such things as a service
[20:42:08] <justinl_>	 Maybe Elasticache? I'll be using it to replace Memcached, unless Aurora is fast enough to obviate the need for such a cache.
[20:42:19] <bblack>	 yeah that's one possibility
[20:42:27] <bblack>	 or something etcd-like
[20:42:38] <justinl_>	 Salt has Salt Mine though, so that could work if there's not an ideal AWS service.
[20:47:41] <SMalyshev>	 bblack: hi! do you have a minute? wanted to talk about https://phabricator.wikimedia.org/T159574
[22:17:41] <wikibugs_>	 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Segment Navigation Timing data by continent - https://phabricator.wikimedia.org/T128709#3089607 (10Krinkle)
[22:37:40] <bblack>	 SMalyshev: the short answer is we can't do client affinity for you in varnish
[22:49:46] <SMalyshev>	 bblack: what about IP affinity in LVS? also won't work?
[22:50:59] <SMalyshev>	 or in fact any sort of affinity that ensures requests from one host within short time go to one server?
[22:56:34] <bblack>	 SMalyshev: LVS can do affinity for the direct clients of LVS, but wdqs's clients are varnish caches.  varnish can do affinity for clients->backends when it has direct knowledge of the backend cluster nodes instead of through an abstraction like LVS, but we currently use that abstraction.
[22:57:43] <bblack>	 we made a decision a while back that at least for the present, we're not supporting client affinity for applayer services.  it simplifies a bunch of things architecturally and virtually no service actually needed it.  It gives us the LVS abstraction beneath varnish as a constant.
[22:58:56] <bblack>	 I really do suspect that will be re-visited at some later date, once we're past some other future architectural changes (and varnish is largely gone and internal connections are almost always TLS), but those are far-off plans
[23:00:26] <bblack>	 (at the time we made that call, I think rcstream and maybe... was it logstash? were the only ones making use of client affinity - 2x misc services, and none of our major services.  We moved those to using a single node for traffic or other such solution)
[23:04:01] <SMalyshev>	 bblack: so basically recommended solution for now is single-node?
[23:06:47] <bblack>	 if you can't fix the fundamentals some other way, yeah, use a single node for traffic (with the option to keep others hot and ready for manual failover)
[23:07:53] <bblack>	 usually with this issue, the fundamental problem is per-user state, in which case the better answer is keep using multiple nodes and track necessary per-user state in the applayer itself instead of hoping the balancers do it for you.
[23:08:19] <bblack>	 but in your case, I think this is more about consistent ordering for multiple queries that iterate a dataset via index
[23:11:34] <bblack>	 the other option is you build your own client affinity proxy layer on top of wdqs, inside the same nodes
[23:12:53] <bblack>	 conceptually the idea would be that all your nodes have apache or nginx at the front (as what varnish->LVS routes requests into randomly), and those apache/nginx instances hash on the received X-Client-IP header to choose a consistent applayer backend (which might be the same the request arrived at apache/nginx on, or one of the others)
[23:13:27] <bblack>	 it's what we would do in LVS, if LVS had visibility into the HTTP headers to make that decision
[23:14:53] <bblack>	 but still, you're solving the "inconsistent index" problem there by making it consistent for a given client IP, which is not the same as solving it generically
[23:15:19] <bblack>	 when two separate clients query a given index position in the data, they'll still see different results (with any client affinity solution here), which might eventually matter to some use-case