[08:49:49] <wikibugs>	 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4115944 (10ema) >>! In T187014#4113030, @Nuria wrote: > @ema: on our end we just look at the ip passed along via varnishkafka  to geolocate, not...
[08:50:55] <vgutierrez>	 morning ema :D
[08:51:26] <ema>	 o/
[08:54:17] <vgutierrez>	 like bblack suggested, I'm looking at removing interface references from site.pp on lvs* node definitions, the "natural" place should be hieradata/common/lvs/interfaces.yaml
[08:55:22] <vgutierrez>	 should I massage the current lvs::interfaces::vlan_data or just define a new hash?
[08:59:10] <ema>	 vgutierrez: if you manage to use the current vlan_data without causing a pandemic, I'd suggest doing so 
[09:00:07] <vgutierrez>	 let's try then :D
[12:01:12] <ema>	 so, {bereq,beresp}.uncacheable and builtin vcl:
[12:01:52] <ema>	 (1) bereq.uncacheable is set by the varnish client side for requests which are either a pass or a hit on a hit-for-pass object
[12:02:47] <ema>	 (2) beresp.uncacheable is the equivalent of (1) on the varnish backend side. Setting it via vcl may result in the creation of a hit-for-miss object
[12:03:37] <ema>	 s/equivalent of/inherited from/
[12:04:23] <ema>	 the first thing vcl_backend_response does in the builtin vcl is checking if (1) is set. If so, it returns deliver immediately
[12:05:36] <ema>	 otherwise it checks if the response is uncacheable by looking at {Cache,Surrogate}-control. If the object is uncacheable, a 120s hit-for-miss is created
[12:06:36] <ema>	 no hit-for-pass is ever created by the builtin vcl
[12:08:43] <ema>	 also, from https://varnish-cache.org/docs/5.1/users-guide/increasing-your-hitrate.html#passing-client-requests:
[12:08:56] <ema>	 > When a request is passed, this can be recognized in the vcl_backend_* subroutines by the fact that bereq.uncacheable and beresp.uncachable are both true.
[12:10:11] <ema>	 I thought I understood the {bereq,beresp}.uncacheable thing till I found this statement ^
[12:10:57] <ema>	 vcl(7) says that beresp.uncacheable is "inherited from" bereq.uncacheable
[12:13:34] <ema>	 yet they can differ? In which cases? Only if a hit-for-miss is created in vcl_backend_response while on the varnish client side the request was neither a pass nor a hfp hit? 
[12:19:09] <ema>	 a hfm object is created by setting a ttl, setting beresp.uncacheable to true and returning deliver. If varnish does set beresp.uncacheable on its own in certain cases, does that mean that we might unwillingly end up creating hit-for-miss objects when setting a ttl and returning deliver?
[12:26:18] <ema>	 looking at bin/varnishd/cache/cache_fetch.c, bo->uncacheable is set if:
[12:26:33] <ema>	 - there's a Vary parse error
[12:26:48] <ema>	 - bo->do_pass is set
[12:27:00] <ema>	 - wrk->handling == VCL_RET_PASS
[12:28:17] <bblack>	 it seems to make basic sense to me, above
[12:29:07] <bblack>	 bereq.uncacheable means a decision to pass was made at request-time (vcl_recv returns pass, or hitting an hfp-type object), and would also imply bereq.uncacheable when the corresponding backend fetch is made.
[12:30:40] <bblack>	 however, even when bereq.uncacheable is false (a true hit on a real object, or a true miss that thinks the response could potentially be cacheable), it's possible to set beresp.uncacheable to true: internally for cases like Vary parse error, or in VCL because of observation of no-cache response headers or other such conditions (e.g. 5xx).
[12:31:39] <bblack>	 one of the critical areas for tuning varnish performance is coalesce behavior.  making sure you coalesce when that would help, and that you don't when that would hurt.
[12:32:08] <bblack>	 anytime bereq.uncacheable and the final state of beresp.uncacheable differ, it implies we're in some grey area about coalesce behavior that we should try to at least heuristically deal with as ideally as we can.
[12:32:22] <bblack>	 (e.g. with an hfp or hfm)
[12:32:43] <bblack>	 (or not, as the our best guesses based on response data indicate!)
[12:35:58] <bblack>	 one of the things that really bugs me from analysis last week, still, is that there's no parametric limitation on object lifetimes when they're hit/used.
[12:36:37] <bblack>	 by that I mean, when an object is found in storage, no check is done for whether it's over-stale.  It's assumed that the expiry thread performs perfectly and the object wouldn't exist if it wasn't useful, which seems faulty :P
[12:38:04] <bblack>	 (and then on top of that, it's up to VCL to distinguish grace-vs-keep.  They might as well be a single "grace" field, and your VCL using internal setting of special marker-headers to denote the grace/keep distinction, as the core doesn't care)
[12:40:18] <ema>	 why would that be a problem? The expiry thread never lags behind! :)
[12:40:25] <bblack>	 right :)
[12:41:02] <bblack>	 and then, apparently the critical trigger condition for 1799 is vcl_hit ever returning "miss", which is exactly the only way to distinguish grace from keep usefully.
[12:41:27] <ema>	 to avoid a background fetch, yes
[12:41:48] <bblack>	 when you combine the behvaiors of Varnish C code and the builtin VCL (or any sane VCL with perfect knowledge of how this is supposed to work)
[12:42:04] <bblack>	 a vcl_hit on an object within TTL is a normal hit (and should normally just return deliver)
[12:42:30] <bblack>	 a vcl_hit on an object past its TTL (and thus within grace or keep or expiry thread lag beyond that) which returns deliver is grace-like behavior (return object and bgfetch replacement)
[12:43:06] <ema>	 I think this is only partially true ^
[12:43:07] <bblack>	 a vcl_hit on an object past its TTL (and thus within grace or keep or expiry thread lag beyond that) which returns miss is keep-like behavior (stalling fetch, but can use the old object as a conditional-fetch source to avoid body transfer as an optimization)
[12:43:59] <ema>	 what I've seen with vtc is that only the first hit on a object w/ keep, w/o grace and w/ return(deliver) behaves like grace 
[12:44:42] <bblack>	 well, that makes sense, as the bgfetch happens
[12:45:09] <bblack>	 a further request during bgfetch would coalesce on the bgfetch part, or would use the new cache object created by the bgfetch if it's later
[12:47:21] <bblack>	 the critical-est question before us in the moment, I think, is what to do about the present situation, where the patch that wiped out our vcl_hit logic also solved our major outstanding issues.
[12:47:28] <ema>	 yes but a further request during bgfetch would result in the object being returned, hence grace!=keep/lagging expiry in this case
[12:47:45] <ema>	 a further request during bgfetch on a object w/ grace, that is
[12:47:50] <bblack>	 if we revert to making a vcl_hit->miss distinction, we regain our problems.  If we stay as we are, we're clearly using a highly-illegal and eventually probably very problematic 7d grace period.
[12:48:39] <bblack>	 if we kill grace as the answer, it may be the undoing of expiry optimizations we've been implicitly relying on for a while
[12:49:46] <bblack>	 (that our large keep values commonly put much object expiry purging beyond the lifetime of the daemon's weekly restarts)
[12:50:45] <ema>	 I think first of all we should go back to weekly restarts rather than every 3.5 days
[12:51:06] <bblack>	 but you could also theorize those are mis-optimizations.  perhaps everything is better without keep times at all (and thus having the expiry thread doing more work based on pushing out stale-TTL objects, freeing more room faster and avoiding in-the-moment nukes for space for new requests, which are the ones that use the mailbox)
[12:51:10] <ema>	 just to confirm that indeed the situation has improved because of the vcl_hit logic thing instead of more frequent restarts
[12:52:06] <bblack>	 (we'd still find keep times desirable, as they can reduce transfer bursts over our transit links when bringing users back to a stale cache, but that optimization seems less important than other problems here, and eventually gets solved by geodns ramp-in knobs)
[12:52:17] <bblack>	 (s/transit/transport/ above)
[12:53:11] <bblack>	 +1 we should start with reverting the 3.5-day restarts
[12:53:23] <gehel>	 ema / bblack: Hello! I'm trying to configure a new endpoint for our new wdqs-internal cluster (https://gerrit.wikimedia.org/r/#/c/424587/ & https://gerrit.wikimedia.org/r/#/c/424599/ ).
[12:53:27] <bblack>	 let's discuss your vtc results about grace behavior a bit more though, clearly I don't understand something
[12:54:04] <gehel>	 It looks like the way we manage LVS has changed since the doc was written (https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service) I did my best to guess, but a careful review would be welcomed!
[12:54:16] * gehel promise to update the doc...
[12:54:29] <bblack>	 gehel: my first thought at a glance is lose the "-internal" part of the hostname.  Everything in .svc.*.wmnet is "internal" :)
[12:55:13] <bblack>	 and then my next question after spending 10 more seconds orienting myself, is why you need a separate one from the wdqs service you already have?
[12:55:24] <gehel>	 we already have a "wdqs" service which is external
[12:55:49] <bblack>	 but it's not really external.  it's an internal service which is consumed by the misc cluster to give it an externally-reachable point.
[12:56:16] <gehel>	 because the external traffic is (obviously) uncontrolled and can lead to widely varying load and response times of the service
[12:56:34] <gehel>	 so we want to isolate internal (safe) traffic from external (experimental) traffic
[12:56:35] <bblack>	 the normal pattern for other services like this is we have a foo.svc.eqiad.wmnet, and all internal clients of the foo service contact that directly
[12:56:49] <bblack>	 and we set up a public endpoint via the traffic clusters which also backends to that foo.svc.eqiad.wmnet
[12:57:53] <gehel>	 in our case, we want more isolation than that. Some of the rational is on T178492
[12:57:54] <stashbot>	 T178492: Create a more controlled WDQS cluster - https://phabricator.wikimedia.org/T178492
[12:58:21] <bblack>	 yeah it's an interesting case...
[12:58:57] <bblack>	 but to be an annoying perfectionist, that ticket starts with the premise and initial statement "WDQS is by design a fragile service"
[12:59:03] <bblack>	 maybe that's the problem to fix here :)
[12:59:39] <gehel>	 basically, WDQS is similar to exposing the full power of an SQL endpoint to the internet. It does have value, but making it robust enough while keeping its power is probably near impossible
[13:00:44] <bblack>	 right
[13:00:53] <bblack>	 arguably, it shouldn't ever really be exposed broadly to the public
[13:01:28] <bblack>	 (it should be, like a SQL database server, an internal service against which more constrained/resilient/sanity-enforcing public-facing services are written atop of)
[13:01:29] <gehel>	 that decision was made a long time ago and it has proven to be useful enough that killing that external service is probably not an option
[13:02:12] <gehel>	 and in the end, having a service with varying response time is fine as long as it is part of the expectations of the users.
[13:02:21] <bblack>	 and I think that's probably what you're aiming towards.  the wdqs-internal service is the backer of these other more-resilient things that ride atop it internally.
[13:02:30] <gehel>	 yep, exactly
[13:03:18] <bblack>	 but it still begs the question why you'd (even as readonly or only-SELECT in sql-metaphor terms) want the public "experimental" interface exposed without control
[13:04:12] <bblack>	 (e.g. logins limited to known researchers or something, as opposed to just wide open for abuse by someone spamming awful degrading queries into it)
[13:04:37] <gehel>	 good question!
[13:05:17] <bblack>	 I guess the answer to that question doesn't solve anything, though.
[13:05:30] <bblack>	 either way, your known researchers will sometimes kill performance accidentally
[13:05:46] <gehel>	 yep, even with an authenticated public endpoint, we'll want a second internal endpoint
[13:05:51] <gehel>	 s/public/external/
[13:06:04] <bblack>	 I think I'm just biased against the naming here, which is pure bikeshedding :)
[13:06:29] <gehel>	 yeah, I'm not entirely convinced by the name, but no one proposed something better yet :)
[13:06:40] <gehel>	 and at least it makes its purpose quite clear
[13:06:52] <bblack>	 better would probably be to flip the defaults, but that would be annoying to go through for bikeshedding purposes at this point
[13:07:06] <bblack>	 (s/wdqs/wdqs-public/ + s/wdqs-internal/wdqs/)
[13:07:27] <gehel>	 or even wdqs-public and wdqs-internal, be explicit on both...
[13:07:32] <bblack>	 sure
[13:07:57] <bblack>	 at the end of the day, it's confusing and odd to me to have two internal service endpoints, one called "foo" and one called "foo-internal" :)
[13:07:59] * gehel prefers explicit :)
[13:08:20] <bblack>	 it makes my brain go, "well, foo was already internal, what's going on here?"
[13:08:29] <gehel>	 agreed...
[13:09:08] <bblack>	 ok now that I'm done wasting time going in circles, I should really look at the patches heh
[13:09:10] <gehel>	 we could find some variation (wdqs-controlled / -stable / -...) but to clear the ambiguity, we probably want to rename wdqs to wdqs-public
[13:09:26] * gehel adds that to his todo-list
[13:11:28] <bblack>	 gehel: so your pair of patches do seem to define the LVS side of things
[13:11:50] <bblack>	 but I think what (might?) be missing unless I'm failing to see some automagic, is the host-side stuff for the loopback IP
[13:12:07] <ema>	 s/semiweekly/weekly/ -> https://gerrit.wikimedia.org/r/#/c/425046/
[13:13:40] <gehel>	  bblack: that might well be missing...
[13:14:05] <bblack>	 yeah +1 that and abandon https://gerrit.wikimedia.org/r/#/c/421943/ or vice-versa (whatever, but I lost my gerrit cookie and it will be a few before I get my yubikey and get it back)
[13:14:37] <vgutierrez>	 ema: including interface_tweaks settings on vlan_data hash was a nightmare, I went for this instead: https://gerrit.wikimedia.org/r/#/c/425040/
[13:17:01] <gehel>	 bblack: Oh, I see, I missed the include of "::role::lvs::realserver". And if I add that, IP is automagically resolved to what is in 	hieradata/common/lvs/configuration.yaml, right?
[13:18:15] <gehel>	 I'm also entirely unsure what is needed for DNS based service discovery...
[13:18:36] <wikibugs>	 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4116636 (10Ottomata) > No, X-Client-IP is either:  ...ehhh wha?  We used to collect XFF on the webrequest side, and then parse it to get `ip`....
[13:22:48] <bblack>	 gehel: on the puppet side, yes, I think modules/role/manifests/wdqs_internal.pp is missing "include ::role::lvs::realserver" like wdqs.pp has, to arrive at a sane final state.
[13:24:01] <bblack>	 gehel: there's probably some quibbles about arriving at that sane state in steps where nothing is failing in the interim, but for a new service I think things can be faily for a few while things get puppetized in various places
[13:24:26] <gehel>	 Patch updated
[13:24:52] <bblack>	 (i.e. you could arrive there in steps by first just defining the IPs but not using them for the LVS-side config, and then doing the role::lvs::realserver to bring up the IPs on the new hosts, and then bringing up the LVS part that tries to hit them, as separate steps)
[13:25:02] <bblack>	 but like I said, probably unecessary in practice
[13:25:33] <gehel>	 Ok, I'll try to add that to the doc for next time
[13:25:45] <bblack>	 DNS service discovery is ???, I don't recall all the details off-hand, or the distinctions for where we do and don't use that
[13:26:46] <gehel>	 I was thinking about https://wikitech.wikimedia.org/wiki/DNS/Discovery
[13:26:49] <bblack>	 is there even a wdqs_internal service in eqiad (yet? soon?)
[13:27:07] <gehel>	 there is already eqiad and codfw
[13:27:17] <gehel>	 both for the internal and external part actually
[13:27:39] <bblack>	 I only see codfw definitions in the patch
[13:27:56] <bblack>	 oh no, I just missed something mentally, it's there
[13:28:05] <bblack>	 ok
[13:28:13] <gehel>	 :)
[13:28:16] <bblack>	 yeah the wikitech dock on dns discovery is lacking
[13:28:19] <bblack>	 s/dock/doc/
[13:28:54] <gehel>	 I can try to guess, based on another service, but I don't even know which one to take as an example :/
[13:28:56] <bblack>	 a while back when all of that mechanism was put in place, there were some outstanding design discussions that I guess were never resolved, and the current state of affairs just isn't well-documented
[13:29:17] <bblack>	 (re where and how a new service is patched in functionally without breaking things, in various puppet- or dns- side commits)
[13:30:07] <bblack>	 I think you have to patch the ops/dns repo's config-geo-test file separately first for non-breakage
[13:30:26] <bblack>	 after that, I'm fuzzy, but I think then there's some bits to do on the puppet side...
[13:30:50] <bblack>	 maybe at this point, https://gerrit.wikimedia.org/r/#/c/424599/1/conftool-data/discovery/services.yaml covers all of that latter part
[13:30:54] <gehel>	 I found https://gerrit.wikimedia.org/r/#/c/424599/1/conftool-data/discovery/services.yaml which seems to be related to discovery
[13:31:51] <bblack>	 if you don't patch the new name into config-geo-test first, when you puppetize the puppet side of DNS discovery the deploy to the DNS servers will cause DNS sanity-checks to start failing artificially for no good reason
[13:33:27] <gehel>	 ok, I'll add that...
[13:36:20] <gehel>	 any idea how disc-wdqs-internal (the entry in config-geo-test) is mapped to the correct service? Just by naming convention?
[13:43:59] <bblack>	 it's not, the whole point of that file is just to mock fake data to make a CI/deployment check not fail
[13:45:24] <gehel>	 That makes more sense! I was all confused by "mock" in that file.
[13:46:27] <gehel>	 So the actually entry is not in the dns project at all, but purely generated from puppet?
[13:46:34] * gehel needs to read some more puppet code
[13:47:51] <bblack>	 right, more-or-less
[13:48:33] <bblack>	 puppetization on the dns servers creates some dns config data related to DNS-disc, which is mostly-independent of the usual ops/dns + authdns-update route of changing DNS things
[13:48:47] <bblack>	 but the mock test data is where the collision of all such things is problematic
[13:50:26] <gehel>	 I'll just your word on it :)
[13:51:09] <bblack>	 the usual gerrit CI checks for ops/dns commits, and also the preflight checks that happens on the DNS hosts themselves during authdns-update I think, cannot see the data generated by the dns-discovery puppetization.
[13:51:44] <bblack>	 so the config-geo-test mock data makes that work, otherwise they would see zonefile references to the new service, but no gedns definition (which is the part coming from puppet), and fail.
[13:53:20] <bblack>	 gehel: for using discovery dns, you also need a fixup to your DNS side patch in https://gerrit.wikimedia.org/r/#/c/424587/1/templates/wmnet
[13:53:33] <bblack>	 to add the entry at the bottom of that file, the one for the original wdqs looks like:
[13:53:38] <bblack>	 wdqs           300/10 IN DYNA geoip!disc-wdqs
[13:53:42] <bblack>	 (in origin discovery)
[13:54:15] <gehel>	 right, that was the magic I did not understand!
[13:54:37] <bblack>	 and that's the part that, when committed to the DNS repo (or even attempted through gerrit CI), will fail until the mock data is present
[13:54:49] <gehel>	 so I should probably merge the change to config-geo-test in the same patch so that things don't break, right?
[13:55:04] <bblack>	 yeah, I think that works
[13:55:43] <bblack>	 none of how it works makes a ton of logical sense to me, and thus I have a hard time remembering procedures about it :)
[13:56:09] <gehel>	 at least that makes me feel less bad about doing it all wrong on my own :)
[13:56:34] <bblack>	 there's a lot of layers of abstraction involved and they're not very clean! :)
[13:58:33] <bblack>	 so yeah, if you want to make something on wikitech that makes this all make sense and makes some future discovery+lvs service deployment simpler for someone
[13:58:36] <ema>	 bblack: does netmapper.map("proxies", ...) read /var/netmapper/proxies.json or am I going mad?
[13:58:37] <bblack>	 feel free, it would help :)
[13:58:45] <bblack>	 ema: it does
[13:59:12] <ema>	 bblack: mmh, that file contains {"Test1": [], "Test-all": []} on all cp hosts...
[13:59:16] <gehel>	 bblack: I'm not sure I understand enough to actually document it, but I'll try...
[13:59:45] <bblack>	 ema: yeah, that's why I said " is defined by Zero's proxy data, which at least in the past included some definition of OperaMini", because I'm pretty sure I saw the same recently
[14:00:04] <bblack>	 and that may in the end be the change they're looking for here: when did Zero stop giving Varnish a list of OperaMini proxy IPs? :)
[14:00:13] <ema>	 ah!
[14:00:53] <bblack>	 I don't know how well we can even track the history of that
[14:01:17] <bblack>	 all the data from zero (proxies.json + carriers.json) is something we grab from the internal Zerowiki, which partnerships manually manages the data in
[14:01:30] <bblack>	 I don't know if whatever scripts/tools they use for that even record history in the usual wiki sense of changes
[14:02:14] <bblack>	 but either way, there's some mixing of understandings and purposes I think
[14:03:06] <bblack>	 as far as Zero's concerned, the only reason they'd even create an OperaMini entry in proxies.json data is to serve their own purposes.  e.g. if they ran out of cases where they have active contracts with carriers that need OperaMini proxy data for us to parse Zero banner-setting correctly, they'd have no reason to keep including that data
[14:03:15] <bblack>	 whereas analytics may have been independently relying on it
[14:03:38] <bblack>	 (indirectly, in its effects on our XFF parsing)
[14:04:24] <wikibugs>	 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage - https://phabricator.wikimedia.org/T191028#4116773 (10Gilles) 05Open>03Resolved The issue should fix itself (and it has, for the file menti...
[14:05:44] <bblack>	 in the broader and longer-term sense, we've had various discussions at various levels before (there's even a ticket somewhere!) about the need to have a proxies.json that netmapper sees which is managed outside of Zero, publically, more like community-edited/managed and trying to keep up to date with all known trustable 3rd-party proxies.
[14:05:54] <bblack>	 e.g. put it on metawiki
[14:06:33] <bblack>	 something like a better replacement for the community-managed XFF-parsing stuff that already exists at some level (for wiki admins dealing with abuse cases, etc)
[14:06:41] <bblack>	 (but is also poorly-managed)
[14:06:52] <ema>	 does that have the ultimate goal of knowing the IP of the actual client?
[14:07:20] <bblack>	 https://phabricator.wikimedia.org/T89838
[14:07:40] <bblack>	 it has the ultimate goal of trusting proxies that we deem trustworthy
[14:08:08] <bblack>	 we can't ever be sure we know the IP of the actual client.  There will be proxies that choose to be intentionally-silent about client IPs behind themselves.
[14:08:31] <bblack>	 There will also be non-proxies that choose to send us fake/invalid/misleading XFF headers that we should ignore-by-default
[14:09:25] <bblack>	 but for cases where the proxy is a known entity deemed trustworthy in its XFF data (e.g. OperaMini), we'll accept their XFF data (at least, the next entry beyond the proxy itself, but not other junk the client may have injected before the proxy that the proxy failed to clean out)
[14:09:39] <ema>	 fair enough
[14:10:19] <bblack>	 the varnish code is all already in an ideal state on all such matters.  the problem is the source/management of the proxies.json data that drives it.
[14:10:36] <bblack>	 (well, reasonably-ideal anyways!)
[14:12:26] <bblack>	 tripping down a few links beyond the above-linked ancient ticket, we arrive at:
[14:12:29] <bblack>	 https://phabricator.wikimedia.org/T120121
[14:12:49] <bblack>	 which talks about the parts of the varnish implementation that are not-yet-perfectly-ideal (but still, proxies.json data is the bigger problem)
[14:15:50] <bblack>	 it'd be nice if our typical lag-time on resolving tickets like these was < 3 years heh
[14:16:07] <bblack>	 (but unrealstic given resourcing!)
[14:18:37] <ema>	 <3 years (sorry, I had to)
[14:19:11] <bblack>	 aside from operamini (and apparently zero used to give us a similar nokia proxy), really we probably receive a lot of trustworthy XFF data from bigger proxies like google and fb that it would be nice to see through too
[14:19:24] <bblack>	 (and set X-Trusted-Proxy accordingly as well)
[14:19:52] <bblack>	 I'm sure there's others, and at this point in history it's probably a much more manageable dataset than it used to be.
[14:20:10] <bblack>	 it used to be that the world of somewhat-legitimate proxies was vast with a long tail, pre-HTTPS
[14:20:33] <bblack>	 in the post-HTTPS world, I think there are probably fewer major cases we care about that hide lots of clients behind a proxy service.
[14:25:31] <bblack>	 e.g. I can see some live reqs right now, when I dig, that look like legitimate pageview types of traffic, but are characterized by:
[14:26:07] <bblack>	 -   ReqHeader      X-Client-IP: 64.233.172.135    <which is a Google IP>
[14:26:14] <bblack>	 -   ReqHeader      User-Agent: Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19
[14:26:30] <bblack>	 <googlewebglight probably meaning some kind of proxy for mobile performance like AMP or similar)
[14:26:59] <bblack>	 and the initial XFF from nginx->varnish is:
[14:27:16] <bblack>	 -   ReqHeader      X-Forwarded-For: <some legit client IP>, 64.233.172.135
[14:27:44] <bblack>	 we could probably research/acquire a list of Google's proxy IPs for such traffic...
[14:28:52] <bblack>	 anyways, like I said, if we stick to the statistically-significant cases, there's probably only a few major ones to care about these days for HTTPS traffic.
[14:29:10] <bblack>	 (for better or worse, poor users not understanding trust/privacy and all)
[14:34:30] <bblack>	 I get what these proxy companies are doing that's positive (optimizing pages for speedy loading on slow devices+links is the gist of it), but they're marketed as pure benefits with no downsides.  They don't exactly go out of their way to say "Hey but if you use this optimized service, our company gets to play man-in-the-middle for all your traffic, which amounts to you putting a ton of trust in o
[14:34:36] <bblack>	 ur hands, maybe more than you do your bank"
[14:36:58] <bblack>	 ("... and perhaps moreso than your bank, we're basically unregulated, like to move fast and break things, have a history of carelessness with user data, and also make money from selling correlated user data...")
[14:44:26] <ema>	 bblack: I've tried to write a vtc test to show some interesting grace/keep behavior https://phabricator.wikimedia.org/P6970
[14:44:36] <ema>	 this one here uses 0 grace and a 5s keep
[14:45:11] <ema>	 note how the request sent by c4 actually receives the second origin server response
[14:45:44] <bblack>	 I have a feeling this is a deep rabbithole.  it's an essential one I need to stare at, but it'll have to wait for sometime later today :)
[14:46:48] <bblack>	 is the gist that there are functional differences if vcl_hit just returns deliver, and you swap grace=0/keep=5 for grace=5/keep=0, or something more subtle than that?
[14:47:18] <bblack>	 (without some VTC evidence, I still tend to believe that grace/keep are identical other than logic in vcl_hit)
[14:47:56] <ema>	 the gist is that swapping grace and keep in that vtc leads to important functional changes
[14:48:13] <bblack>	 hmmmm
[14:48:16] <ema>	 w/ grace=0/keep=5 we send two requests to the origin server
[14:48:32] <ema>	 w/ grace=5/keep=0 we only send one
[14:49:11] <ema>	 also, the vtc test above results in two different responses, while varnish counters still say there's been only one cache miss
[14:49:50] <bblack>	 also, I would've thought your server s1 responses were funny, but maybe I'm misremembering standards
[14:50:14] <bblack>	 in the absence of other indicative headers, shouldn't max-age:2 + LM:<off in the distant past> imply a negative TTL from the outset?
[14:50:25] <bblack>	 or does LM never imply the start-time, even in the absence of other headers?
[14:51:01] <bblack>	 I guess it should be relying on either Age or Expires normally
[14:51:17] <ema>	 I thought LM didn't imply start time
[14:51:18] <bblack>	 I just kinda figured (now-LM) filled in for Age if neither of those were present
[14:51:33] <bblack>	 maybe that's a bad assumption on my part
[14:51:37] <ema>	 it can be omitted anyways, I've added it to check if keep would result in a IMS bgfetch (it doesn't)
[14:51:43] <bblack>	 ok
[14:52:08] <wikibugs>	 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4117109 (10Marostegui) I have double checked with @ayounsi the hosts that need to have a longer downtime as they have to be moved physically to another rack, and those are the ones...
[14:52:22] <ema>	 which is yet another part of the puzzle: I thought keep was there to issue conditional requests to the origin...
[14:52:29] <bblack>	 the grace/keep-swapping differential in behavior seems at odd with source analysis that only the expiry thread pays attention to grace/keep values at all, by just summing them.
[14:52:42] <bblack>	 well
[14:53:08] <bblack>	 maybe something I'm failing to take into account mentally (you too?) is that there is an expiry thread running for the varnishd of the VTC too right? and it probably operates efficiently.
[14:54:03] <bblack>	 still shouldn't make a differential in swapping grace and keep though, I think
[14:54:11] <ema>	 but the test lasts < 5s (keep value)
[14:54:16] <bblack>	 but it would be nice to be able to halt the expiry thread for a special VTC test, too :)
[14:54:56] <bblack>	 meeting soon!
[14:56:37] <ema>	 bblack: not the SRE meeting right? That's in one hour on my calendar
[14:58:48] <vgutierrez>	 ~18.00
[14:59:37] <bblack>	 nice, timechange?
[14:59:41] <vgutierrez>	 yup
[15:00:02] <bblack>	 oh, it's correct on my calendar as well too
[15:00:20] <bblack>	 I have another entry before it, and I wasn't expecting that, so I assume it was our SRE meeting :)
[15:01:00] <bblack>	 (it's the CTO Office Hour in -staff)
[15:02:25] <bblack>	 I kind of like the office-hour thing, it's an interesting tool for having a focal point in time and space for $random questions
[15:02:46] <bblack>	 (from parties that might not be part of your normal heavy activity flows)
[15:03:09] <bblack>	 maybe there should be a #traffic office hour once a week heh
[15:17:14] <wikibugs>	 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4117194 (10mbaluta) >>! In T187014#4111935, @ema wrote: >>>! In T187014#4111691, @mbaluta wrote: >> If you provided IP address of our server, we...
[15:33:04] <wikibugs>	 10Traffic, 10Operations, 10Patch-For-Review: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173#4117269 (10RobH)
[16:53:57] <wikibugs>	 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4117606 (10ema) >>! In T187014#4116636, @Ottomata wrote: > ...ehhh wha?  We used to collect XFF on the webrequest side, and then parse it to get...
[17:25:42] <wikibugs>	 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4117734 (10BBlack) Right.  There was a time in the past when Zerowiki definitely provided some useful data on OperaMini (and also Nokia?) proxy...
[17:26:47] <wikibugs>	 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4117741 (10BBlack) Ping @DFoy - might know better about when OperaMini proxy data dropped from the Zero data, I don't have any good insight into...
[18:26:01] <mutante>	 fwiw, i see this file changed on each puppet run on bast5001
[18:26:02] <mutante>	 Notice: /Stage[main]/Role::Prometheus::Ops/File[/srv/prometheus/ops/targets/node_site_eqsin.yaml]/content: 
[18:26:33] <mutante>	 it removes some cp hosts and adds others
[18:26:38] <mutante>	 and then does it again on the next run
[18:39:18] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4117930 (10Papaul) DIMM 6 replaced DIMM 3 = bad DIMM sent from DELL need replacement again   Fan #5 replaced
[18:40:21] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2006 memory replacement - https://phabricator.wikimedia.org/T191223#4117932 (10Papaul) DIMM B2 replaced  DIMM B6 replaced  Server is back up
[18:46:35] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2010 memory replacement - https://phabricator.wikimedia.org/T191225#4117960 (10Papaul) DIMM B2 replaced DIMM B6 replaced  Server is back up
[18:59:13] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4117984 (10BBlack)
[18:59:16] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2006 memory replacement - https://phabricator.wikimedia.org/T191223#4117982 (10BBlack) 05Open>03Resolved Re-pooled into service.
[18:59:23] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (10BBlack)
[18:59:25] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2010 memory replacement - https://phabricator.wikimedia.org/T191225#4117985 (10BBlack) 05Open>03Resolved Re-pooled into service.
[19:01:01] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2017 memory replacement - https://phabricator.wikimedia.org/T191227#4118003 (10Papaul) DIMM A2 replaced DIMM A6 replaced DIMM A8 replaced   Server is back up
[19:30:49] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979#4118113 (10Krinkle) >>! In T137979#2386228, @Krinkle wrote: >>>! At **<http://caniuse.com/#feat=brotli>** in June 2016: >> Global 45%: Firefox 45+, Chrome 50+, Opera 38+, Chrome for And...
[19:39:38] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979#4118200 (10BBlack) The tricky part is this:  Varnish does our compressing (which is in this case the right place to be doing it), and it compresses hittable things on their way into cac...
[19:42:06] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979#4118215 (10BBlack) Re-reading above: probably the better blend of ooptions would be to swap gzip for brotli in Varnish one-for-one (without the whole storing-dual-forms mess) and then h...
[19:58:08] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4118259 (10BBlack)
[19:58:11] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2017 memory replacement - https://phabricator.wikimedia.org/T191227#4118257 (10BBlack) 05Open>03Resolved cp2017 repooled into service
[20:00:40] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979#4118261 (10Gilles) For WebP [[ https://phabricator.wikimedia.org/T27611#4090235 | my proposed strategy ]] is to only offer the variant to popular thumbnails (eg. more than X hits on the...
[20:22:12] <wikibugs>	 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4118412 (10Fjalapeno)