[06:03:53] <wikibugs>	 10Traffic, 10SRE, 10GitLab (Initialization), 10Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (10Sergey.Trofimovsky.SF) >>! In T276144#6887981, @Dzahn wrote: > @Sergey.Trofimovsky.SF Do yo...
[06:42:34] <wikibugs>	 10netops, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10SRE: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (10ArielGlenn) 05Open→03Resolved Never did merge the task but I'm closing it now. Dumpsdata1001 traffic looks good and so does 1003.
[09:23:00] <olliv>	 Hey all - I'm wondering what the best way to schedule the deploys for https://phabricator.wikimedia.org/T274784 would be?  Is it possible to start these today/tomorrow? 
[10:18:35] <vgutierrez>	 hey olliv, bblack mentioned on our weekly meeting that he would handle that
[10:18:46] <vgutierrez>	 so probably in the EU afternoon
[10:19:48] <olliv>	 perfect, thank you! 
[13:51:03] <elukey>	 !log drain + reimage an-worker110[4,5] to Buster
[13:51:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:33] <bblack>	 olliv: patch uploaded here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/669840/
[15:05:58] <bblack>	 olliv: I wasn't in the meeting last time I don't think, but to what kind of tight timing do we need to coordinate the changes on both ends?
[15:19:19] <olliv>	 Last time we set up a meeting and deployed both at the same time.  I think the cache change would need to be during or immediately after we deploy our changes
[15:20:27] <olliv>	 bblack: we can try setting up a meeting at 13:00 your time today or tomorrow, if you're available.  I know it's short notice, so we can also do tomorrow and wednesday for the two deploys
[15:20:32] <bblack>	 olliv: yeah, we have to be confident your mediawiki-side change is fully in effect before the cache change for sure, or else we're not sure we haven't defeated our purpose here.  Formally, we can't really guarantee anything about the "same moment in time", but we can keep it down to a few minutes.
[15:21:13] <bblack>	 olliv: 13:00 works for me today, let's do that if it still works for you!
[15:22:19] <olliv>	 bblack: works for us! 
[15:22:25] <bblack>	 ok, see you then!
[18:31:15] <ryankemper>	 bblack: I've got a question about http host headers wrt dyna.wikimedia.org (context follows)
[18:31:51] <ryankemper>	 So I've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/668173/ to configure the new mapping for `query-preview.wikidata.org` as well as the microsite itself https://gerrit.wikimedia.org/r/c/operations/puppet/+/668543, but have not yet merged the dns entry (https://gerrit.wikimedia.org/r/c/operations/dns/+/668255)
[18:33:16] <ryankemper>	 It'd be nice to be able to do an "end-to-end" sort of test before actually going live with the DNS change, so I tried setting my `/etc/hosts` to map `query-preview.wikidata.org` to `198.35.26.96` which is the IP for `dyna.wikimedia.org`
[18:35:44] <ryankemper>	 However when I go to `query-preview.wikidata.org` in my local browser it ends up dropping me into https://www.wikidata.org/wiki/Wikidata:Main_Page, so my question is whether what I'm trying to do would even work (as in, would the HTTP Host Header still get set the same way it would if the DNS entry were really there, etc)
[18:41:13] <ryankemper>	 Hmm, maybe when I tried it at the end of last week it was before the ATS changes had propagated because going to `query-preview.wikidata.org` now gives me a page titled `Wikimedia Error` with this debug message at the bottom of the page: `Request from 98.171.175.63 via cp4029.ulsfo.wmnet, ATS/8.0.8 Error: 502, connect failed at 2021-03-08 18:34:21 GMT
[18:41:49] <ryankemper>	  so that might actually mean that the mapping is working properly and that the trafficserver is getting 502s when it tries to talk to `wdqs1009.eqiad.wmnet` (which while indicating a problem at least means the mapping itself is working)
[18:42:38] <ryankemper>	 Forgot the trailing backtick 2 messages back, the debug message is `Request from 98.171.175.63 via cp4029.ulsfo.wmnet, ATS/8.0.8 Error: 502, connect failed at 2021-03-08 18:34:21 GMT`
[19:49:55] <wikibugs>	 10Traffic, 10Desktop Improvements, 10SRE, 10Bengali-Sites, and 5 others: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10BBlack) ^ There was a last-minute change of plans, so we made a last-minute call to expend a little bit of...
[19:54:08] <bblack>	 ryankemper: taking a look at it...
[20:03:07] <bblack>	 ryankemper: so, two problems seem to stand out now, when I do my own /etc/hosts testing and then look at it interally from ATS's perspective:
[20:03:37] <bblack>	 1) wdqs1009 is not accepting connections from our caches on port 443, whereas by comparison e.g. wdqs1005 is for the older special mappings...
[20:04:20] <bblack>	 2) webserver-misc-apps for the static part, the SAN list of the TLS cert for that service, doesn't have a match for query-preview.wikidata.org either.
[20:04:54] <bblack>	 I'm going to poke at the latter one first, it's probably just missing some puppetization to get a cert re-issued with a new name added
[20:05:09] <bblack>	 the former, might take some digging, it could even just be a firewall thing, I donno
[20:05:33] <ryankemper>	 ack (and thanks), yeah my hunch for (1) is that there's probably a ferm rule or similar buried in our wdqs puppet code, I'll see if I can find anything
[20:16:02] <bblack>	 yeah for (2), it's a cergen certificate, which is the procedure in https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate
[20:16:31] <bblack>	 which, seems a little bit manual and scary!
[20:17:52] <bblack>	 yeah, quite
[20:18:56] <bblack>	 because I can see that the SERVICENAME that's manually templated into those manual steps, has two different values in the current setup on the filesystem (webserver-misc-apps vs webserver_misc_apps), and I also don't see much about how to coordinate the updates of the public and private parts.  I guess puppet-disable to coordinate the rollout.
[20:19:23] <bblack>	 hmmm
[20:20:21] <bblack>	 I think maybe I can stab at this though
[20:34:34] <bblack>	 ryankemper: ok, I fixed the webserver-misc-apps part now, it seems like it's dumping a better-looking output for the root of https://query-preview.wikidata.org/ with /etc/hosts hacked
[20:35:57] <ryankemper>	 bblack: awesome! the microsite works perfectly, looks exactly like query.wikidata.org as it should
[20:36:23] <bblack>	 ok
[20:36:26] <bblack>	 so the wdqs1009 part
[20:36:26] <ryankemper>	 Obviously we can't actually run queries until I figure out (1), at first glance I haven't found a ferm rule for wdqs1005 so not sure if I'm looking in the wrong place or we use some other mechanism
[20:36:50] <bblack>	 apparently it has a very similar setup to webserver-misc-apps: there's a cergen-based certificte and an envoy tlsproxy on wdqs1005
[20:36:54] <bblack>	 and wdqs1009 lacks all of that
[20:37:20] <ryankemper>	 ah, of course
[20:37:25] <bblack>	 so probably (1) the cergen certificate needs query-preview hostname added, like I just went through with the other one
[20:37:36] <bblack>	 and then (2) we need to puppetize the envoy tlsproxy setup onto 1005 like it is on 1009
[20:37:53] <bblack>	 err, onto 1009 like it is on 1005
[20:39:18] <ryankemper>	 makes sense to me, I've done both cergen and envoy work on other stuff previously so I should have a decent idea of how to proceed
[20:39:43] <bblack>	 ok
[20:40:03] <ryankemper>	 Hmm I seem to remember `hieradata/common/profile/services_proxy/envoy.yaml` being the relevant file here but that one just has definitions for `wdqs-internal` but not the public wdqs that `wdqs1005` is a part of
[20:40:28] <bblack>	 yeah role::wdqs::public has:
[20:40:30] <bblack>	     # Public endpoint specific profiles
[20:40:31] <bblack>	     include ::profile::tlsproxy::envoy # TLS termination
[20:40:36] <bblack>	 which role::wdqs::test lacks
[20:40:44] <bblack>	 and then I'm sure there's some relevant hieradata and such too
[20:41:12] <ryankemper>	 Ah yeah I see it now
[20:41:19] <ryankemper>	 `hieradata/role/common/wdqs/public.yaml:profile::tlsproxy::envoy::global_cert_name: "wdqs.discovery.wmnet"`
[20:41:34] <bblack>	 and then yeah, there's a 'wdqs' cergen certificate which currently has query.wikidata.org as one of its SANs in the cergen yaml file
[20:41:45] <bblack>	 which probably needs query-preview.wikidata.org added to it
[20:42:25] <bblack>	 once you have a 443 listener on wdqs1009 that has query-preview in the cert's SAN set, then at best we might just have a ferm rules issue (or not)
[20:43:28] <ryankemper>	 I'm guessing `profile::tlsproxy::envoy` must setup a ferm rule itself if it has been "just working" on wdqs with envoy and no specific ferm rule (at least that I can find)
[20:43:52] <bblack>	 yeah I'd guess so.  anyways, sounds like you've got this from here I think :) I'm running out for lunch and a couple errands, leave me a note somewhere if there's more to debug!
[20:55:01] <ryankemper>	 will do! thanks for the guidance
[21:20:52] <AndyRussG>	 hi all! quick question: what is the state (e.g., nothing, some initial planning, or working infrastructure) of ESI, or something similar, for our sites?
[21:24:05] <AndyRussG>	 I see https://phabricator.wikimedia.org/T34618 and https://www.mediawiki.org/wiki/Requests_for_comment/Partial_page_caching, no idea if there's been any consideration or work since
[21:43:09] <AndyRussG>	 bblack ^ (sorry for the bother, not super urgent btw)
[21:47:08] <wikibugs>	 10Traffic, 10DNS, 10SRE, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Dzahn) The change above is abandoned because it was about adding a director to the traffic/caching layer and now the VM moved from private to public network. So it is not behind caching anymor...
[21:55:23] <wikibugs>	 10Traffic, 10DNS, 10SRE, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10brennen) > Do you really want "gitlab" for the MVP though, not like gitlab-test.wikimedia.org or gitlab-beta.wikimedia.org ?  Yeah, I think so.  Users (and other consumers of data, like bots)...
[21:58:30] <wikibugs>	 10Traffic, 10DNS, 10SRE, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Sergey.Trofimovsky.SF) >>! In T276170#6894252, @Dzahn wrote:   > Do you really want "gitlab" for the MVP though, not like  gitlab-test.wikimedia.org  or gitlab-beta.wikimedia.org ?  For testin...
[22:09:57] <wikibugs>	 10Traffic, 10DNS, 10SRE, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Dzahn) > not like gitlab-test.wikimedia.org or gitlab-beta.wikimedia.org ? >  > Yeah, I think so.  Users (and other consumers of data, like bots) are going to have git remotes set for long-ter...
[22:13:58] <bblack>	 AndyRussG: we don't have ESI at the edge, no.  There's been some consideration since in various forums over the past couple of years, around the idea of "content composition" in general.
[22:14:11] <bblack>	 it comes up as a pain point in many different ways!
[22:15:01] <AndyRussG>	 bblack ah okok thanks... I imagine there'd be a significant (or maybe huge) amount of work, as well as hardware investment, involved in making it happen?
[22:15:39] <bblack>	 it's probably mostly a software-level problem, although who knows if it leads to scaling needs for hardware, too.
[22:16:14] <AndyRussG>	 hmmm right
[22:16:18] <bblack>	 the software-level problems and solution space cuts across a lot of teams, and runs into a lot of deep legacy issues / complexities that are hard to see from the outset!
[22:16:28] <AndyRussG>	 bblack: have you seen this presentation? https://docs.google.com/presentation/d/1uImB6P3RXg3FJbqkVCJc7lJtlDMHoFhDgCjKi93KuqI/edit?usp=sharing
[22:16:29] <bblack>	 it's been hard to get a solid plan going!
[22:17:04] <AndyRussG>	 there is a danger that the entire site's search rankings could be affected by a thing called the "banner bump"
[22:17:38] <AndyRussG>	 in a nutshell, Google expects to start taking into account page stability in search rankings (though it's not clear how much)
[22:17:55] <bblack>	 yeah
[22:18:14] <AndyRussG>	 currently CentralNotice can't decide what banners to show a user until after JS runs, so banners inject content and cause the page content to jump. This goes for both FR and community banners
[22:18:34] <bblack>	 mostly when we're looking at the content-composition problem, we're looking at much more difficult things than banners.  like article-content vs the left navbar and other user-specific chrome around the edges.
[22:18:48] <AndyRussG>	 it *would* be possible to fix it with ESI and a significant investment in the needed software rewrite
[22:19:09] <bblack>	 fixing banner issues might not need the same whole general-purpose solution as composition
[22:19:39] <AndyRussG>	 bblack hmmmm... well, currently there is quite a lot of code that runs on the client for banner decisions
[22:20:05] <AndyRussG>	 based on data that we don't have on the servers, and also based on data that we could in theory have on the servers but that our setup makes it difficult to get at
[22:20:08] <bblack>	 either way: assuming you had ESI, what would the solution look like? (I ask because it helps me understand what you're getting at)
[22:20:59] <AndyRussG>	 well it would be (1) trimming down the list of possible banners to show the user as much as possible (this part wouldn't necessarily involve ESI, though some caching adjustments might be needed)
[22:22:05] <AndyRussG>	 and (2) putting stuff that we currently store on the client in LocalStorage in a cookie in some compressed format (such as how many banners a user has seen from a given campaign, or whether or not they've seen a large-format FR banner yet)
[22:22:48] <AndyRussG>	 and processing that cookie using ESI so the base HTML already includes the banner as part of the initial response
[22:23:09] <AndyRussG>	 it's definitely a big project but frankly it's the only good solution
[22:24:44] <bblack>	 so (1) I'm not sure that we can "process a cookie using ESI" - ESI is fundamentally not much different from old SSI. and (2) transcluding more dynamic-ish things into the "base html" seems like it's running in the opposite direction of where content composition efforts would like to go (which is to break up the page into more distinct parts instead of havine one whole giant html output that's a 
[22:24:51] <bblack>	 mix of different things)
[22:25:26] <bblack>	 ESI does buy you something in this scenario
[22:25:51] <bblack>	 which is that the same anonymous pageview of the rest of the page could be cached in varnish, independent of what banners were shown.
[22:26:44] <bblack>	 so basically, it's saving some traffic/load between MW<->Varnish for this solution
[22:27:26] <bblack>	 but fundamentally, I don't know that the cookie-encoding of the banner determinants could be handled in the edge itself, is what I mean
[22:28:35] <bblack>	 (so there would still be a need to hit MW)
[22:29:28] <AndyRussG>	 hmmm so it's not clear that there's a way to do pretty simple dynamic content at the edge using something in a cookie as input?
[22:29:43] <bblack>	 I guess it depends on a lot of things
[22:30:31] <bblack>	 I don't really understand the total state-space that determines what banner and whether it's displayed, all of which I guess would be encoded in a special cookie ... in a determinstic way?
[22:30:59] <bblack>	 still, it's a multiplication of all the page outputs in terms of being able to cache the final version at the edge, at least.
[22:31:37] <bblack>	 also, I thought some of the banner stuff relied on client-side randomness too?
[22:31:39] <AndyRussG>	 well only deterministic if you consider a huge number of possible input states
[22:31:54] <AndyRussG>	 yep aaahhh that's true too
[22:32:13] <AndyRussG>	 though the random aspect is not super essential and could be done away with
[22:32:41] <bblack>	 huge possible inputs states + possible RNG sounds like for every given /wiki/Barack_Obama we might have a whole lot of copies of the base html output to cache with different banner-things at the top and otherwise mostly-identical
[22:32:57] <AndyRussG>	 yep
[22:33:19] <bblack>	 and again, this seems to go against the grain of what we're trying to aim at for all other purposes
[22:33:41] <AndyRussG>	 basically the inputs from the server could be a list of possible banners that could be displayed based on a user's geolocation, device, language, logged-in status, the wiki they're on, and their preferences
[22:33:42] <bblack>	 (to stop aggregating all the things into a giant html output, and instead break up the logically-different parts into different request URIs)
[22:35:39] <AndyRussG>	 and that input would be combined with the data from the client (i.e. in the cookie) about how many banners this particular user has seen from different campaigns recently, as well as whether they've seen a large FR banner, whether they've donated in the last year, or what banners they have closed in the past week
[22:36:36] <AndyRussG>	 hmmm I have no idea how that goal you mentioned would factor into this, or into Google's new metrics
[22:36:54] <AndyRussG>	 couldn't that lead to more page bumps, so also work against us for this metric?
[22:37:21] <bblack>	 no, it shouldn't, but that's really up to fancy UI / serviceworker / JS stuff
[22:37:44] <bblack>	 basically what banners are currently doing is more like what we wish the rest of the page did
[22:37:49] <bblack>	 other than the bumping around part :)
[22:37:56] <AndyRussG>	 ah I see
[22:38:37] <AndyRussG>	 so it could be pushing this stuff to the client, and service workers would be doing stuff in for our sites the background in the browser *ahead* of the actual page request, something like that?
[22:38:39] <bblack>	 but really, any truly dynamic (on the client) decision to include a banner or not seems bumpy by nature
[22:39:18] <AndyRussG>	 I remember work on service workers but I never learned enough about the tech to get into it  much
[22:39:30] <bblack>	 and any server-side (whether the server is MW or Varnish is irrelevant, it just pushes the problem around to make it someone else's problem) leads to performance and cache-explosion woes, etc
[22:40:38] <bblack>	 I guess, if you did have a service worker (which won't be true for all pageviews btw.  I think esp for banners we have to consider the one-shot visitors who don't have a preloaded SW), then yes, you could pre-make a decision to preload a banner and display it and the page together in a single render.
[22:41:00] <wikibugs>	 10Traffic, 10DNS, 10SRE, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10brennen) Yeah, good question - sorry I conflated machine-specific with a `-test` / `-beta` hostname in my response.  I //think// `gitlab.wikimedia.org` is good, on my current understanding tha...
[22:41:08] <bblack>	 but that seems like a long way around for just this problem, too, and again doesn't cover all cases well
[22:42:13] <bblack>	 have you considered a solution where banners simply don't bump the content layout at all to begin with? client-side rendering is not my forte, so maybe that's not a reasonable solution either, but
[22:42:34] <bblack>	 but my naive idea there, is why not make the banner overlay and not-bump, like many of the "This site uses cookies..." overlays
[22:43:10] <AndyRussG>	 yeah overlay banners have been considered, as has reserving space for a banner ahead of time
[22:43:43] <AndyRussG>	 both of those run into another SEO metric (in the same presentation) which is speed of rendering of the largest page element
[22:43:44] <bblack>	 yeah, reserve could work too, especially if we had another use for the space
[22:44:45] <AndyRussG>	 they both also run up against the fact that there is a tried-and-tested FR banner flow and it doesn't feel like the right time to try something radical
[22:44:50] <bblack>	 I donno, like some generic fill-in for when no true banner is needed, like a selection of "quote of the day" things, or some interesting pointers like the Main_page "in the news" or "featured article" stuff as a generic non-banner use of the space
[22:45:30] <AndyRussG>	 and the reserve-space option would have to have some design and community-supported option, so also really a huge project
[22:46:43] <AndyRussG>	 so ESI isn't actually fully dynamic injection of content at the outer cache layer?
[22:47:32] <bblack>	 it can be, but then that would defeat some of the reasons the outer cache layer exists
[22:48:12] <AndyRussG>	 the most simplified conceivable logic for dynamic banner decisions wouldn't be extremely complex, would involve 0 DB queries, many orders of magnitude simpler than firing up an entire PHP MW response
[22:48:13] <bblack>	 like, as a trivial exploratory idea, let's say we did it this way:
[22:49:14] <bblack>	 1) All MW page outputs now include some ESI tag in the banner spot, and this is what gets cached in varnish as the response to /wiki/Foo (what's there today, plus the ESI reference near the top).
[22:50:01] <bblack>	 2) Tne ESI reference transcludes some other thing from mediawiki like /foo/bar/showbanner-based-on-cookies.php
[22:50:35] <bblack>	 and in the Varnish world, we turn on ESI support so that it can compose the ESI there by making the extra request, using the primary request's cookie, and then serve the combined result to the user.
[22:51:02] <AndyRussG>	 yep
[22:51:25] <bblack>	 this naive bad solution doesn't work because all of what were "cached" pageviews with zero MW load, now all involve a call into the mediawiki appservers for that showbanner-based-on-cookies.php
[22:51:40] <bblack>	 (and for a few other reasons, but we can get to those after we fix this, iteratively)
[22:51:54] <AndyRussG>	 right so let's say we can make a solution that doesn't go to the appservers
[22:52:08] <AndyRussG>	 and definitely doesn't run Mediawiki
[22:52:36] <bblack>	 so we say: ok, maybe there are only 442 unique possible deterministic cookie values.  So we also enable caching of the response to showbanner-based-on-cookies.php, varying on that determinstic cookie.
[22:53:15] <bblack>	 so now we have a solution that doesn't always involve a roundtrip to mediawiki, but does create 442 new cached objects per-wiki or whatever.
[22:53:35] <AndyRussG>	 right that's a huge cache explosion, so doesn't sound great
[22:53:54] <bblack>	 well, it's not huge yet.  it's just 442 new and probably-small objects, per-wiki (not per-article)
[22:54:59] <bblack>	 but the issue we now face is that our edge performance doesn't scale well, because basically-all pageviews are now ESI-rendered from two cache objects, and have to be combined on the fly as we output them.
[22:55:37] <bblack>	 so yes, then we start saying we should cache the combination of them at the outermost cache, which leads to the explostion of large cache objects (442 slightly-different copies of /wiki/Barack_Obama, and every other large output)
[22:55:57] <AndyRussG>	 hmmm
[22:56:30] <bblack>	 I mean, we can push the problem around a number of different ways, but it doesn't get any magically easier by moving it to a different layer
[22:56:52] <bblack>	 (you could also have the edge ignore all of thise and split mediawiki parsercache x442 for this and handle it there, or something equivalent)
[22:56:55] <AndyRussG>	 so, I guess this doesn't address the last point, but what about instead of something that goes to the app servers, some like simple logic in some language that gets the list of possibly available banners for this user (such lists are currently cached as RL modules) , looks at the cookie, and crosses banner 3 of the possible banners off the list, and injects the highest-priority not-crossed-off-one
[22:57:46] <AndyRussG>	 as regards caching the full, combined result of stuff at the outermost layer, isn't avoiding that what ESI is meant to avoid?
[22:58:05] <AndyRussG>	 btw thanks for taking the time to look at this eh!!
[22:58:33] <bblack>	 to your first point: probably the only way we get to that, efficiently, is for both the ESI-combining itself and the cookie-parsing/banner-selecting logic to happen inside the edge software (which in practice means in our varnish-fe VCL code)
[22:59:00] <AndyRussG>	 ok
[22:59:16] <bblack>	 and to also hope that both are efficient enough that it doesn't drastically alter the scalability of our edge caches (which might not be a good thing to bet on, even if we were ok with all the new code complexity and layer-entanglements going on there)
[22:59:49] <AndyRussG>	 ok hmm
[23:00:11] <bblack>	 re: what ESI is meant to avoid, I think it's an open question exactly what real-world problems ESI fixes and for whom
[23:00:25] <AndyRussG>	 I mean, I can imagine that as being difficult to code, and also a risky bet as regards efficiency, but not unthinkable?
[23:00:47] <bblack>	 it can still, in some environments, be a win to use ESI to combine 1 heavy page composition with some other lightweight things from the backend.
[23:01:39] <bblack>	 AndyRussG: it depends on the specifics.  But our design aim is not to embed more business logic in VCL, for sure.  At least not actual code if we can help it.
[23:02:19] <bblack>	 if we had a better abstraction for running edge-side code efficiently, we might look at this differently, but we don't
[23:03:06] <bblack>	 for one simple static case, I could probably code up a VCL solution
[23:03:51] <bblack>	 you give me the 5 factors about the client that determine the banner, and how those encode determinstically into 5 different cookie values, and the list of 23 possible banners, and the logic of how we choose the banner based on the cookie and probably some local RNG input
[23:04:23] <bblack>	 and then I write some complex VCL and it "works" (well + ESI support in varnish efficiently as well, we're assuming, for dynamic output-time combination of cached URIs)
[23:05:02] <bblack>	 but if we start talking about a lot of factors, and a lot of different banner URIs over time, and logic that keeps evolving, etc... it starts looking hard for us to maintain that out in the world of VCL, which is the only true edge-side-code we have today.
[23:05:40] <AndyRussG>	 hmmm right
[23:06:31] <bblack>	 last time we did try using varnish's ESI, also, it was basically-broken and crashy
[23:06:41] <AndyRussG>	 I guess we're currently all Varnish at the outer edge?
[23:06:46] <bblack>	 but that was many versions ago and we never tried again.  for all I know it works amazingly and efficiently now.
[23:07:06] <bblack>	 but either way, just exploring ESI support and offering it as part of our edge service is a project in itself
[23:07:32] <bblack>	 our edge caches currently are done in two layers: varnish at the user-facing side, and ATS at the applayer-facing side.
[23:07:41] <AndyRussG>	 Is there any big-investment-requiring infrastructure-shifting other option out there that you could think of?
[23:07:58] <bblack>	 but if we did the ESI composition in the backend ATS layer, then varnish-frontend would have to cache all the exploded outputs for sure.
[23:08:16] <AndyRussG>	 yeah that last option doesn't sound like a huge win
[23:09:03] <bblack>	 also, not very long ago, we were on a path to get rid of our varnish-frontend layer, and the plan wasn't working out so well, so we stopped and decided to accept varnish for now in that role.
[23:09:12] <bblack>	 but that could change down the road
[23:09:24] <AndyRussG>	 ahh yeah that's why I was confused it's still all Varnish
[23:09:37] <bblack>	 we replaced some of our varnish (the backend layer), but not all
[23:10:15] <AndyRussG>	 I guess this is problem that's way beyond the scope of any one team
[23:10:26] <bblack>	 the constraints in play here seem to be:
[23:10:59] <bblack>	 1) We rely on a lot on being able to cache a singular html blob output per-article for the bulk anonymous pageviews, to get scalability.
[23:11:17] <bblack>	 2) Pre-encoding a banner output in them, no matter where/how you do that, seems to create some challenges for that.
[23:11:26] <bblack>	 3) Not-pre-encoding them seems to create layout-bump issues
[23:12:08] <bblack>	 there are probably some answers to those challenges, but not easy ones.  might be a whole lot of hardware to make up for some scalability loss, or a whole lot of software-level investment somewhere.
[23:13:11] <bblack>	 (well or you could not-pre-encode and also avoid layout-bump by instead sacrificing paint/render delay for the main part, but that's not good either)
[23:15:07] <bblack>	 at a more meta-level, the problem is that even exploring the option space of what we could try to do here is kind of a heavy lift from where we're at, and the team doesn't have capacity to attack that in the short term for sure, maybe doesn't have the capacity to prioritize it in the medium term either.
[23:15:28] <AndyRussG>	 yep the paint delay is the other undesirable but theoretically possible tradeoff
[23:16:27] <AndyRussG>	 right understood
[23:17:16] <AndyRussG>	 I just feel like some investment, possibly a big one, needs to be made somewhere ASAP
[23:17:35] <bblack>	 at the same time, I don't want to say "well but it's just banners" either.  because banners make our paycheck, among other things!
[23:17:47] <AndyRussG>	 yeah
[23:18:01] <wikibugs>	 10Traffic, 10DNS, 10SRE, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Sergey.Trofimovsky.SF) >>! In T276170#6894358, @brennen wrote: > Yeah, good question - sorry I conflated machine-specific with a `-test` / `-beta` hostname in my response. >  > I //think// `gi...
[23:18:15] <AndyRussG>	 I mean, what if there's a reserved space all the time, and when FR banners start to go there, people have already been conditioned to ignore it?
[23:19:00] <bblack>	 yeah, but it just seems a shame to waste valuable pixels, esp on mobile screens, that could've been showing content.
[23:19:15] <AndyRussG>	 yep also
[23:19:32] <bblack>	 we can fill the pixels with something else light/optional and slightly useful
[23:19:57] <bblack>	 but if it's useless-enough that we can replace it with a banner whenever we feel like it, was it useful enough to have it wasting content real estate the rest of the time? :)
[23:20:24] <AndyRussG>	 yeah probably goes against the grain of what design folks have been working on
[23:20:57] <bblack>	 remind me, what was the argument against an overlay that doesn't shift the content?
[23:21:40] <AndyRussG>	 that another metric that can bring down ranking is how quickly the largest visible element renders
[23:22:08] <AndyRussG>	 currently the large mobile FR banners take up all or maybe more than an entire mobile screen
[23:22:15] <bblack>	 oh, and the presence of the overlay changes the definition of "largest visible"?
[23:22:24] <bblack>	 hmmm ok
[23:22:28] <AndyRussG>	 I think so, but I guess I might have misunderstood
[23:22:52] <AndyRussG>	 either way, it's also a significant shift in banner layout strategy
[23:22:59] <bblack>	 yes, for sure
[23:23:47] <AndyRussG>	 I don't know enough about service workers to know how much of a solution that could be
[23:23:53] <bblack>	 what's the time horizon, realistically? by next FR season?
[23:24:18] <AndyRussG>	 in theory the new ranking metrics could start being applied by Google in May
[23:24:20] <bblack>	 well the problem with SW is they really only help with the repeat visitors, the heavy users/editors or whatever.  For a lot of pageviews, the SW won't be there.
[23:24:35] <bblack>	 (at least, that's my understanding)
[23:25:14] <AndyRussG>	 though the actual hit in terms of rankings could come slowly, as FR and community campaigns roll out in different locations
[23:26:07] <AndyRussG>	 also a risk that the search ranking hit could be in less-visible segments (small languages, countries with higher ratio of mobile users, poor internet speeds, etc.)
[23:26:43] <bblack>	 right
[23:27:15] <AndyRussG>	 with such a time scale, I just feel like it's a potential emergency that folks aren't treating as such
[23:27:18] <bblack>	 so, if I avoid ideas that are not even remotely realistic for #traffic to attack in that kind of timeframe, all constraints considered.
[23:28:42] <bblack>	 I think our best bet (and it would still be a bit of a moonshot) would be to (1) Explore the current state of Varnish ESI support, and assume the answer is "yes it works and it performs brilliantly enough that we can use it to compose dynamic outputs for most pageviews that otherwise would've been singular cache hits before"
[23:28:54] <bblack>	 and get it deployed and working quickly
[23:29:31] <AndyRussG>	 K wow
[23:29:37] <bblack>	 + 2 leave the cookie-decoding-banner-decider off in mediawiki (well, the applayer in general, it doesn't matter much where in the applayer), rather than in our edge VCL.
[23:29:50] <bblack>	 so we're in one of those intermediate scenarios from above:
[23:30:16] <bblack>	 we get 442 new unique outputs to cache per wiki or something (basically every unique kind of banner response for that wiki, in isolation.  the state-space of the deciding factors).
[23:30:48] <bblack>	 but the html cache of the wiki outputs stays about the same, but we're injecting one of these cacheable 442 responses into each anonymous pageview with ESI.
[23:32:13] <bblack>	 from the cache-explosion perspective this solution is fine.  it requires applayer work (write some php to do the banner-deciding stuff based on the determinstic-ish cookies).  it requires us figuring out ESI, and it requires that the result of the exploration of ESI is that it perform well for dynamic outputs all the time.
[23:32:40] <bblack>	 and a little bit of generic VCL work to use cookies correctly for this scenario, differently from all the other ways we look at cookies, I guess.
[23:32:59] <AndyRussG>	 K also wow
[23:33:14] <bblack>	 so yeah it's possible
[23:33:36] <bblack>	 but we'd have to prioritize it pretty hard, and it's possible that once we've explored this enough we find out that it just can't work, realistically at scale.
[23:34:37] <bblack>	 it's also kind of a big architectural shift, to start supporting ESI in general.
[23:34:59] <bblack>	 once we do it, other uses of the capability will materialize, and we may find ourselves locked into supporting it forever, for better or worse.
[23:35:29] <bblack>	 (even better: we might find ourselves locked into supporting exactly "Varnish ESI", because even other vendors with ESI support might support a different subset of features)
[23:36:25] <bblack>	 so it's, kinda hard to just make a spot decision on that and be comfortable with the long term implications
[23:36:50] <AndyRussG>	 absolutely all of that makes sense
[23:37:49] <bblack>	 but if it's important, we can try to have some higher-level conversations I think, about whether we should prioritize appropriately and try our best to find a solution.
[23:38:10] <bblack>	 I mean, for all I know just in typing about it here for a while, we've only scratched the surface of possible ideas, and maybe there are better and simpler ones!
[23:38:21] <AndyRussG>	 yeah also good point
[23:38:48] <bblack>	 we might get some better ideas if we have a few people stew on it and discuss it more
[23:39:11] <AndyRussG>	 yep 
[23:41:28] <AndyRussG>	 I mean I feel it's very much not my place to suggest which teams should work on what.... mainly just wanted to get a sense of what is technically possible from a Traffic perspective, so that I can try to make a relevant suggestion in the discussion that is happening mainly in the Advacnement sphere
[23:41:51] <bblack>	 yeah I hear you
[23:42:46] <bblack>	 if we had an easy solution that we were comfortable with, to move this problem over and make it our problem trivially, that would be ideal, but we're not there right now.
[23:43:39] <AndyRussG>	 yeah, trivial is the opposite of this from several perspectives
[23:43:49] <AndyRussG>	 FWIW FR-Tech also has the expectation of offloading CentralNotice to core platform team (?) in the medium term
[23:44:17] <bblack>	 that sounds like a sensible move, really.
[23:44:35] <AndyRussG>	 yeah, in fact the majority of banners are not FR-related
[23:45:07] <AndyRussG>	 and there has indeed been quite a lot of maintenance work, actually maybe the majority of it, coming from other teams recently
[23:48:18] <AndyRussG>	 but this also feels like a potentially grave enough problem that we shouldn't just shrug it off as something "someone else" has to figure out
[23:49:04] <AndyRussG>	 hugely appreciate all your input and thoughts here!!!! 
[23:49:47] <bblack>	 yeah
[23:50:09] <AndyRussG>	 :)
[23:51:17] <bblack>	 anyways, I can try to bubble some of this up in various places this week, too, see what others around Tech think a bit
[23:51:29] <AndyRussG>	 yeah that'd be fantastic thanks so much
[23:51:33] <bblack>	 np
[23:51:54] <AndyRussG>	 I'll try summarize to fr-tech/Advancement the stuff you've said 
[23:54:07] <AndyRussG>	 I can also let u know how it goes.... and pls don't hesitate to ping anytime if u think of more you'd want to add :)
[23:56:40] <bblack>	 ok :)
[23:57:17] <AndyRussG>	 bblack k, gonna break for now for a bit but I'm still around this evening, so so much once again!!!!