[06:24:52] <_joe_>	 bblack: cool! I intend working on that this morning
[11:42:03] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3060800 (10ema)
[13:13:08] <bblack>	 in the long run with the DNS discovery stuff, we're going to have to do some side-work on the dns-repo jenkins linter I think
[13:13:37] <bblack>	 what I've done for now for the initial structural change is have it just validate the zones + geo config.  That gives us the basic structure change without really changing linting much.
[13:14:15] <bblack>	 but once the zonefiles reference the discovery stuff, they'll fail validation unless the configuration used by the linter has those resources names available too, and currently they're only from confd-driven templates
[13:15:29] <bblack>	 from the puppet perspective, we could template-generate static testing variants of the confd-driven stuff.  it doesn't have to have correct IPs or behavior, just the right list of resource names in the config (which validates that zonefiles referencing them aren't referencing typos or missing resources, etc)
[13:15:57] <bblack>	 but I'm not sure if that mucks with the pipeline to the CI host that runs the linting, if it depends on puppet templated outputs now too (before, it didn't in practice)
[13:16:38] <bblack>	 _joe_: ^ FYI
[13:22:19] <_joe_>	 bblack: I'll read this later, I'm going to lunch, sorry
[13:22:47] <paravoid>	 are you guys thinking of using our main DNSes for this?
[13:23:12] <paravoid>	 I was thinking that since this is a separate sub-zone anyway it might make sense to split it off to a separate cluster
[13:24:20] <paravoid>	 the thinking behind this idea is that our main NSes are a pretty big SPOF, so the more we keep them simple & stupid the less chances there are something will go wrong
[13:25:28] <paravoid>	 incl. the multifo plugin (or whatever we're using) but also the whole etcd/confd dependency
[13:26:06] <paravoid>	 including from our edge pops which are a bit agnostic to this whole dynamic discovery thing
[13:44:18] <bblack>	 either way DNS-traffic-wise it's going through the recursors for basic caching
[13:44:28] <bblack>	 the plugins are all old / well-tested
[13:45:02] <bblack>	 and confd isn't templating config files on the fly.  it's templating the statefile that says "ResourceFoo => DOWN" (like admin_state, but a different file meant for machine input)
[13:45:37] <bblack>	 but puppet's templating the config that causes it to consume those, that's the lint annoyance
[13:47:18] <bblack>	 I'm not sure about the whole long term thing on whether the edge pops are involved
[13:47:25] <bblack>	 at present, no
[13:47:28] <paravoid>	 in what sense?
[13:48:08] <bblack>	 well, eventually when we have TLS-to-app it will no longer be critical that we route eqiad apps from caches in eqiad, etc
[13:48:25] <bblack>	 at which point it makes sense for pass-traffic to directly to the geo-closest active/active endpoint that's up
[13:48:43] <bblack>	 and the easiest way to do that is to send it to the dns discovery hostname just like non-cache traffic
[13:49:34] <bblack>	 thinking through those possible future scenarios raises some questions (that are probably edge-case relevant today too)
[13:49:49] <bblack>	 about how we deal with split-brain re: DCs and the etcd cluster
[13:49:50] <paravoid>	 that's different than the (anycast even) auth NS for service IPs all being able to track the status of dynamic hostnames
[13:50:15] <paravoid>	 but we're pretty far from all of that anyway
[13:50:25] <bblack>	 it's not really different
[13:50:47] <bblack>	 cp3030 is still going to query the esams dns recursor, which then queries authdns (locally, if they're the same)
[13:51:02] <bblack>	 (or if they're deployed everywhere)
[13:51:03] <paravoid>	 it's not necessarily local now
[13:51:16] <bblack>	 but it should be
[13:51:22] <bblack>	 would be, if we finished anycast plans
[13:51:39] <paravoid>	 it should be only if we have per-site redundancy, which we don't :)
[13:51:49] <paravoid>	 that's what i meant by "pretty far from all that"
[13:51:50] <bblack>	 ?
[13:52:00] <paravoid>	 multiple authNSes per site
[13:52:04] <paravoid>	 and recursors for that matte
[13:52:05] <paravoid>	 r
[13:52:08] <bblack>	 yes, we should
[13:52:38] <bblack>	 the only thing between us and that is puppet commits (the time to contruct them and test them and deploy them), really
[13:52:50] <paravoid>	 all of these are dependent on a) anycast recursors (probably), b) anycast authNS, c) TLS-to-app layer
[13:52:56] <paravoid>	 that's what, a year off? :)
[13:53:51] <bblack>	 doesn't require a (although it's nice to have), doesn't require (b) publically, only privately
[13:53:57] <bblack>	 (c) is the major bottleneck
[13:54:15] <paravoid>	 what do you mean by 'publically'?
[13:54:37] <paravoid>	 oh you mean just authNS behing LVS?
[13:54:39] <bblack>	 we can do anycast authdns in private space, too, especially for the hardcoded reference for wmnet from the recursors
[13:54:49] <paravoid>	 yeah, sure
[13:54:51] <bblack>	 if we're stalled on the public part still
[13:55:32] <bblack>	 either way, I just don't see a solid reason to be less than fully general here
[13:56:00] <paravoid>	 how do you mean?
[13:56:26] <bblack>	 we have a network of authdns servers, they're already using the same code for geoip / admin_state, they're already planned to eventually be at all sites and anycasted, etc
[13:56:51] <bblack>	 I just don't evaluate the same balance of "fear of confd-templated input file" vs "deploy a separate, limited authdns cluster and delegate to it"
[13:59:42] <_joe_>	 I took a look at our puppet classes, we'd need to do a substantial refactoring in order for the latter option to be available
[14:02:14] <paravoid>	 well you wouldn't base this off the authdns module
[14:02:15] <bblack>	 what I do worry about, though (probably mostly out of ignorance of where we're at and going) is how we deal with etcd data across 5x sites
[14:02:43] <paravoid>	 you presumably wouldn't need all these templating for a dynamic zone
[14:02:59] <bblack>	 what happens in a 3/2 split of all sites? does the losing side have data at all? stale data? can we update the data on the losing side manually to cope operationally? etc
[14:03:32] <paravoid>	 in your current plan would we be able to *add* new dynamic hostnames with etcd/without any git commits?
[14:03:47] <bblack>	 paravoid: no
[14:04:03] <bblack>	 paravoid: etcd just provides pooled=yes/no state on a per-DC level for hard-configured stuff
[14:04:30] <paravoid>	 right
[14:05:42] <bblack>	 the basic way things plug together is like:
[14:05:56] <paravoid>	 this reminds me of the pybal etcd integration too, I was petitioning for dynamically creating service IPs using etcd back then too :)
[14:06:12] <paravoid>	 I think this is the only way we can actually automate some parts of e.g. service creation
[14:06:19] <bblack>	 wmnet zone has "restbase.discovery 300 in A metafo!restbase"
[14:06:37] <_joe_>	 paravoid: I agree, and it's pretty easy to implement too
[14:06:59] <paravoid>	 is it? wouldn't you need to modify the wmnet zone?
[14:07:00] <_joe_>	 (creating service IPs from etcd
[14:07:10] <bblack>	 puppet templates (from hieradata) the configuration of that, which includes "eqiad => 1.2.3.4, codfw => 4.3.2.1", and references a health-checker plugin statically, which is configured to watch some file like /var/whatever/discovery-restbase
[14:07:20] <_joe_>	 paravoid: oh, for the DNS side, yes
[14:07:32] <paravoid>	 you can't create a discovery.wmnet zone on the authNSes since they serve wmnet already
[14:07:35] <bblack>	 and then confd writes the var-file above with "restbase/eqiad => UP, restbase/codfw => DOWN"
[14:07:58] <paravoid>	 you'd have to resort to including it via jinja and rerunning the generator
[14:08:13] <bblack>	 (and it uses our geoip map with the nets stuff to service active/active from closest if both up)
[14:08:26] <_joe_>	 yes, I was just thinking of the pybal side
[14:08:44] <bblack>	 paravoid: that's pretty easy to fix though: don't use wmnet.
[14:08:53] <bblack>	 use wmdiscovery. or whatever
[14:09:13] <paravoid>	 ugh
[14:09:16] <paravoid>	 yeah I suppose so
[14:09:28] <paravoid>	 and teach the authdns machinery to not clean it up
[14:09:40] <bblack>	 ?
[14:10:08] <bblack>	 oh, right
[14:10:13] <paravoid>	 authdns-gen-zones will remove zones that are not in templates/
[14:10:32] <bblack>	 or just make it a template include in wmnet
[14:10:49] <bblack>	 but then authdns-gen-zones has to be involved in confd updates
[14:10:51] <paravoid>	 and then trigger template regeneration from confd?
[14:10:56] <paravoid>	 yeah that's scary
[14:11:36] <paravoid>	 my fear is sourced at the fact that messing up DNS is one of the few things that can bring us completely down
[14:11:51] <paravoid>	 an depending on the fuckup, it could be for hours
[14:12:12] <bblack>	 "messing up DNS" how, that's a new risk?
[14:12:20] <paravoid>	 it's already a set of fate-shared DNS servers in many ways which is scary on its own
[14:12:35] <bblack>	 you can have fate-sharing or automation, take your pick :P
[14:12:43] <_joe_>	 eheh
[14:12:57] <bblack>	 lack-of-fate-sharing, I mean above
[14:13:18] <paravoid>	 which is being alleviated by a) running super stable/trusted software b) not doing too many crazy things with them
[14:13:19] <bblack>	 bad dns commit -> bad dns servers (already a risk, and how do we solve that without race conditions on normal updates?)
[14:13:40] <bblack>	 the risk is in our change commits, not in the software, IMHO
[14:14:03] <paravoid>	 not too long ago a confctl command restarted all of our varnishes
[14:14:14] <paravoid>	 it was a series of things that went wrong, but still
[14:14:24] <bblack>	 and not a realistic scenario here
[14:14:58] <paravoid>	 the exact same thing? no, certainly not
[14:15:07] <bblack>	 confd for varnish controls the actual configuration of available backend hosts in VCL, and thus it was possible for confctl to undefine them all, which crashed varnishd (vslp bug)
[14:15:22] <bblack>	 the key difference here is that confctl/confd doesn't control configuration for gdnsd
[14:15:30] <bblack>	 it controls a statefile input for up/down states
[14:15:40] <bblack>	 it can't delete a resource or a zone or a hostname
[14:16:09] <bblack>	 at worst, it can kill what it controls: it can mark all the discovery hostnames dead (which config should handle, but still)
[14:16:31] <bblack>	 and that same risk exists in a separate authdns cluster or in the main one
[14:17:22] <_joe_>	 bbiab
[14:17:48] <paravoid>	 uhm, ok
[14:18:07] <paravoid>	 I don't feel so strongly to fight for this, nor do I know all the details of your plan yet
[14:18:30] <bblack>	 well, I've outlined what's important here above, and it's in a WIP commit that's mostly-complete
[14:18:48] <bblack>	 to recap:
[14:19:13] <bblack>	 the zonefiles remain static in the dns repo, with new entries like "restbase.discovery 300 IN A metafo!restbase-discovery"
[14:19:19] <paravoid>	 yeah got it
[14:19:40] <bblack>	 there's puppet-generated config (from hieradata) which defines the dc=>ip for those
[14:19:53] <bblack>	 there's a statefile in var that confd emits which flags the IPs up/down
[14:20:30] <paravoid>	 what happens if you put "geoip/text*/esams => DOWN" in that state file? :)
[14:20:37] <bblack>	 nothing
[14:20:40] <mark>	 we just need to integrate our dns servers with ORES to make decisions
[14:20:40] <paravoid>	 ok :)
[14:21:15] <bblack>	 because this file is referenced by an explicit service_type that's only attached (in config) to discovery resources
[14:21:27] <bblack>	 not to our global public geoip stuff that we do admin_state on
[14:21:31] <paravoid>	 ok ok
[14:21:36] <paravoid>	 just checking and thinking of things that could go wron g:)
[14:21:42] <bblack>	 but the reverse does work
[14:21:54] <bblack>	 we can edit admin_state to override decisions from etcd->confd about discovery stuff
[14:26:55] <_joe_>	 bblack: I'm not sure about where to put 503-as-a-service though :P
[14:27:15] <bblack>	 yeah
[14:27:31] <bblack>	 k8s? ganeti?
[14:28:18] <bblack>	 there's 2x remaining issues in the commit (which could be worked out after initial proof of concept) related to that:
[14:28:50] <bblack>	 1. For active/active, do we still want all-down => 503-as-a-service? or all-down => assume human error and treat like all up
[14:29:06] <_joe_>	 I would say the latter
[14:29:13] <_joe_>	 I was thinking about that yesterday
[14:29:16] <bblack>	 (since the purpose of 503-as-a-service was for active/passive where overlap is not allowed, so we can configure all-down=>503 temporarily between the move from one side to the other)
[14:29:41] <_joe_>	 bblack: it's still something that no service supports, btw
[14:29:58] <bblack>	 2. for the active/passive scenario, ideally we should have something in the confd templating that disallows "both sides up" (returns some kind of error at template generation time or whatever)
[14:30:02] <_joe_>	 no service has a distinction between mw-readonly and mw-readwrite for now
[14:30:21] <_joe_>	 2. I can work on
[14:30:25] <bblack>	 yeah but they're going to need one
[14:30:58] <_joe_>	 bblack: agreed, for now they'll just find a RO database
[14:31:00] <bblack>	 unless we scrap active/active readonly as a temporary point in this plan and just move straight from "whole active/passive" to "whole active/active"
[14:31:43] <bblack>	 we've spent a long time waiting for all the bits to come together for sticky-cookie RO/RW split stuff, seems like in that time we could've been working with better long-term efficieny heading straight for the active/active goal
[14:32:28] <_joe_>	 sorry, I'm not following the latter part
[14:32:50] <_joe_>	 isn't the sticky-cookie RO/RW stuff the way to make mediawiki active-active?
[14:32:55] <paravoid>	 broader comment: I think we should realign ourselves to whatever the outcome of the annual plan conversation is
[14:33:07] <_joe_>	 paravoid: ack
[14:33:14] <bblack>	 _joe_: no, it's a hacky way to make mediawiki half-ass active/active at a huge complexity cost for MW itself and for consumers of MW
[14:33:29] <paravoid>	 right now the multiDC goal is shaky and may not be part of the technology mandate for next year
[14:33:30] <_joe_>	 bblack: and what is the better solution then?
[14:33:30] <bblack>	 as a partial step towards an eventual better plan to make MW actually-active/active for all traffic
[14:33:56] <_joe_>	 is that even a plan? I'm not sure
[14:34:06] <paravoid>	 if that's the case, then we should stop hoping that it will happen at some point and change our architecture
[14:34:10] <bblack>	 it should be
[14:34:22] <paravoid>	 what should be?
[14:34:28] <paravoid>	 sorry not sure who you're responding to :P
[14:34:36] <bblack>	 if we're doing multi-dc, it should be in the plan to get all services to active/active
[14:34:46] <mark>	 define 'multi DC'
[14:35:19] <bblack>	 who and in what sense?
[14:36:02] <paravoid>	 there's an annual plan proposal for -AIUI- MW multiDC
[14:36:14] <mark>	 multi dc can mean anything from "having a single backup dc to failover to" to "running >= 3 data centers with all services active everywhere
[14:36:32] <bblack>	 paravoid: are you referring to MW multi-dc when you say "shaky"?
[14:36:37] <paravoid>	 I was, yes
[14:36:41] <bblack>	 ok
[14:37:07] <paravoid>	 shaky in the sense that the MW teams don't seem to be making any promises about being able to resource it
[14:37:25] <_joe_>	 bblack: anyways, I'm looking at the puppet change at the moment, I have a couple of comments I think. Are you going to work on it, or should I take over? I'd prefer not, given I already have my hands full with other issues
[14:37:33] <bblack>	 maybe stepping back a bit, I can give some internal perspective of mine on "multi-dc" (in the general sense)
[14:37:40] <paravoid>	 so far -- as I said yesterday, I'd like to at least have clarity on whether it's happening or not by the end of the annual plan
[14:38:09] <bblack>	 the main goal of multi-dc is site-failure resiliency within reasonable time bounds
[14:38:37] <paravoid>	 careful there, you might end up owning a line on our annual plan :P
[14:38:37] <_joe_>	 one of which is upgrading confd, which will allow us to write less lame templates (still go text/template, but it has way more features)
[14:38:43] <bblack>	 it's hugely complex to be multi-dc, and any plan that isn't routinely tested is going to cause chaos when you need to exercise it
[14:39:22] <bblack>	 so as a part of basic site-failover, you need to at least be able to test basic site-failover routinely (once a quarter, no big deal)
[14:39:57] <bblack>	 and if you want to test that without taking service outages or data integrity risks due to async config deploy to N services and meta-services...
[14:40:33] <bblack>	 routinely testing a hard active/passive setup (where the services aren't multi-dc aware (active/active) and don't cope well without ops controlling the overlaps manually) becomes problematic
[14:40:46] <paravoid>	 heh
[14:41:11] <paravoid>	 I was saying this (much less eloquently) to mark last night :)
[14:41:15] <bblack>	 it seems to me that in terms of cost/benefit/risks
[14:41:39] <bblack>	 either you go all in and say "Our goal is for services to be active/active capable", at which point testing and failover becomes routine/easy and you have confidence in all related things
[14:41:53] <bblack>	 or you just accept that you have 1 primary DC and when it fails SHTF and recovery time will be long
[14:42:07] <bblack>	 the cost/benefit/risk/etc just doesn't work out trying to be in that middle ground
[14:42:33] <paravoid>	 that essentially abandonding the MW multiDC comes at a cost to us for routine failovers and incorporates risks
[14:42:39] <paravoid>	 abandoning*
[14:42:47] <paravoid>	 but yeah, as I said above:
[14:42:53] <paravoid>	 16:32 <@paravoid> broader comment: I think we should realign ourselves to whatever the outcome of the annual plan conversation is
[14:43:03] <_joe_>	 yup
[14:43:22] <bblack>	 another related point, though, is:
[14:44:11] <bblack>	 over the long term, our very complex blend of software/services/infra/meta-things will have to move DCs periodically (see: tampa->eqiad->codfw, possible future no-eqiad scenario on a couple fronts)
[14:44:43] <bblack>	 and those transitions, while simpler in general than fast failovers, also become much much easier to cope with in a multi-dc active/active world
[14:44:56] <bblack>	 so that's another side-benefit we get out of taking on the costs of heading in that direction
[14:45:24] <bblack>	 it seems to me that on the balance of things, it's the logical thing to aim for at our scale/complexity
[14:46:12] <bblack>	 even MW can't go active/active, if we can get everything else to do so and accept MW (for pragmatic reasons) as the lone exception that requires a brief RO/outage, that's not awful.
[14:47:01] <paravoid>	 I don't disagree fwiw
[14:47:05] <volans>	 +1
[14:47:36] <paravoid>	 bblack: let's discuss this specific goal in a little more detail tonight?
[14:47:46] <paravoid>	 (my night anyway :)
[14:47:53] <bblack>	 yeah
[14:47:57] <paravoid>	 and you should say all that :)
[14:48:12] <paravoid>	 much better said than I could ever say them
[14:48:27] <mark>	 so
[14:48:35] <mark>	 let's write up a realistic outcome for next fiscal year
[14:48:38] <mark>	 for annual planning
[14:48:51] <paravoid>	 in any case it doesn't make much sense for us to be working on this at full speed but having little or no support from MW or perf
[14:48:53] <mark>	 is that active/active-read or active/active
[14:49:10] <paravoid>	 if it gets dropped, it gets dropped and we should give it a similar priority too imho
[14:49:33] <mark>	 I can see two outcomes
[14:49:37] <paravoid>	 we focused on a quicker and more systematic switchover this quarter at the expense of other goals like k8s
[14:49:39] <mark>	 getting active/read-only active in place
[14:49:43] <_joe_>	 honestly, being sure that GETs do not cause writes in mediawiki would already be a huge boost for our ability to manage switchovers smoothly
[14:49:44] <mark>	 and reducing the time to do a switchover
[14:51:38] <bblack>	 it would be a huge boost to "hey let's implement the semantics of the core protocol of both the internet and our product right, so that other software integrating it doesn't make faulty assumptions and break things" front
[14:51:55] <bblack>	 writing GETs are just awful tech debt to begin with
[14:52:21] <paravoid>	 yeah and last I heard it will affect our TLS 1.3 deployment too
[14:52:39] <paravoid>	 (yes, I know it sounds completely unrelated :)
[14:52:49] <_joe_>	 paravoid: uh?
[14:52:52] <_joe_>	 how come?
[14:53:17] <paravoid>	 the 0-rtt part of TLS 1.3 opens up the possibility of replay attacks
[14:53:42] <bblack>	 yeah, it's similar to the tcp fastopen issue
[14:53:48] <_joe_>	 oh, ok
[14:54:21] <bblack>	 "GET is idempotent and safe" is a core assumption of the internet, and there's probably fallout we can't even predict both in-house and without
[14:54:21] <paravoid>	 "A possible solution might be a TLS stack API to let applications designate certain data as replay-safe, for example GET / HTTP/1.1 assuming that GET requests against a given resource are idempotent."
[14:54:26] <paravoid>	 for example
[14:54:33] <paravoid>	 this is from https://timtaubert.de/blog/2015/11/more-privacy-less-latency-improved-handshakes-in-tls-13/
[14:54:40] <paravoid>	 which isn't what I had previously read
[14:55:03] <bblack>	 varnish caches GETs for instance.  and more-importantly, varnish will replay failing GETs to alternate backends assuming that's safe.
[14:55:21] <paravoid>	 yeah also see https://blog.cloudflare.com/tls-1-3-overview-and-q-and-a/
[14:55:37] <paravoid>	 The solution is that servers must not execute operations that are not idempotent received in 0-RTT data. Instead in those cases they should force the client to perform a full 1-RTT handshake. That protects from replay since each ClientHello and ServerHello come with a Random value and connections have sequence numbers, so there's no way to replay recorded traffic verbatim.
[14:55:43] <paravoid>	 Thankfully, most times the first request a client sends is not a state-changing transaction, but something idempotent like a GET.
[14:56:31] <paravoid>	 i.e. state-changing GETs means that we open up the possibility of 0-RTT TLS 1.3 traffic being captured and replayed
[14:56:43] <paravoid>	 or that we maintain an external blacklist of state-changing GETs, which is ewwww
[14:57:08] <paravoid>	 so that part is tech debt as bblack said
[14:57:19] <_joe_>	 paravoid: I don't remember specifically which non-idempotent GETs we allow
[14:58:20] <bblack>	 the last I heard (months ago) nobody knew for sure, and they were digging to be sure they found them all
[14:58:33] <paravoid>	 action=purge used to request confirmation only for anons
[14:58:38] <paravoid>	 and just did it for logged-in users
[14:58:44] <bblack>	 there's the opposite problem too, where we have readonly POSTs as well, but that's just an efficiency problem really
[14:58:48] <paravoid>	 but I see that's not the case anymore?
[15:00:28] <bblack>	 I donno
[15:00:49] <bblack>	 _joe_: comments on the WIP dns change?
[15:01:04] <_joe_>	 bblack: incoming in a few
[15:01:05] <bblack>	 I still have it on the head of my local clone, I was going to merge up structural stuff today I think
[15:01:33] <_joe_>	 bblack: what structural stuff?
[15:01:45] <bblack>	 (there's no great way to test that part since it's so meta to all involved things - will just put authdns on cp1008 if it's not still there from before, disable puppet on other authdns, and see what sticks)
[15:02:27] <bblack>	 _joe_: structural stuff is: https://gerrit.wikimedia.org/r/#/c/340154/ + https://gerrit.wikimedia.org/r/#/c/340156
[15:02:40] <bblack>	 (changing the basic config include layout, no-op functionaly, but risky/annoying to deploy)
[15:03:15] <_joe_>	 bblack: I'll look at those, I was looking at https://gerrit.wikimedia.org/r/#/c/331789
[15:03:55] <bblack>	 they're a prereq to 331789
[15:04:17] <bblack>	 (the puppet side was part of it, I split it out to simplify testing of 331789 after the structure stuff is deployed)
[15:10:26] <bblack>	 it seems like, on a fresh re-read, 331789 is missing some things anyways
[15:10:35] <bblack>	 like, the confd config to make use of the template?
[15:12:08] <_joe_>	 bblack: yes, you need to use confd::file there
[15:12:16] <_joe_>	 I'm commenting already :)
[15:17:27] <ema>	 mailbox lag on cp1074 is > 1.2M, keeping an eye on it. We can definitely increase the alerting threshold though as there have been no 503s yet
[15:17:52] <bblack>	 I don't know if I ever observed a recovery from such a number (without restart)
[15:18:04] <bblack>	 it's still several days away right?
[15:18:14] <ema>	 yeah, restarted 2 days ago
[15:18:25] <ema>	 I think we did observe a recovery roughly at this point 
[15:18:30] <ema>	 (without restart)
[15:18:30] <bblack>	 of course some of it is load-pattern sensitive too
[15:18:39] <ema>	 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?from=now-10d&to=now&panelId=21&fullscreen
[15:18:46] <bblack>	 it can backlog under peak traffic plateau, then recover during lower load later in the day
[15:19:18] <bblack>	 one of the many design-change sorts of things that's still backlogged to look into
[15:19:40] <bblack>	 is sorting out our grace/keep TTLs and VCL behaviors, and then clamping down to 24h for the primary TTL even in the backends
[15:19:58] <bblack>	 which will probably have some impact on the lag issue
[15:20:26] <bblack>	 I think at present, we're actually worse-off on grace behavior than we were in varnish3, in the net of all of our changes to get through the version hurdle
[15:20:27] <volans>	 ema: [wishlist] in those graph it would be nice to have varnish restarts as vertical annotations ;)
[15:20:52] <bblack>	 we're setting beresp.grace as before on appserver-delivered objects into cache
[15:20:58] <bblack>	 but we're not setting req.grace at all
[15:21:20] <_joe_>	 volans: no shit
[15:21:22] <bblack>	 and we don't have the magic in vcl_hit to look at grace/keep either, like that blog post from varnish
[15:21:31] <ema>	 volans: yep. For now you can see the restarts by looking at other graphs (eg: cached objects) in the general view https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?from=now-7d&to=now
[15:23:16] <volans>	 yeah, of course
[15:23:59] <bblack>	 what we're missing (magic in vcl_hit) is roughly like the 3rd block of code in: https://info.varnish-software.com/blog/grace-varnish-4-stale-while-revalidate-semantics-varnish
[15:24:57] <bblack>	 the other other thing (which could be a follow-on to all the above, instead of solving it at the same time), is to align cache aging across layers via Surrogate-Control
[15:25:39] <bblack>	 so instead of capping object TTLs at "24h in this layer", we can cap them for the entire cache infra, and not have the "well, mostly 24h, except it could be up to 48h in ulsfo or 72h for a certain service via esams, in edge cases", etc
[15:28:03] <bblack>	 and I'm not even sure if 24h works out as a TTL cap anyways.  I think for text it probably will.  upload might have to be longer? it's hard to say without testing in the real world.
[15:28:18] <bblack>	 we've analyzed stats before and it's likely 24h is close to optimal, anyways, at least for text
[15:28:32] <bblack>	 (where optimal is: as small as we can set it without it damaging hitrate much)
[15:28:41] <_joe_>	 bblack: I commented on the change, I'm now off to other things :)
[15:28:48] <bblack>	 (or getting so small that it's a factor in quick outages / failovers /maintenance)
[15:31:03] <ema>	 re: grace, was there any specific reason not to port the relevant vlc to v4?  
[15:31:29] <bblack>	 ema: well, the way it works changed pretty dramatically
[15:31:36] <bblack>	 I don't think req.grace really exists anymore in practice
[15:34:07] <ema>	 right, we do set beresp.grace but don't act on it in vcl_hit
[15:34:46] <bblack>	 right, which makes it basically ineffective
[15:35:01] <bblack>	 which also means we currently never use stale-while-revalidate, which is kinda awful for short-lived objects that are hot :)
[15:37:01] <ema>	 1.5M lag on cp1074, cp1072's is also starting to grow
[15:37:34] <bblack>	 I think we want almost precisely what the blog examples does for vcl_hit, minus the textual output stuff (e.g. set req.http.grace = "normal(limited)";)
[15:37:43] <bblack>	 and s/10s/5m/ to match old behavior (5m grace as normal)
[15:39:35] <bblack>	 this probably will have a positive effect on the load.php spike thing and the ticket to force age:0
[15:39:41] <bblack>	 (will probably reduce the spiking)
[15:40:10] <bblack>	 that whole load.php thing is complex anyways
[15:40:29] <bblack>	 5m TTL isn't a great solution to begin with, we just don't have better ones readily available.
[15:40:56] <bblack>	 probably the Right answer is to attach an xkey to all load.php outputs and purge that xkey on resource deploy
[15:41:04] <bblack>	 (which doesn't happen every 5 minutes)
[15:41:32] <bblack>	 assuming there aren't too many thousands of unique load.php outputs
[15:46:10] <_joe_>	 there are
[15:46:12] <_joe_>	 :)
[15:46:34] <bblack>	 _joe_: why?
[15:46:50] <bblack>	 it's skins * langs * projects * ?
[15:47:41] <_joe_>	 it's a few thousand objects IIRC
[15:47:48] <bblack>	 ok
[15:47:51] <_joe_>	 sorry, tens of 
[15:47:54] <bblack>	 some of which might be rare
[15:47:58] <_joe_>	 yes
[15:48:08] <_joe_>	 also, it should be mostly well cached in apc
[15:48:14] <bblack>	 yet another time I really wish varnish had some kind of cache explorer
[15:49:00] <bblack>	 (select count(*) from malloc_cache where req.url ~ "^/load\.php")
[15:54:00] <ema>	 that would be cool
[15:57:01] <ema>	 https://docs.trafficserver.apache.org/en/latest/admin-guide/storage/index.en.html#inspecting-the-cache
[16:01:42] <ema>	 1.7M, new record
[16:12:58] <ema>	 cp1072 recovered on its own, instead
[16:17:25] <ema>	 no 503s but there's a bunch of 500s like this one: https://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Map_of_Virginia_highlighting_Arlington_County.svg/145px-Map_of_Virginia_highlighting_Arlington_County.svg.png
[16:17:37] <ema>	 Error creating thumbnail: /usr/bin/timeout: the monitored command dumped core /srv/mediawiki/php-1.29.0-wmf.13/includes/limit.sh: line 101: 4773 Segmentation fault /usr/bin/timeout $MW_WALL_CLOCK_LIMIT /bin/bash -c "$1" 3>&-
[16:26:32] <ema>	 mailbox lag recovered on cp1074 without restart too
[16:28:00] <godog>	 always joyful to compute on bytes provided by internet users
[16:28:27] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#2657740 (10Dzahn) I think dash.wmflabs.org would be no problem at all, but dash.wikimedia.org might be because it implies a production service a...
[16:46:51] <wikibugs_>	 10Wikimedia-Apache-configuration: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3053534 (10VictorGrigas) Hi everyone, I have a video ready to go to promote this new URL, and will be going on paternity leave in (possibly under) a month. It would be nice if this coul...
[16:51:18] <mutante>	 and there is the next "numbers-only" domain they want. and since "15" set a precedent...
[16:53:23] <mutante>	 "We've come to consensus on this redirect and I'd prefer not to go through too many cycles of discussions."    
[17:05:56] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3061658 (10Vojtech.dostal) dash.wmflabs.org is a bit awkward.   could the WMF education team register a completely different domain of first ord...
[18:33:46] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3061963 (10Dzahn) What about the option of using dash.wikimedia.org (or outreach.wikimedia.org?) but also actually moving your service into prod...
[18:40:57] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3061975 (10Ragesoss) @Dzahn I would love to move it production. I understand that it's a more complicated thing than many of those "microservice...
[19:48:39] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3062165 (10Dzahn) @Ragesoss I am willing to help with the productionizing. One thing to start with would be a list of requirements (software pac...
[19:50:11] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3062166 (10Dzahn) p:05Triage>03Normal
[20:52:26] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3062482 (10Dzahn) @Ragesoss @dduvall   Oops, is this the same thing that is suggested for deletion here?  https://gerrit.wikimedia.org/r/#/c/340...
[20:55:30] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3062487 (10Ragesoss) @Dzahn That was part of the groundwork I mentioned, but it isn't actively being worked on. I think the idea was to delete i...
[20:56:44] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3062489 (10Ragesoss) @Dzahn: a quick glance, it's also out of date in terms of the requirements; among other changes, the project is on Ruby 2.3...
[21:35:03] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3062640 (10Dzahn) @Ragesoss Ok, just seems like a waste of good work to remove it entirely and start from scratch. Maybe we can start by updatin...
[21:38:11] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3062646 (10Ragesoss) @Dzahn sounds good. I'll start pulling together all the dependency updates I notice. I added the Gerrit one, and I guess th...
[22:38:52] <wikibugs_>	 10Domains, 10Traffic, 10Education-Program-Dashboard, 06Operations: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332#3062964 (10Dzahn) @Ragesoss Nice. Yea, those tasks on that workboard column sound about right to me. I'll comment on T159274 for getting the Ger...