[00:48:19] Wikimedia-Apache-configuration, Patch-For-Review: Clean up redirects.conf/redirects.dat (remove en2.wikipedia.org, etc.) - https://phabricator.wikimedia.org/T105981#3116704 (Dzahn) Dzahn - 03-16 17:30 Patch Set 1: Code-Review+1 yes, i removed most (all) of these myself from DNS in the past --- double...
[00:49:03] Wikimedia-Apache-configuration, Patch-For-Review: Clean up redirects.conf/redirects.dat (remove en2.wikipedia.org, etc.) - https://phabricator.wikimedia.org/T105981#3116705 (Dzahn) the next one in the dependency chain is: https://gerrit.wikimedia.org/r/#/c/322602/ which i +1ed and looks trivial enough
[06:18:20] <_joe_> bblack: sorry, i wasn't around anymore
[06:19:15] <_joe_> I am not sure what went on yesterday evening, but I'm ok working myself on the dns/conftool entries
[06:21:33] <_joe_> volans, godog for deployment-prep testing of dns discovery, the main obstacle is - I would say - that we don't really manage things the same way in deployment-prep
[06:21:51] <_joe_> but I'm happy to set up conftool to work with deployment-prep data if needed
[06:23:07] <_joe_> actually, I'll do that today, then we can simply install a gdnsd instance managing some discovery domain
[09:30:11] _joe_: if it isn't a time sink then yeah I think it'd be nice to replicate in deployment-prep as much as we can, if not we can use codfw too
[09:30:42] <_joe_> well it is a time sink, a huge one, if we want to set up authdns there as well
[09:30:52] <_joe_> and the whole thing to mean something to applications, too
[09:30:57] I think that for general testing that will be great godog, _joe_, but for this specific test godog you just need a DNS record to change and to verify that the script picks up the new one
[09:31:34] I've looked at the script quickly and it basically uses urllib2 (through eventlet), so it should use the plain OS resolution
[09:31:57] <_joe_> script?
[09:32:05] rewrite.py in puppet
[09:32:11] is the one using the config that we change
[09:32:15] with the commit
[09:32:24] <_joe_> oh just that?
[09:32:28] <_joe_> ok
[09:32:32] godog: anything else?
[09:32:58] <_joe_> volans: are we using an IP in that config now by any chance?
[09:33:02] no
[09:33:24] volans: not for swift afaict now
[09:33:53] swift::proxy::rewrite_thumb_server: 'rendering.svc.eqiad.wmnet'
[09:33:59] <_joe_> uhm, I am more and more convinced we need local dns resolvers with proper name caching
[09:34:10] in hieradata/common/swift/proxy.yaml
[10:57:58] Traffic, Operations, Pybal: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433#3117465 (ema) I've been testing twisted 16.2.0 on pybal-test2001 for a while and the various monitoring protocols look good. I'm going to upgrade twisted on lvs1007-12 as a next step.
[13:29:02] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3117737 (Beetlebeard) >>! In T158638#3110288, @Dzahn wrote: > @Kaarel_Vaidla @Beetlebeard I can make the change but i wanted to check with...
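To go with the rewrite.py discussion from earlier this morning (09:31): since the script relies on plain OS resolution through urllib2/eventlet, a spot-check along these lines can confirm that a changed record is being picked up. The helper is hypothetical and not part of rewrite.py; only the rendering.svc.eqiad.wmnet hostname comes from the hieradata value quoted above.

```python
# Hypothetical spot-check, not part of rewrite.py: resolve the configured
# thumb-server hostname through the normal OS resolver (getaddrinfo), the same
# path urllib2/eventlet ends up using, and compare before/after a DNS change.
import socket

def resolve_all(hostname):
    """Return the set of IPv4 addresses the OS resolver currently gives."""
    infos = socket.getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    return {sockaddr[0] for _family, _type, _proto, _canon, sockaddr in infos}

if __name__ == "__main__":
    # hostname taken from hieradata/common/swift/proxy.yaml as quoted above
    print(resolve_all("rendering.svc.eqiad.wmnet"))
```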
[13:56:06] _joe_: yeah, I don't think a proper full-stack test of the authdns stuff in deployment-prep is realistic at this point
[13:56:49] because it has its own different authdns/recdns setup, mostly. Basically we've never been prepared to test regular authdns there.
[13:57:46] _joe_: also, in case you missed it - the hostnames are there since sometime yesterday (using mock config to pass lint)
[13:58:38] bblack@cp2001:~$ host restbase.discovery.wmnet
[13:58:39] restbase.discovery.wmnet has address 10.2.1.17
[14:14:34] <_joe_> oh wow
[14:14:57] <_joe_> i need to play with it now :P
[14:18:08] also, beware of the mock-config linting, a different linting problem is in play: you *need* to have hieradata/discovery.yaml updated and puppet-merged and the agent run on the authdns servers before deploying a matching authdns change that adds new records + mock entries.
[14:18:17] (if you're adding another service)
[14:18:41] if you try to push the DNS side first, it will pass authdns-lint (due to the mock fakery), but then fail during the actual update (for lack of the real config data from puppet)
[14:20:59] <_joe_> I'll document that
[14:21:32] bblack, naming: failoid or nulloid? :)
[14:21:34] <_joe_> so now the TTL is 5 minutes I see
[14:21:46] <_joe_> we should make it 10 seconds maybe?
[14:21:54] should we?
[14:22:15] <_joe_> I think we should, 5 minutes is way too much time IMHO
[14:22:16] also, we have options for variability too
[14:22:47] <_joe_> yeah you're right, of course, we could reduce TTLs before doing a switch, specifically for mediawiki
[14:22:49] (e.g. we can make it drop or rise when one side of active/active goes out, etc)
[14:22:56] or that
[14:23:01] <_joe_> or better, active/passive services should have a shorter ttl maybe
[14:23:20] <_joe_> because the switch time gives you a disservice
[14:23:46] well it's always async, the question is how much time you're willing to live with
[14:24:00] for a planned fail event, it kinda doesn't matter since we can plan ahead
[14:24:05] <_joe_> yes
[14:24:12] for a real event (what we're aiming for), things are different
[14:24:59] but 10 seconds also seems extreme
[14:25:13] during normal times, that causes a ton of needless network traffic
[14:26:18] I guess the host-level caching is a factor here too
[14:26:24] (whether we plan to get there as part of this)
[14:26:45] <_joe_> bblack: AFAICT, most of our apps don't cache DNS queries at all
[14:26:55] <_joe_> php surely doesn't
[14:27:10] <_joe_> nodejs should now have a 5-second cache
[14:27:17] let's try something less extreme initially and then work downwards after we've seen it not be an issue? maybe 30s?
[14:27:25] <_joe_> ack
[14:27:33] <_joe_> we can start with 300 now
[14:27:41] <_joe_> and just move one thing to discovery
[14:27:54] "now" as in today or as in apr 19?
[14:27:58] <_joe_> and see how much more dns traffic we get turning it down
[14:28:03] <_joe_> today :)
[14:28:03] ok
[14:28:33] so I have Plans that will make that moot for recdns<->authdns, but probably not by then
[14:29:04] we also still don't have the new pdns_recursor + edns-client-subnet tested/deployed
[14:29:34] so there's some small but non-zero probability, especially during network misbehaviors, of getting the "wrong" answer for an active/active query (going to the other DC pointlessly)
[14:29:56] <_joe_> so I will switch parsoid to discovery
[14:30:09] <_joe_> yeah I know, that's why I just wanted to move parsoid for now
[14:31:14] if you want to do some initial testing on seeing results of conftool behaving as expected, probably best to test against authdns directly
[14:31:28] (I tested those manually a while back, but not a bad idea to confirm)
[14:31:40] e.g. a/a vs a/p and how they handle 1-depool or both-depool, etc
[14:32:08] dig @ns0.wikimedia.org foo.discovery.wmnet (from a host in each DC) can confirm that behavior without worrying about the cache layer interfering
[14:33:02] the "both up but active/passive" check is at the confd level, so it just fails to modify the running statefile when the temp output fails that check
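The same authdns-direct check can also be scripted rather than done with dig by hand. This is a hedged sketch using dnspython: the service name is only an example and the nameserver list is an assumption, but the idea (ask each authoritative server directly so the recursive cache can't interfere, from a host in each DC) is the one described above.

```python
# Sketch of the dig-style check above: query the authoritative servers
# directly so recursive caching can't mask a depool. The geoip answer depends
# on where you run this from, hence "from a host in each DC".
import socket
import dns.message
import dns.query  # dnspython

AUTHDNS = ["ns0.wikimedia.org", "ns1.wikimedia.org"]   # assumption: which servers to ask
QNAME = "appservers-rw.discovery.wmnet"                # example discovery record

for ns in AUTHDNS:
    ns_ip = socket.gethostbyname(ns)
    response = dns.query.udp(dns.message.make_query(QNAME, "A"), ns_ip, timeout=2)
    for rrset in response.answer:
        for rr in rrset:
            print(f"{ns}: {QNAME} -> {rr.address} (ttl={rrset.ttl})")
```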
[14:34:13] <_joe_> bblack: already done those tests
[14:34:36] <_joe_> sorry I had to step away from the keyboard a second, family "emergency"
[14:35:49] ok cool, no obvious issues?
[14:39:44] <_joe_> not that I've seen, but I am still toying with it
[14:41:55] <_joe_> Mar 21 14:41:33 radon confd-lint-wrap[3626]: updating error mtime on /var/run/confd-template/.discovery-appservers-rw.state714631629.err
[14:42:04] <_joe_> yeah it works as expected
[14:42:11] <_joe_> I want to wait for icinga to notice
[14:42:15] <_joe_> and alert us
[14:42:47] <_joe_> volans: please consider the TLS case
[14:42:49] <_joe_> for nulloid
[14:43:09] <_joe_> we do have services contacting the mw api via https at the moment
[14:43:31] _joe_: we said no listening port yesterday, didn't we?
[14:43:40] <_joe_> volans: oh ok :)
[14:43:57] then we can make that explicit with iptables or not :D
[14:49:38] volans: I don't really care about the name, I think it's called fail on the DNS config side, but it's easy to switch
[14:49:54] <_joe_> I found an issue in our confd monitoring
[14:49:55] <_joe_> meh
[14:50:02] yeah I'm going with failoid :)
[14:53:53] resource-wise, what is our minimum? in eqiad the ganeti cluster is pretty full
[14:57:54] whatever the minimum is I guess
[14:58:09] lol
[14:58:10] really there should be no applayer load other than running some baseline puppet agent config updates
[14:58:24] the kernel's just going to be handing out connection-refused packets
[14:58:34] I was wondering if 1 vCPU and 512MB of RAM is enough
[14:58:40] it should be I guess
[14:58:45] yeah should be plenty
[14:59:00] worst case, failoid actually fails and they see timeouts instead of conn-refused
[14:59:01] <_joe_> 640 KB should be enough for everyone
[14:59:12] (which probably means the applayer stuff is trying too hard with aggressive retries)
[14:59:22] eheh
[15:06:40] http://dnsdist.org/ is pretty cool
[15:11:24] bblack: has anything changed for deploying normal dns changes? I would need to merge https://gerrit.wikimedia.org/r/#/c/343877 to create the ganeti instances
[15:14:20] volans: nothing has changed, so long as you're not editing actual discovery entries :)
[15:14:43] great
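On the failoid behaviour discussed above (14:58-14:59): with nothing listening, the kernel answers a connection attempt with an immediate reset, so clients fail fast rather than hanging. A quick sketch of the difference a client would see; host and port here are placeholders, not the real failoid address.

```python
# Placeholder probe illustrating the failoid failure modes discussed above:
# a closed port yields an immediate "connection refused" (the intended case),
# while an unreachable/overloaded failoid would show up as a timeout instead.
import socket

def probe(host, port, timeout=3):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected (not expected against failoid)"
    except ConnectionRefusedError:
        return "connection refused - fast failure, the intended failoid behaviour"
    except socket.timeout:
        return "timed out - the 'failoid itself fails' worst case mentioned above"

print(probe("192.0.2.1", 443))  # 192.0.2.1 is a documentation address, not failoid
```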
[15:16:11] _joe_: re TTLs, we could make those confd-controlled too, if there's a logical, easy way to add more metadata there (kinda like "weight" in the LVS/pybal case)
[15:16:33] <_joe_> bblack: uhm, yeah we can extend the schema
[15:17:00] the thing is the TTL is per-record, not per-DC. But the inputs are per-DC, so I guess we map it that way.
[15:17:08] that probably made no sense
[15:17:38] <_joe_> no it did make sense
[15:18:02] so, in the statically configured stuff in zonefiles, we can set a max/min, like "parsoid 300/10 IN DYNA geoip!disc-parsoid", which means the TTL to the user never falls outside of the range 10-300
[15:18:41] and (ignoring the 10,000 other ways to do it that aren't currently relevant), then in the statefiles confd is writing, instead of marking each DC as UP or DOWN
[15:18:57] we can say "UP/240" or "DOWN/33" or whatever value we pull from confd
[15:19:40] and gdnsd will do some logical combining of those, which is really only relevant for true monitoring, not this immediate statefile, so we'd probably want to just keep it the same on all DCs, I think
[15:21:21] I guess what I'm saying is that from our logical perspective, what we're actually doing, the TTL would be a value we set in confd per-service
[15:21:40] in the templating, it will get emitted per-service-per-dc, but we can paper over that in the template (output it twice)
[15:23:02] but if it's easier schema-wise to make it per-dc, we can do that too and just keep them in sync when we edit them
[15:23:57] <_joe_> yeah I'd suggest the latter
[15:24:26] I think what will happen if they fall out of sync is that gdnsd will take the lesser of the two TTLs, regardless of which is up or down
[15:24:29] <_joe_> I also have to test if such a schema change can be done on the fly or whether it needs some hand-mangling from me
[15:24:36] but thinking through that for all scenarios is challenging heh
[15:24:43] <_joe_> yeah
[15:25:38] and failoid itself is a factor for active/passive. there's not currently a confd control for failoid's TTL
[15:26:01] but we can maybe fix it low for that case, and I don't think it comes into play until we switch to it temporarily
[15:27:47] (yeah, just checked the docs, it would be easy to statically configure the "failoid" TTL input)
[15:34:37] <_joe_> bblack: so I am perplexed by what follows
[15:34:53] <_joe_> oh no, nevermind
[15:36:39] I guess what I meant is: in the case that active/passive is etcd-set to down+down and we switch to the failoid IP, it has its own TTL input to the final decision
[15:36:47] (which only comes into play during down/down)
[15:37:20] but it can't set it higher than the etcd TTLs for the service, only lower
[15:37:50] the basic way the TTL logic works can be described sort of like this:
[15:38:15] <_joe_> bblack: would you object to me merging https://gerrit.wikimedia.org/r/#/c/340993/ ?
[15:39:00] I don't have any objection, if you're comfortable testing this live, with a 300 TTL :)
[15:39:16] <_joe_> I am :P
[15:39:46] <_joe_> actually, the failoid TTL should be shorter than the TTL of "normal" entries, yes
[15:39:59] we really do need a better/more-ops-level equivalent of deployment-prep that replicates the whole infra stack :/
[15:40:16] oh back to what I was saying
[15:40:19] <_joe_> when we get to the point where apps use -rw and -ro to contact mediawiki
[15:40:27] the basic way the TTL logic works can be described sort of like this:
[15:40:30] <_joe_> for now we will avoid using failoid during the switch
[15:40:41] <_joe_> as we can't really separate reads and writes
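To make the per-service vs per-DC mapping above concrete, here is a rough sketch of the "output it twice" idea: one TTL stored per service, emitted once per DC alongside the UP/DOWN state. Field names and the exact statefile line layout are assumptions for illustration, not the real confd template or gdnsd syntax; only the "UP/240" and "DOWN/33" shape comes from the chat.

```python
# Illustration only: render per-service-per-DC state lines from a per-service
# TTL, duplicating that TTL for each DC as described in the chat. The
# "disc-<service>/<dc> => STATE/TTL" layout is invented for this sketch.

def render_statefile(services):
    """services: {name: {"ttl": seconds, "datacenters": {dc_name: is_up}}}"""
    lines = []
    for name, svc in sorted(services.items()):
        for dc, is_up in sorted(svc["datacenters"].items()):
            state = "UP" if is_up else "DOWN"
            lines.append(f"disc-{name}/{dc} => {state}/{svc['ttl']}")
    return "\n".join(lines)

print(render_statefile({
    "appservers-rw": {"ttl": 300, "datacenters": {"eqiad": True, "codfw": True}},
    "parsoid": {"ttl": 300, "datacenters": {"eqiad": True, "codfw": False}},
}))
```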
[15:41:21] gdnsd has this ideally-ordered list (based on e.g. geoip) of resources it can answer a query for a hostname with; like for appservers-rw, it has [eqiad-IP, codfw-IP], and if those are both dead it moves on (via metafo) to [failoid-IP]
[15:41:40] each of those IPs has a TTL input from monitoring (which in our case, monitoring is really just config/statefiles)
[15:42:06] whichever entries it has to traverse before it finds an "UP" answer to give, the minimum of that set is the output TTL
[15:42:13] <_joe_> so this time what we will do is set codfw to active too, and that won't be updated. once we have actually switched over mediawiki, we will set eqiad down
[15:42:36] <_joe_> sorry, bbiab
[15:42:55] so the TTL is basically the minimum of: the source-TTL of the answer it actually gave, and also of every down/failed thing that would have been preferable but was skipped
[16:34:23] Traffic, Operations: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#2696835 (Gilles) On the tech-mgmt meeting you mentioned this was underway, is there another phab task for it?
[16:46:16] Traffic, Operations: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3118280 (BBlack) This is it. We're currently still testing/deploying the kernel that allows it to be enabled. After that we can do some testing/evaluation on BBR itself and report here....
[17:00:19] Traffic, Operations, Performance-Team: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3118331 (Gilles)
[18:35:43] Traffic, Discovery, Operations, Wikidata, and 2 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3118713 (Gehel) At this point, the only workable option is the "single LDF server" (apart from abandoning LDF completely). So let's...
[19:59:38] ema: on the topic of grace/keep - longer grace times = more chance to do a stall-free answer near the expiry boundary (immediate cache answer + background refresh)
[20:00:02] but in the healthy case, we keep it minimal (5m), because we don't want to violate the TTL contract by too much
[20:00:12] what about effectively moving grace inside of the TTL instead of outside of it
[20:00:41] ignoring other little complexities for a second, the idea would be:
[20:01:04] 1) Object comes in with a natural (or capped, whatever) TTL of 86400 (1d), which is what we're calling the real/contract TTL
[20:02:26] 2) We set grace at, say, 10% of object life within certain reasonable bounds. So for this object it's 8640s (2.4h). Maybe we set a lower bound at something like 5m regardless of TTL (and deal with that corner case below as well).
[20:02:51] 3) Then we subtract the calculated grace from the TTL before setting the object grace.
[20:03:21] 4) So it ends up as obj.ttl = 77760, obj.grace = 8640
[20:04:35] what we're saying is instead of "violate the TTL by up to 5 minutes in the interest of stale-while-revalidate optimization", it's now "Never violate the TTL, but try to take the opportunity to revalidate async during the final 10% of the object's life if a hit comes through during that time"
[20:05:04] it gives us a broader opportunity to get the async refresh, without causing a broader breach of the contracted TTL
[20:07:55] and whereas before we set obj.grace=60m, and then req.grace varied on backend health (5m/60m), we can set this broad-but-safe grace value on the object itself, and use the maximum grace (ttl_cap/10, 8640) in req.grace.
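A compact sketch of that grace-inside-TTL arithmetic. Assumptions are mine: this is plain Python rather than the actual VCL, and clamping tiny TTLs this way is only one possible handling of the corner case mentioned in step 2.

```python
# Carve the grace window out of the contracted TTL instead of adding it on
# top: stale-while-revalidate can then happen during the last ~10% of the
# object's life without ever serving past the origin's TTL.
GRACE_FRACTION = 0.10   # 10% of object life, per the example above
GRACE_FLOOR = 300       # 5-minute lower bound, also from the discussion

def split_ttl(contract_ttl):
    """Return (obj_ttl, obj_grace) with obj_ttl + obj_grace == contract_ttl."""
    grace = max(int(contract_ttl * GRACE_FRACTION), GRACE_FLOOR)
    grace = min(grace, contract_ttl)   # crude guard so a tiny TTL isn't driven negative
    return contract_ttl - grace, grace

print(split_ttl(86400))   # -> (77760, 8640), the 1-day example from the chat
```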
[20:08:41] and we don't really care about backend health for grace reasons. We're generally setting a larger/healthier grace all the time, and there is no tradeoff to consider about when we violate a contract in the interest of uptime.
[22:57:38] Traffic, Operations: Define 3-host infra cluster for traffic pops - https://phabricator.wikimedia.org/T96852#3120141 (BBlack)
[22:58:14] Traffic, Operations: Define 3-host infra cluster for traffic pops - https://phabricator.wikimedia.org/T96852#1227571 (BBlack)
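Circling back to the gdnsd TTL-combination rule described earlier in the log (15:41-15:42): a rough model of "the answer's TTL is the minimum over everything traversed", with made-up resource data purely for illustration.

```python
# Walk a preference-ordered resource list; answer with the first UP entry, and
# cap the answer's TTL by every preferable-but-DOWN entry that had to be
# skipped. IPs and TTL values below are invented for the example.

def answer_with_ttl(ordered_resources):
    """ordered_resources: list of (label, ip, is_up, ttl) in preference order."""
    skipped_ttls = []
    for _label, ip, is_up, ttl in ordered_resources:
        if is_up:
            return ip, min([ttl] + skipped_ttls)
        skipped_ttls.append(ttl)
    return None, min(skipped_ttls) if skipped_ttls else 0

resources = [
    ("eqiad", "10.2.2.1", False, 300),    # preferred but DOWN: still caps the TTL
    ("codfw", "10.2.1.1", True, 300),     # first UP entry: this is the answer
    ("failoid", "10.2.1.99", True, 60),   # only reached when both DCs are down
]
print(answer_with_ttl(resources))  # -> ('10.2.1.1', 300)
```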