[00:59:30] 10Traffic, 10Operations: Implement machine-local forwarding DNS caches - https://phabricator.wikimedia.org/T171498#3468712 (10faidon) I think this is a good idea overall and that we should be doing that. A few points: - I'm worried a little bit that this will hide issues like the ones you mentioned under the c...
[07:48:43] paravoid: hey :)
[07:48:44] > The glibc resolver issues with multiple recursors/timeouts is something we can't get around from addressing I think
[07:48:53] what do you mean by this? ^
[07:49:32] that we should have multiple resolvers in resolv.conf anyway (127.0.0.53 and the DC-local recursor, for instance)?
[08:48:52] 10Traffic, 10Commons, 10Operations, 10media-storage: 503 error for certain JPG thumbnail: "Backend fetch failed" - https://phabricator.wikimedia.org/T171421#3469152 (10fgiunchedi) @Aklapper _usually_ traffic since this indicates varnish failure to fetch and most likely a network or varnish problem. See als...
[09:50:13] 10Traffic, 10Operations: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3469311 (10zhuyifei1999)
[10:03:22] 10Traffic, 10Operations: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3469363 (10zhuyifei1999)
[11:01:49] ema: yes
[13:50:50] 10Traffic, 10Operations: Implement machine-local forwarding DNS caches - https://phabricator.wikimedia.org/T171498#3469995 (10BBlack) >>! In T171498#3468712, @faidon wrote: > - I'm worried a little bit that this will hide issues like the ones you mentioned under the carpet. The cases where services are latency...
[14:26:32] the questions about both glibc and systemd-resolved options turn out to be really thorny the deeper you go
[14:26:41] I've wiped out what I wrote like 4 times already :P
[14:28:01] in the net, it doesn't seem like relying on systemd-resolved to do anything fancy is a good idea. It doesn't document its internal algorithms for handling multiple upstream NSes and timeouts/failovers, and even if it did we couldn't rely on that behavior not changing
[14:28:13] that and, I'm sure there are still bugs and misunderstandings lurking in its code
[14:28:49] the nice thing about systemd-resolved is that it could've been our NSS plugin to avoid glibc's default DNS code, but given the above it's not a good solution
[14:29:28] (it could still be a local cache in front of *something*, but then we still need that smart something beneath it, at which point that smart something can just be a cache too)
[14:30:43] things get complicated by the fact that not all software will use the standard interfaces that hit NSS anyway. Some software will parse /etc/resolv.conf on its own and use its own DNS library. So ideally /etc/resolv.conf points at something singular and smart as well...
[14:31:58] and really, any singular dns cache/daemon thing will at least occasionally be unavailable, so you always need a strategy to handle that.
[14:32:46] the idea of an NSS plugin that does a better job than glibc (by spamming parallel requests at short timeouts) does pretty well at that, assuming you have redundant remote servers you don't restart together (e.g. hydrogen + chromium)
[14:33:15] if we have anycast on all of those, it just makes it easier and better
[14:33:41] the NSS module of choice could be configured with just the anycast IP, and we assume at least one anycast endpoint (hopefully a DC-local one) is in service and reachable at any time.
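For contrast with what's being proposed, stock glibc's retry knobs here are coarse. A minimal sketch of the status-quo configuration, assuming the single anycast address discussed below:

```
# /etc/resolv.conf -- what stock glibc nss-dns offers against one anycast IP
nameserver 10.0.0.53
# glibc's timeout option is whole seconds (minimum 1), so sub-second retries
# for random loss aren't expressible without a custom NSS module
options timeout:1 attempts:3
```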
[14:34:01] (then we just need faster retries for random loss or whatever)
[14:34:24] I think systemd-resolved provides a legacy interface as well
[14:34:29] on 127.0.0.53 or something
[14:34:31] it does
[14:34:35] (optionally)
[14:34:57] yeah
[14:34:59] but then we're still facing "how do you handle the restart of that single daemon?"
[14:35:21] handle it gracefully, without causing perf issues with a slow timeout
[14:36:00] even if we leave systemd-resolved out of this it's a hard problem, and systemd-resolved just introduces new unknown behaviors
[14:36:36] nod
[14:36:40] we could simulate what we'd want of systemd-resolved with a pdns_recursor on 127.0.0.1 and a resolv.conf/NSS-module that tried that first before falling back to the DC-local stuff
[14:37:23] well systemd-resolved does socket activation too I think, so maybe it's handling restarts like that?
[14:37:39] well I haven't checked, but seems plausible :)
[14:37:49] assuming we had a parallel/fast NSS module to handle failure, and a pdns_recursor on 127.0.0.1, would we configure the NSS module with "127.0.0.1 ", and have it spam both? seems a waste
[14:38:09] paravoid: even if it's using socket activation, that doesn't avoid the resolution downtime/failover when the daemon is restarting
[14:38:22] it's at the very least going to be a latency spike
[14:38:42] (if not loss due to UDP buffer overflow on the activating socket)
[14:39:35] but the more-common interface (gethostbyname()) hits systemd-resolved via its NSS module using synchronous calls. what happens to those under restarts? (and if we knew, could we be certain that some answer we liked won't change later?)
[14:40:31] alternatively, if we're not using systemd-resolved's own NSS module, then there's little benefit to using it as a generic cache on 127.0.0.53 that glibc backends into. We could just use a better-tested one like pdns_recursor.
[14:41:43] the only scenario I've come up with that I really like the direction of (but still has some holes to figure out) is having a local pdns_recursor listening on the anycast IP too.
[14:42:09] so let's say we have anycast recdns (across 2x machines per DC for redundancy) at 10.0.0.53:53
[14:42:43] we can pretty much say that using that singular IP in resolv.conf is about as reliable as we're going to get, and maybe we need to tweak (via custom NSS) that we want retries to be faster and spammier to cover edge cases
[14:43:13] all that's missing now is a machine-local cache that doesn't screw up the behavior of that scheme
[14:43:28] so have the machine-local cache bind to 10.0.0.53:53 too and configure it on the loopback, so it can answer those when it's up
[14:43:47] the hangup is having a way to ensure it's not configured on the loopback when the daemon's dead
[14:44:05] but perhaps a systemd service dependency type of thing can handle that (one that adds and removes the loopback'd anycast IP)
[14:44:49] and then the local pdns_recursor would obviously be configured with the explicit remote cache IPs, not the anycast that itself is listening to.
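One way that service-dependency idea might look, as a rough sketch only (the unit name, the /32 on lo, and Debian's pdns-recursor.service as the daemon unit are all assumptions, not a tested config):

```
# /etc/systemd/system/recdns-anycast-lo.service -- hypothetical unit name
[Unit]
Description=Hold the recdns anycast IP on loopback while the local cache runs
# BindsTo stops this unit (removing the IP) whenever the recursor exits or crashes
BindsTo=pdns-recursor.service
# add the IP before the recursor starts, so it can bind to it
Before=pdns-recursor.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/ip addr add 10.0.0.53/32 dev lo
ExecStop=/sbin/ip addr del 10.0.0.53/32 dev lo

[Install]
WantedBy=pdns-recursor.service
```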
[14:46:20] it's the most robust answer I have, but there are details to get right
[14:47:12] so the work for that one would be something like:
[14:47:55] 1) We still need to find/write some custom NSS that can do faster/spammier retries (optionally, it can request from parallel servers and/or have a more-advanced failover strategy in general, but that's not necessary for this particular solution)
[14:48:16] 2) We anycast across the per-DC nameservers for a reliable single resolver IP
[14:48:53] 3) We stick that IP in resolv.conf in case of $random_software parsing resolv.conf, and also configure it as the backend for the custom NSS above
[14:49:53] 4) As an optional final step to improve performance and reduce load on the anycast servers, we can create a configuration of local pdns_recursor which can reliably bind and create the anycast on the loopback and backend to the real IPs of the real anycast servers (and reliably not block access to the remote servers when it fails...)
[14:51:44] or we kill (4) and go the other route:
[14:52:24] 10Traffic, 10Operations, 10ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3470350 (10Cmjohnson) @ema is it okay to take this down..most of the time the server needs a re-install after swapping /dev/sda will this be okay?
[14:53:00] 4) Create a local cache on 127.0.0.53 that backends to the anycast IP. Have the NSS module configured with 127.0.0.53->10.0.0.53 with a very fast failover (e.g. if localhost doesn't respond in 5ms, query remote)
[14:53:43] but then we have no control over the behavior of software that parses resolv.conf for itself and has its own behaviors, and it wouldn't have a great way to handle the failover. whereas the previous solution handles that case by making the singular IP super-reliable.
[14:54:29] 10Traffic, 10Operations, 10ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3470368 (10ema) >>! In T171028#3470350, @Cmjohnson wrote: > @ema is it okay to take this down..most of the time the server needs a re-install after swapping /dev/sda will this be okay? @Cmjohnson: yes....
[15:00:14] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3470414 (10Fjalapeno) @fgiunchedi thanks for the info… I'm worki...
[15:26:29] just to add to the picture, we could also have a second anycast IP, that is not announced locally and is the one used as a backend of the local pdns_recursor in (4) :)
[15:27:56] good thought! :)
[15:29:44] for that matter, I'm not sure if there are router-level issues that make it desirable to have redundant anycast IPs, either
[15:30:36] (but I don't think so?)
[15:32:02] so yeah we could make 10.0.0.53 the primary internal anycast for dns resolution, and also have 10.0.0.52 set up similarly. 10.0.0.53 is what goes in resolv.conf (and similar for other NSS modules), and what a local cache would bind to if it wants to intercept (and then forward to .52)
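Concretely, step (4) with the second anycast IP might look something like this in pdns_recursor terms (a sketch under the assumptions above, not a vetted config; the allow-from range in particular would need adjusting to real client source addresses):

```
# /etc/powerdns/recursor.conf -- local cache intercepting the primary anycast IP
# bind the primary anycast IP, present on lo only while this daemon runs
local-address=10.0.0.53
allow-from=127.0.0.0/8, 10.0.0.0/8
# backend everything to the secondary anycast IP, which this host never
# announces locally, so forwarded queries always reach the real DC recursors
forward-zones-recurse=.=10.0.0.52
```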
[15:32:21] yeah, something like that
[15:35:11] I wonder, with the ip_nonlocal_bind sysctl set to allow binding sockets to arbitrary IPs that don't exist on interfaces
[15:35:26] if that's enough for the local traffic to reach the cache and we don't need the IP actually configured on loopback
[15:35:42] that would make the takeover much simpler
[15:36:16] (at the cost of having ip_nonlocal_bind set on virtually everything, which is a bit of a risk since it can hide misconfigurations and let them "work" where they'd otherwise fail)
[15:36:30] I wish there was a sockopt for nonlocal bind. maybe there is?
[15:38:32] there is I think
[15:38:42] but why do you want to put the recursor on a separate IP again?
[15:44:21] paravoid: what did you mean? I can parse that question multiple ways
[15:46:08] IP_TRANSPARENT might be the per-socket equivalent of ip_nonlocal_bind, although the documentation talks explicitly about a separate use-case, but still...
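A sketch of what that per-socket option looks like in practice (a hypothetical standalone test, not anyone's production code; needs CAP_NET_ADMIN, and whether purely local traffic would then actually route to the socket is exactly the untested question discussed below):

```c
/* bind a UDP socket to the anycast IP without it being on any interface */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    struct sockaddr_in sa;

    /* per-socket equivalent of the ip_nonlocal_bind sysctl (IPPROTO_IP == SOL_IP) */
    if (setsockopt(fd, IPPROTO_IP, IP_TRANSPARENT, &one, sizeof one) < 0)
        perror("setsockopt(IP_TRANSPARENT)");

    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(53);
    inet_pton(AF_INET, "10.0.0.53", &sa.sin_addr); /* anycast IP from the discussion */

    /* without IP_TRANSPARENT or nonlocal_bind this fails with EADDRNOTAVAIL */
    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0)
        perror("bind");
    return 0;
}
```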
[15:46:34] why not have the recursor listen on 127.0.0.1:53?
[15:47:50] if we put a host-local cache on 127.0.0.1:53, that cache will occasionally at least need quick restarts (if not fail completely). where/how do we configure the fallback to the remote caches in a way that doesn't have some negative tradeoff?
[15:49:07] if we're using glibc's standard nss-dns->resolv.conf for gethostbyname(), it would have to be a timeout-based failover (slow)
[15:49:39] if we get around it with an NSS replacement that hits the backup in parallel, in the common case we're pointlessly spamming the remote caches when the answers are almost always arriving first from the local cache anyway.
[15:50:07] having the local cache take over the anycast IP is a way for it to step into the stream and back out again seamlessly
[15:56:06] I'm not sure if adding/removing an IP and flushing route caches is going to be much faster
[15:56:46] ?
[15:57:29] adding/removing an IP and flushing route caches can be asynchronous from the flow of resolved queries, though
[15:57:47] the glibc timeout->failover is within the flow of resolved queries, they all stall
[15:58:41] on normal maintenance I guess the time doesn't count as long as the requests all go through, either locally or remotely
[15:58:47] with the anycast takeover, at some point in time when the route change operation is complete, the stream of query packets is redirected from a local socket to a remote one, or vice-versa
[15:59:30] plus the net result (outside of the config of pdns_recursor) looks simpler from all other perspectives
[16:00:00] all dns clients use this one IP. you put that in resolv.conf or similar and you're done. no failover strategies to consider, because failover strategies live behind this IP.
[16:00:19] or are you talking about failure?
[16:00:39] I suspect the case he's talking about is the pdns_recursor crashing
[16:01:08] in which case I'd still think the process dying + socket teardown resulting in re-routing packets to remote should be a pretty slim window
[16:01:28] yes, that's the case I'm talking about
[16:01:41] you'd need to remove the IP from the loopback interface and possibly flush the route cache
[16:01:58] if nonlocal_bind or IP_TRANSPARENT works, I don't think we have to touch the loopback
[16:02:32] nonlocal_bind means that pdns_recursor can bind to an ip that isn't assigned to an interface yet, AIUI
[16:02:50] yeah, the untested question is whether local traffic to that bound socket would route to it without an interface
[16:02:56] no it wouldn't
[16:03:14] hmmm ok
[16:03:22] 99% sure it wouldn't
[16:03:40] well let's assume it doesn't then for now, and test later
[16:04:02] we still need a modified NSS with any scheme I can think of
[16:04:34] given a resolv.conf (or other-nss.conf) with 127.0.0.1:53 + 10.0.0.53:53 configured in it somehow for local and remote caches, and control over the NSS code, what would your ideal behavior be?
[16:05:02] (I was thinking a configurable short timeout on the first before fallback to the second one with aggressively-short re-sends)
[16:13:39] it's tricky to think about it in generic terms... as in how you would make the new NSS module configurable for these kinds of behaviors in general
[16:13:51] hardcoding just what we want today would be awful
[16:16:43] nameservers=[127.0.0.1:53,timeout=5ms,resends=0; 10.0.0.53:53,timeout=70ms,resends=1; [192.0.2.1,192.0.2.2],timeout=35ms,resends=3;]
[16:17:36] ^ send to first NS, if no answer in 5ms send to the second NS. if no answer in 70ms, send again to that second NS. If no answer in a further 70ms, send in parallel to both of the final IPs, with 35ms timeouts, up to 4x total sends to them, then cycle back to the top of the list
[16:18:01] timeout=2s; # overall timeout at which point all processing aborts and returns a failure
[16:18:01] (in meeting)
[16:18:06] (me too :)
[16:19:23] and of course in this hypothetical NSS's behavior, all outstanding sent packets can still get answers. we might be 3-4 sent packets into the process above and then the first answer to arrive is a response to the first one. So we keep those sockets open for late responses and take the first that arrives.
[16:20:20] the above misses the option (at the outer scope and/or for the pair at the end?) to do some kind of round-robin per-query instead of linearly failing through a list, too
[16:20:35] but very little of that matters if we have working anycast, for our case
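To make that staged behavior concrete, a sketch in C of the send/poll loop described above (illustrative names only, since no such module exists yet; two stages, with resends and the overall 2s abort left out for brevity):

```c
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

#define NSTAGES 2

/* stage timeouts: 5ms on the local cache, then 70ms on the DC-local anycast */
static const int stage_timeout_ms[NSTAGES] = { 5, 70 };

/* fds[] are connected UDP sockets, one per server; query is a prebuilt DNS
 * packet. Returns the length of the first answer to arrive, or -1. */
ssize_t race_resolve(int fds[NSTAGES], const uint8_t *query, size_t qlen,
                     uint8_t *answer, size_t alen)
{
    struct pollfd pfds[NSTAGES];
    int sent = 0;

    for (int stage = 0; stage < NSTAGES; stage++) {
        /* escalate: fire the query at the next server in the list */
        send(fds[stage], query, qlen, 0);
        sent++;

        for (int i = 0; i < sent; i++) {
            pfds[i].fd = fds[i];
            pfds[i].events = POLLIN;
            pfds[i].revents = 0;
        }

        /* earlier sockets stay in the poll set, so a late answer from the
         * local cache still wins even after we've escalated to the remote */
        if (poll(pfds, sent, stage_timeout_ms[stage]) > 0)
            for (int i = 0; i < sent; i++)
                if (pfds[i].revents & POLLIN)
                    return recv(fds[i], answer, alen, 0);
    }
    return -1; /* a real module would cycle back and honor the overall timeout */
}
```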
[16:41:42] bblack: the nginx problem that you ran into before your vacation was caused by a few jessie hosts still running an nginx not linked against openssl 1.1 (only builds >= 1.11.4 do that). the last two remaining jessie nginx hosts (notebook hosts) were just upgraded, so your puppet patch should now work fine
[16:42:28] 10Traffic, 10DBA, 10Operations: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3187493 (10jcrespo)
[16:42:48] moritzm: yeah, that and labs
[16:43:09] moritzm: (and the puppetdb hosts were in the same boat too, which broke all of puppet. but I upgraded those on the spot because at that point it was kind of chicken and egg)
[16:44:01] yeah, not sure what we can do about labs, all hosts should run unattended-upgrades, so in theory they should all be on 1.11.10, but there will certainly be some exceptions where people disabled unattended-upgrades
[16:49:54] back on the ideal NSS module configurability, it probably needs some kind of recursive groupings
[16:50:38] perhaps at the bottom-most level you can define NS sets (that use actual IPs), and give each one a timeout/resend param, and a "strategy" of linear, random, parallel, or chash.
[16:51:25] and then you can build on top of that recursively upwards, sets of sets which just have a strategy. e.g. metaset1=[set1, set2],strategy=linear
[16:51:41] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3470841 (10Fjalapeno)
[16:51:42] metaset2=[metaset1,set3],strategy=random
[16:52:30] etc
[16:52:53] that would give you the full amount of flexibility to adapt this advanced nss-dns variant to whatever scenario and make it resilient and fast
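Put together, a config in that recursive style might read like this (syntax invented here purely to illustrate the set/metaset idea; nothing of the sort exists yet):

```
# hypothetical libnss config -- concrete NS sets at the bottom, strategies composed upward
set local   = [127.0.0.1:53],                timeout=5ms,  resends=0, strategy=linear
set dc      = [10.0.0.53:53],                timeout=70ms, resends=1, strategy=linear
set remote  = [192.0.2.1:53, 192.0.2.2:53],  timeout=35ms, resends=3, strategy=parallel
metaset all = [local, dc, remote], strategy=linear
timeout=2s  # overall abort, after which the lookup returns failure
```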
[16:54:12] I think the main counter-argument against this path (advanced NSS work + separate localhost listener for local cache, as opposed to making "local cache listens on anycast" work somehow) is that it *only* works for clients that use the standard gethostbyname() type interfaces into glibc NSS
[16:54:59] if some software, for better or worse, loads its own DNS client library and tries to parse NS IPs from /etc/resolv.conf or whatever, it will be lacking this sort of advanced feature stuff
[16:57:00] if we have a strategy that can make a single IP address very resilient, it seems better for those clients.
[16:57:24] I don't know how many of those clients we even have tbh
[16:58:19] I guess you could write some kind of simple-but-strange forwarding-only daemon, which listens on 127.0.0.1:53 and answers by executing gethostbyname()-style calls. but then it would lack TTL information to hand to its clients.
[16:59:03] that it's so hard to come up with an appropriate universal answer speaks to how many design problems there are with the existing practices and interfaces in this space :P
[16:59:35] which makes me sympathetic with whoever decided systemd-resolved was a good idea. but as with all things systemd, it only solves the problem from their perspective, not from everyone else's...
[17:02:15] maybe the most-correct way to architect this solution (but maybe way outside the time we have to spend on it) would be more like:
[17:03:16] nevermind, that idea failed before I got to the end of the first line
[17:07:56] all I can really say for sure is what paravoid said earlier in the ticket: nothing gets us past "we need a better NSS module for hostnames than glibc provides"
[17:08:43] if we lean on anycast, and could make "local cache takes over anycast IP" work sanely and reliably, that NSS module just needs to support a single IP and fast retries (basically just like existing glibc, but with floating point seconds for the timeout argument?)
[17:09:20] if we want it to support a separate local cache and all that entails, we might as well build a fully-generic one that can handle all scenarios, as an open source project that might get traction with others in similar situations
[17:10:51] the generic NSS module I guess could have a direct local listener of its own, just for the illegitimate-resolv.conf-parser cases
[17:11:07] (or as a separate little forwarding daemon with a shm interface to the NSS module's code)
[17:12:25] libnss-dns-sucks
[17:14:10] oh wait, underscores. and also, it would be horrible if distros for desktops started trying to use it or something; we should make it sound like it's only for datacenter-level operations
[17:15:20] libnss_dns_dcops? libnss_dns_scary? libnss_dns_troublemaker?
[17:15:41] something to discourage home users from configuring it commonly to spam public or ISP recursors with 5ms retries or whatever :P
[17:18:09] libnss_dcdns "NSS Module for Advanced DNS Client Configurations in Datacenter Scenarios"
[17:42:32] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3471226 (10Cmjohnson) The bios update that I have has failed to install....looking at another solution.
[17:42:35] 10Wikimedia-Apache-configuration, 10Operations, 10Mobile, 10Puppet, 10Reading-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3471227 (10Jdlrobson)
[17:46:32] 10Wikimedia-Apache-configuration, 10Operations, 10Mobile, 10Puppet, 10Reading-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3471281 (10Jdlrobson)
[19:11:25] 10Traffic, 10Operations, 10ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3471853 (10Cmjohnson) @ema can you verify the host name for me please. cp1008 was decom'd a long time ago.
[19:12:36] 10Traffic, 10Operations, 10ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3451439 (10BBlack) It was decommed a long time ago, and then I revived it as a quasi-production testing machine for "temporary" use for a little while, and probably poorly documented that, and now "tempo...