[01:40:19] 10Domains, 10Traffic, 10Operations, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) Thanks for additional comments. The old ticket T84200 (currently private because imported from old ticket system (RT) and contained personal email from...
[01:41:17] 10Domains, 10Traffic, 10Operations, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) p:05Triage>03Normal
[11:22:49] 10Traffic, 10Operations: puppetize http purging for ATS backends - https://phabricator.wikimedia.org/T204208 (10ema)
[11:22:59] 10Traffic, 10Operations: puppetize http purging for ATS backends - https://phabricator.wikimedia.org/T204208 (10ema) p:05Triage>03Normal
[11:33:13] 10Traffic, 10Operations: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (10ema)
[11:33:28] 10Traffic, 10Operations: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (10ema) p:05Triage>03Normal
[11:35:08] bblack, I realised yesterday that I don't think we have any mechanism for checking when the config for a cert changes
[11:35:18] i.e., say you add an extra SAN to an existing cert
[11:35:39] I don't think it will notice that and reissue
[11:36:41] on the other hand I wonder if it's easier to just make a new certificate entry in the config in that case and move clients over
[12:00:29] 10Traffic, 10MediaWiki-ResourceLoader, 10Operations, 10Performance-Team: Investigate source of 404 Not Found responses from load.php - https://phabricator.wikimedia.org/T202479 (10ema) >>! In T202479#4575922, @Krinkle wrote: > 2. Hostnames we route to text-lb that Varnish doesn't recognise (receives varnis...
[12:15:17] 10Traffic, 10Operations: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 (10ema)
[12:15:24] 10Traffic, 10Operations: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 (10ema) p:05Triage>03Normal
[12:40:58] 10Traffic, 10Operations: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (10ema)
[12:41:24] 10Traffic, 10Operations: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (10ema) p:05Triage>03Normal
[12:41:49] 10Traffic, 10Operations: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (10ema)
[13:28:16] Krenair: I think valentin left some note somewhere, about periodically restarting or hupping it for now, or something
[13:28:46] ah yes, in an email:
[13:28:48] "The current code only checks the certificates' status when the configuration is loaded. Right now this happens when certcentral is started and every time a SIGHUP is received. So an easy hack could be setting up a small cronjob sending a kill -SIGHUP to the certcentral process every X hours to be sure that the certificates are being checked and renewed when needed."
[13:29:12] I guess that was more about renewals though
[13:29:45] I think eventually we'll want the runtime to monitor both such things, but it can be added later
[13:30:03] (FS watchers for config changes, and running long-scale timers ticking down to expiry-related actions)
[13:30:52] yeah that's a different thing though
[13:31:07] that's talking about the checks for expiry/renewals
[13:31:27] I don't think we have checks for 'oh, we have a cert with this ID but it has a different set of SANs/CN to the configured one'
[13:32:49] what's the ID? just an invented label?
[13:33:10] 10Traffic, 10Operations, 10fixcopyright.wikimedia.org, 10MW-1.32-release-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), 10Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (10CCicalese_WMF) 05Open>03Resolved a:03CCicalese_WMF I'm going...
[13:34:20] bblack, e
[13:34:23] oops
[13:34:24] bblack, yeah
[13:34:58] e.g. in https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/458554/8/debian/config.example.yaml
[13:35:00] 'testing'
[13:35:59] if you change the SAN/CN I don't think it will notice
[13:36:23] yeah that should probably be fixed
[13:36:49] can compare the configured list to the list in the cert (sort both since order doesn't matter)
[13:37:23] maybe the main CN matters though, in cases where the main CN is chosen for convenience of legacy non-SNI clients
[13:37:29] err, non-SAN clients
[13:37:58] so CN==CN' and sorted(san-list)==sorted(san-list')
[13:38:37] ema: i see you were on 1099 and did some gdnsd things, did anything break or you just left it down?
[13:39:35] bblack: hey!
[13:39:54] bblack: yes, I've seen that dpkg reported broken packages in icinga, so I've run dpkg --configure
[13:40:27] oh?
[13:40:53] I don't remember apt saying anything was wrong at install-time
[13:40:57] maybe I missed it
[13:40:59] that seems to have broken things, so I've stopped gdnsd for you to look at the thing
[13:41:11] what broke?
[13:42:08] yeah, I've said "thing" twice in a sentence, not very precise! :)
[13:43:25] dpkg --configure caused a service restart, see logs at 07:05:29
[13:44:17] then I've noticed that the last log entry from yesterday was:
[13:44:22] Sep 12 18:00:12 cp1099 gdnsd[38167]: Server at pid 36748 exited after stop command
[13:44:41] so I've stopped the daemon and *that* is what failed (exit code=42)
[13:44:54] gdnsd on a cp* host?
[13:45:16] Krenair: it's cp1099, the new test host (cp1008 replacement)
[13:45:29] non-varnish stuff gets tested there?
[13:46:04] yes, see ::role::authdns::testns
[13:46:07] huh, ok
[13:46:59] ema: ok
[13:48:53] it's just more convenient than dealing with designate :P
[14:50:32] (FS watchers for config changes, and running long-scale timers ticking down to expiry-related actions)
[14:50:49] Well puppet will control the config and can notify on change
[14:51:24] do we need to do inotify stuff if we've got that?
[14:57:15] I guess not
[15:47:13] that recursor list is huge :)
[15:47:27] but the CSV version includes their reliability percentage and stuff too
[15:47:33] I might filter down and just try the most-reliable ones first, etc
[15:47:55] 20,512 "valid" ones in total
[15:47:59] I've got a basic patch to hopefully handle changes to existing certs, atop my great big stack of commits: https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/460382/
[15:50:57] sorry I'm bogged down a bit, but it's on my list to go through them eventually!
[15:51:01] ok :)
[15:51:09] I should remember to puppetise a cron to periodically kill -SIGHUP it, and later make it regularly check these things automatically without external prompting
[15:51:38] right now I'm kind of assuming the odds of the certcentral deployment itself being ok by EOQ are better than those of successfully upgrading our prod dns to gdnsd-3.x by EOQ, so I'm more focused on that end of it.
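A minimal sketch of the check discussed above: compare the configured CN and (sorted) SAN list against what the issued certificate on disk actually contains, so a config change would trigger reissuance. This assumes the Python `cryptography` library; `cert_path` and the config values are hypothetical inputs, and this is not the actual certcentral code.

```python
# Sketch only (not certcentral's real code): detect when a cert's configured
# CN / SAN list has drifted from the certificate that was actually issued.
from cryptography import x509
from cryptography.x509.oid import NameOID


def cert_matches_config(cert_path, configured_cn, configured_sans):
    with open(cert_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())

    cn_attrs = cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)
    cert_cn = cn_attrs[0].value if cn_attrs else None

    try:
        san_ext = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
        cert_sans = san_ext.value.get_values_for_type(x509.DNSName)
    except x509.ExtensionNotFound:
        cert_sans = []

    # CN must match exactly (it can matter for legacy non-SAN clients);
    # SAN order doesn't matter, so compare sorted lists.
    return cert_cn == configured_cn and sorted(cert_sans) == sorted(configured_sans)
```

If this returned False for an existing cert, the daemon could treat it the same as a cert that is due for issuance.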
[15:52:12] 10Traffic, 10Operations, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10ema)
[15:52:14] yes
[16:34:36] yeah I have a script running now on the CNAME check thing, using the ~11K list entries that claim to be reliable and last-updated during September
[16:35:53] it's made it through ~750 of them so far, with ~31 just failing to respond for some reason (will look at those deeper later, but probably not issues), and ~10 or so that failed my basic scripted dig-check so far, but all of those have been explicable when checked manually
[16:36:18] (one was simply not a recursive server at all, and the others so far have all been servers that fail when queried with edns0, which dig does by default, but worked with dig +noedns)
[16:36:33] so, probably ok, but I'll see what all the corner-cases look like at the end of the run
[16:41:36] bblack: I have a question from yesterday
[16:41:55] why double fork ?
[16:42:23] I am missing some systemd internals there maybe :)
[16:43:13] jijiki: so the new daemon (or well, someone, one of the two) eventually has to send a notification to systemd to inform it that the daemon's main pid has switched (via the systemd notify socket, with MainPID=NNNNNN)
[16:44:07] if we wait until after the old daemon has exited before the new one sends this information, there's a gap in the middle where systemd goes "oh shit mainpid died, call the whole thing dead"
[16:44:52] but if you send the new mainpid *before* the old one dies, and you've only forked once, the new mainpid's parent is the old process, not systemd's PID=1, so it calls it an "Alien MainPID", because it's not a child of systemd directly, and also fails.
[16:45:09] but when you double-fork, the second fork reparents the new daemon to PPID=1, and then the mainpid switch actually works
[16:46:52] I see
[16:47:20] so you are doing a sort of blue-green ?
[16:47:48] for the handoff between the two daemons you mean?
[16:47:57] yes
[16:48:18] technically we are starting a new one before we kill the old one, yes?
[16:48:25] yes
[16:48:36] the critical sequence there from the daemons' pov, ignoring systemd, is:
[16:49:06] the new daemon starts up in parallel and fully finishes loading data/config/etc. It requests copies of all the DNS listening sockets from the old daemon (over a control socket connection) and validates all that stuff
[16:49:43] then it spins up its own DNS listening threads on them all, at which point there's no guarantee which daemon answers a request (requests could be randomly routed to either of the two daemons).
[16:49:59] but as soon as it does that, it tells the old daemon to stop itself, which it does fairly quickly, minimizing the overlap window
[16:50:19] that's what I was going to ask next, who's answering ? :p
[16:50:32] and since the sockets were copied over via SCM_RIGHTS (as opposed to e.g. starting brand-new ones with SO_REUSEPORT for sharing), the two daemons share the kernel-level recv buffers and such too.
[16:51:09] if you opened new sockets with SO_REUSEPORT and then closed the old ones, there'd be some loss of whatever UDP reqs (or TCP SYNs) were stacked up in a kernel buffer somewhere at the time, but with SCM_RIGHTS that level is seamless.
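To make the SCM_RIGHTS handoff above concrete, here is a rough sketch of passing live listening sockets between two processes over a Unix control socket, using Python 3.9+'s socket.send_fds/recv_fds helpers. The control-socket framing is invented and gdnsd itself does this in C, but the mechanism is the same.

```python
# Sketch of handing live listening sockets from an old daemon to a new one
# over a Unix control socket via SCM_RIGHTS (Python 3.9+ helpers).
# The framing (a single "F" byte) is illustrative only.
import socket


def send_listeners(conn, listeners):
    """Old daemon side: hand the listening sockets' fds to the new daemon."""
    fds = [s.fileno() for s in listeners]
    # At least one byte of normal data must accompany the ancillary fd payload.
    socket.send_fds(conn, [b"F"], fds)


def recv_listeners(conn, count):
    """New daemon side: receive the fds and wrap them back into socket objects."""
    _, fds, _, _ = socket.recv_fds(conn, 1, count)
    # These are the *same* kernel sockets as before, so packets already queued
    # in their recv buffers are not lost (unlike opening fresh SO_REUSEPORT ones).
    return [socket.socket(fileno=fd) for fd in fds]
```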
[16:52:05] there's a design choice here you could easily go either way on (no actual constraints): you could shoot down the old daemon first before firing up the new listeners, and avoid overlap at the cost of a tiny window of unavailability
[16:52:29] or do it the other way around (what I've chosen), where there's no unavailability, but a short window where either daemon could answer live requests.
[16:54:06] * jijiki is trying to parse
[16:54:16] thank you :)
[16:54:29] if they give the same answers it's immaterial. If they don't, there's reason to pause, but since DNS changes are always effectively asynchronous (e.g. whatever code, config, zone data you're deploying as part of this smooth-replacement also has to go to all N of your separate live servers that are geographically-diverse), you can't get an atomic switch of your results from the external POV anyways
[16:54:40] so the daemon overlap is just another minor case of the same thing, seems like the right tradeoff
[16:55:02] I'm bad at writing parse-able output sometimes, followups welcome :)
[16:55:13] lol
[16:56:22] that point about "no atomic dns changes" is a lesson that's hard to learn, we've failed at it here some years ago at least once that I remember.
[16:56:57] in practice it means if some DNS result you care about involves 2x queries from cache->authserver, you should be careful to deploy the change in separate steps because it won't be atomic across all your nameservers.
[16:57:09] an example is probably easiest:
[16:57:35] say your existing data is:
[16:58:02] in wikimedia.org zone: foo.wikimedia.org CNAME foo.wiktionary.org
[16:58:16] and in wiktionary.org zone: foo.wiktionary.org A 192.0.2.1
[16:58:50] and you commit a singular DNS change to these 2x zone files, which does:
[16:59:03] foo.wikimedia.org CNAME bar.wiktionary.org; (replaces previous CNAME)
[16:59:10] delete foo.wiktionary.org
[16:59:19] add bar.wiktionary.org A 192.0.2.2
[16:59:42] and you push this single change out to all your nameservers, each of which applies it as an atomic whole change (no queries with half-updated data)
[17:00:17] since caches need to make two queries to do a cross-zone CNAME, and they may choose random members of your authdns server set for each query, and you can't update them in perfect atomic sync with each other....
[17:00:51] some caches may see the very brief view "foo.wikimedia.org CNAME foo.wiktionary.org + foo.wiktionary.org NXDOMAIN" (oops)
[17:01:21] or: "foo.wikimedia.org CNAME bar.wiktionary.org" from an updated one + "bar.wiktionary.org NXDOMAIN" (from a not-updated one)
[17:02:11] so any time you have a single logical DNS change that involves multiple layers like this, you have to step through globally deploying it in stages that keep everything sane, with an applicable TTL-time between to boot (in case of old cached records too)
[17:02:42] e.g. add the new entry for bar.wiktionary.org, wait 1x negative TTL window, then switch the CNAME, wait out the CNAME TTLs, then delete foo.wiktionary.org.
[17:03:54] if I got everything right
[17:04:16] if I were to hit the same server
[17:04:35] CNAMEs and sub-domain delegations are the only real case that causes multiple queries like that though. You don't have to think about this stuff except in those cases.
[17:04:47] I could get a foo.wikimedia.org CNAME bar.wiktionary.org from the new daemon
[17:05:05] and a bar.wiktionary.org NXDOMAIN from the one about to die?
[17:05:20] yes :)
[17:06:02] and the point is, if someone raises that as a reason not to overlap requests on daemon handoff, the logical response is "Well, you can't update all your separate nameservers atomically anyways, so you already had that problem even if gdnsd's local switch was atomic"
[17:07:05] (and yes, caches do commonly rotate multiple authservers when doing a pair of quick related queries like these)
[17:07:59] so we are playing with fractions of a second here
[17:09:07] for the daemon overlap, it should commonly be sub-second overlap.
[17:09:57] for N geographic authservers, even if you tried to make them atomic via some crazy dns-2-phase-commit thing, you'd still have to coordinate over the latency between them all.
[17:10:34] so there's always a bigger problem than the daemon overlap, and the answer is "only deploy public-facing DNS changes that make sense asynchronously, in steps"
[17:11:02] I have a different question though
[17:11:07] (and separated by the TTLs because caches could already have part of the answer (including negative answer) cached, and cache lifetimes are usually bigger than this)
[17:11:25] if we did stop/start the traditional way
[17:11:30] the whole "beat the race with 2-phase commit" sort of strategy would only work if there were zero or near-zero TTL values anyways.
[17:12:05] can we make the assumption that
[17:12:31] if someone tries to talk to our dns while it is being restarted
[17:12:39] they will get no answer
[17:12:49] that by the time they switch to tcp and retry
[17:12:53] the service would be back up ?
[17:13:54] that's maybe more or less how it would work out, from the pov of a single server. Very close to it anyways. In practice "start" takes ~2-3 seconds for our servers while they load and process geoip maps and zonefiles, etc.
[17:14:17] yeah I get that it is different when at scale
[17:14:23] but they would eventually get a retry, or more likely they'd try a different authserver IP and get an answer there
[17:15:25] you could, in theory, from the immediate-results point of view, solve the whole problem of deploying that multi-layered change, by stopping all your dns servers globally, then starting them all with the new data, and take a ~3-4 second outage of DNS responses.
[17:15:31] but you'd still get screwed by TTLs
[17:16:03] some cache would've already cached the NXDOMAIN for bar.wiktionary.org before you start your process, for even say 300 seconds (if that's your negative TTL), and then hit the new config after your 3-4 second outage and get bad results.
[17:16:21] oh
[17:16:53] all this other stuff about tighter races only matters for empty remote caches and/or very very short TTLs
[17:17:04] I see I see
[17:17:24] at the end of the day, in practice TTLs make all of this not worth pursuing. the golden rule is when you make a multi-layered change, you have to break it into smaller changes and space out the changes according to TTL clocks.
[17:17:24] :D
[17:18:10] and if you've spaced out your public-facing changes to not break within the TTL windows, then the sub-second overlap of differing-but-compatible results from a daemon is immaterial.
[17:18:20] (ditto if it takes several seconds to get all your remote authservers in sync)
[17:19:06] but it's a hard thing to grasp until you've been bitten by it and debugged it. So I'm sure I'll get a bug report about how gdnsd's overlapped restart supposedly caused an outage for someone due to mixed results.
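The staged rollout described above (add the new target, wait out the negative TTL, switch the CNAME, wait out the CNAME TTL, then delete the old target) could be sketched like this. `push_zone_change()` and the TTL values are hypothetical stand-ins; a real rollout would be driven by deployment tooling rather than a sleeping script, but the ordering and the TTL waits are the point:

```python
# Sketch of the staged rollout for the foo/bar example above.
# push_zone_change() is a hypothetical helper standing in for
# "commit the zone edit and deploy it to all authservers".
import time

NEGATIVE_TTL = 300   # example negative-caching TTL for the wiktionary.org zone
CNAME_TTL = 300      # example TTL on the foo.wikimedia.org CNAME


def staged_rollout(push_zone_change):
    # Stage 1: create the new target first, so no cache can ever see the new
    # CNAME while its target still returns NXDOMAIN.
    push_zone_change("add bar.wiktionary.org A 192.0.2.2")
    time.sleep(NEGATIVE_TTL)  # wait out any cached NXDOMAIN for bar.wiktionary.org

    # Stage 2: repoint the CNAME at the now universally-resolvable target.
    push_zone_change("foo.wikimedia.org CNAME bar.wiktionary.org")
    time.sleep(CNAME_TTL)     # wait out caches still holding the old CNAME

    # Stage 3: only now is it safe to remove the old target.
    push_zone_change("delete foo.wiktionary.org")
```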
[17:20:00] go ahead and document it so that if anyone opens that bug you can just link to the docs ;)
[17:20:18] if it is a known bug it is practically not a bug
[17:20:40] I would not call it a bug at all
[17:20:42] true! it's just hard to distill the above into something clear and concise
[17:20:54] volans: it is not a bug, but it would be filed as one :p
[17:21:01] bblack: use the IRC logs, copy/paste and you're mostly done ;)
[17:21:24] same for the blog post :-P
[17:21:25] * volans hides
[17:21:30] * jijiki curates the logs for publishing
[17:21:36] lol
[17:21:55] I think those are 2-3 20-minute talks at least
[17:22:48] it's kind of an esoteric area though. I feel like most of the opsen world's answer to these topics is "Why would I care about any of this? My cloud provider handles my DNS"
[17:23:28] but yeah, should maybe FAQ some of this out at least in gdnsd docs
[17:24:06] it could generate some fan mail for sure
[17:24:08] yeah and when their infra is down because $cloud's DNS has an outage they can just wait
[17:24:20] you definitely need to write what "replace" does
[17:24:38] along with what we were just saying
[17:27:23] the other way to look at this topic from a pragmatic point of view, though:
[17:27:52] these kinds of data races only happen on linked layers of records, which is CNAMEs, delegations, and I guess technically e.g. MX->A lookups.
[17:28:16] delegations tend to be stable, and MX->A should be handled with appropriate TTL-mitigated sets of changes
[17:28:30] but CNAMEs should just be avoided anyways in local data
[17:28:44] they're the worst kind of corner case in 10 different ways, they're one of the worst features of the DNS
[17:29:26] lol
[17:29:29] replacing CNAMEs that point around in your local data with either redundant copies of IP addresses or a template system to get rid of the redundancy is always better. Save CNAME for when you're actually referring to something external to your authority.
[17:31:27] djb's recommendation on that topic: "I recommend that all CNAME records be eliminated. DNS should have been designed without aliases."
[17:31:39] from the wonderful, if dated: http://cr.yp.to/djbdns/notes.html
[17:32:17] I was wondering when this conversation would get to djb's dns stuff
[17:32:18] ehehehe
[17:32:59] I was once a happy djb tinydns user, long ago in a far away galaxy
[17:33:11] * jijiki was a qmail admin
[17:35:25] the funniest thing I have seen about cnames is amazon's LBs
[17:35:40] where if one wants their foobar.com to point to an amazon LB
[17:35:45] well, not possible
[17:35:48] yeah :)
[17:35:56] unless one uses route53
[17:35:59] they don't want to give away stable IPs, because they'll be stuck with them forever
[17:36:30] when I was an AWS user, we never used amazon's LB and that was one of the reasons
[17:36:54] (the other being that we had other custom protocols to support aside from HTTP, and amazon would forward TCP, but without any client IP information, just reproxied and lost)
[17:37:16] a client was stuck with http://foobar.com and wanted amazon LB due to an audit (long story)
[17:37:26] I just moved them to route53 and stopped caring about it
[17:37:41] using http://foobar.com was a bad call to begin with, but anyway
[17:38:10] * jijiki dinner
[17:38:48] cya!
[17:44:18] 10netops, 10Operations, 10ops-eqiad: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 (10ayounsi) 05Open>03Resolved Lot better, thanks!
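As a toy illustration of the "template system instead of internal CNAMEs" suggestion above: generate plain A records from one shared address definition, so local data never has to chain through a CNAME. All names and addresses here are invented examples, not Wikimedia's actual zone tooling:

```python
# Toy zonefile templating: several names share one address without using
# CNAMEs between records we're authoritative for. All names/IPs are invented.
SERVICE_ADDR = {"text-lb": "192.0.2.10"}

ALIASES = {
    "foo.example.org.": "text-lb",
    "bar.example.org.": "text-lb",
}


def render_a_records(ttl=300):
    lines = []
    for name, service in sorted(ALIASES.items()):
        # Each alias becomes its own A record pointing at the shared address,
        # instead of a CNAME chain that forces an extra lookup layer.
        lines.append(f"{name} {ttl} IN A {SERVICE_ADDR[service]}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_a_records())
```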
[18:27:18] 10netops, 10Operations, 10fundraising-tech-ops: Grow frack-administration-codfw to /28 - https://phabricator.wikimedia.org/T204271 (10ayounsi) p:05Triage>03Normal
[21:32:34] 10netops, 10Operations, 10Performance-Team: Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (10ayounsi) p:05Triage>03Normal
[21:44:21] 10netops, 10Operations, 10Performance-Team: Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (10Imarlier) Sounds interesting. Keep Perf in the loop as you start to think about how to do this, and what your target geos might be.