[02:04:31] Krenair: was the issue that you didn't have permissions for the annotated tag but had it for a normal tag? [02:04:44] oh, just all tags [02:05:31] Krenair: hmm, there's an option you can pass to gbp so it doesn't require the upstream/ prefix [02:05:47] Krenair: https://gerrit.wikimedia.org/r/plugins/gitiles/integration/uprightdiff/+/debian/debian/gbp.conf#4 [02:14:33] 10Traffic, 10Operations, 10fixcopyright.wikimedia.org, 10MW-1.32-release-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), 10Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (10Legoktm) [10:29:29] legoktm, well I put that in my one :/ [10:29:37] didn't complain about lack of 0.1 tag [13:49:35] Krenair: be sure to make a release with this https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/458767/ included [14:16:48] 10netops, 10Operations: Intermittent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10faidon) Has anything happened on this? IIRC at our meetings we talked about investigating this further e.g. with the help of JTAC, and exploring whether we should disable the JunOS' DDoS pro... [14:19:27] vgutierrez: btw the script won't actually need the "_acme-challenge." prefix implied on every entry, but it's easy enough to strip it anyways. [14:20:10] ack [14:20:41] I mean, the challenge information sent by the ACMEv2 API sends the whole hostname for the TXT record [14:20:43] doesn't really matter either way I guess [14:20:50] ok [14:20:54] assuming it's going to be _acme-challenge couldn't be 100% safe [14:21:12] the RFC says so, I thought? Maybe I should re-read [14:21:58] you're right [14:22:00] "The client constructs the validation domain name by [14:22:00] prepending the label "_acme-challenge" to the domain name being [14:22:00] validated, then provisions a TXT record with the digest value under [14:22:03] that name. 
" [14:22:06] (sorry for the lame paste format) [14:22:43] yeah, and I don't think the literal "_acme-challenge" is passed around in any of the HTTP reqs/resps [14:23:29] anyways, instead of hacking it off in the script, maybe I should just upgrade the gdnsd part to ignore the leading _acme-challenge. label on the CLI, so it works either way people think to do it. [14:29:56] also, for today's issue of "Favorite Historical WTFs of the DNS", I give you https://tools.ietf.org/html/rfc1982 [14:30:33] TL;DR - the original DNS spec made the zone's SOA serial number a 32-bit unsigned integer, but then said that it "wraps" using "sequence arithmetic" and failed to actually explain what any of that meant. [14:31:21] so in 1996, they "clarified" this by inventing and entirely new stupid and whacky way to interpret these numbers as some kind of abstract sequence involving signed values, which leaves some possible values and/or step-amounts effectively unusable, etc... [14:31:42] lovely [14:32:05] I dare you to even try to understand "Serial Number Arithmetic" as defined in the RFC on the first read (or second, or third, ...) [14:33:03] I suspect it all boils down to BIND having historically used a signed integer for it in their code, and the rest is all a workaround to make that seem like a sane choice :P [14:34:03] one of the many gems from the RFC: [14:34:04] "Note that there are some pairs of values s1 and s2 for which s1 is not equal to s2, but for which s1 is neither greater than, nor less than, s2. An attempt to use these ordering operators on such pairs of values produces an undefined result. 
[14:34:08] " [14:35:10] also lol at the hubris that anyone might reuse this scheme at the end: [14:35:13] "As this defined arithmetic may be useful for purposes other than for the DNS serial number, it may be referenced as Serial Number Arithmetic from RFC1982 [14:35:16] " [14:37:22] > Note that there are some pairs of values s1 and s2 for which s1 is not equal to s2, but for which s1 is neither greater than, nor less than, s2. [14:37:37] this could be coming straight from /r/lolphp ^ [14:41:28] although it's too consistent for lolphp, s1 would need to be greater than s2 on Saturdays (but s1 == s2) [14:47:28] luckily Serial Number Arithmetic only matters if you do zone transfers :) [14:48:10] but even without transfers, of course an unsigned 32-bit integer isn't good for marking a timestamp that has decent range and is human parseable as an integer [14:49:16] our templating that faidon put together for mtime->serial in repos/dns uses decimal YYYYMMDDHH to fill up the digits, which only gives 1h precision, which is one of the most-reasonable tradeoffs you can make [14:51:16] You could do YYMMDDHHMM for one-minute precision, which won't break until the year 2043. I guess you can invent something new at that point, assuming serial consistency is irrelevant because you're not doing zone transfers. [14:51:58] but that's only 25 years away, and I might be alive to hear the complaints :P [14:52:31] DNS updates were not that often when I made that :) [14:52:52] I don't remember if I followed what was done before that project either, quite possibly it was like that before too [14:52:58] well any way you slice it, it will be approximate. there could always be a pair of back-to-back changes less than a second apart. [14:53:38] this is all re: making gdnsd generate automatic mtime-based serials if the file's value is zero. [14:54:01] it won't be consistent across servers, though [14:54:25] probably not! 
[14:55:02] even with 1H or 1D resolution and ntp-synced servers, there would be edge-cases from the distribution mechanism's timing writing files to servers a second apart at the boundary. [14:55:03] and you don't really have another way to check whether two servers serve off the same data, do you? [14:55:42] yeah :) [14:55:52] I don't think a time-based serial is really the answer for "same data" [14:56:25] once you throw out zone transfers, it's at best a sanity check for "hey something looks wrong, is the serial value really old compared to when I know I deployed a change that seems broken?" [14:56:25] not in the general case of course! [14:57:03] but given timestamp looseness and potential speed of updates, it won't be perfect if there are rapid changes no matter what. [14:57:10] if you throw out zone transfers then you don't care about it being monotonically increasing :) [14:57:15] and "same data" check can be done by whatever mechanism you're using to distribute. [14:57:17] could just invent a random magic [14:57:59] that doesn't cover it -- there's value in knowing that the DNS software actually loaded what you told it to (re)load [14:58:00] well the reason not to make it random magic, is to make it easily human-parseable so someone can say "Hey I pushed a change an hour ago, and that serial value looks like 3 days ago" [14:58:08] and it's not serving what it had loaded before for whatever reason [14:58:32] so I do think you need an in-memory thing [14:58:42] that could be a TXT with the git hash of the commit you deployed of course [14:58:58] easy and consistent and meaningful :) [14:59:04] I think it's taking validation too deep [14:59:29] you're just used to too reliable DNS software :P [14:59:49] if the server didn't load the data it was told to, then there's a serious bug in your distribution mechanism or the server, and you're going to find out when you notice the effects of a bad update and debug it and rant [15:00:30] but if there's a serious bug in 
the distro mechanism or the dns server, who's to say the serial value reflects the buggy lack of state updates accurately anyways? [15:00:33] I just think it's better if a monitoring check notices the effects for you :) [15:00:49] it's a trivial consistency check, but it's of trivial value. [15:01:12] the distribution bug could easily happen before the serial timestamp is templated into the data. [15:02:00] I've experienced first-hand bind silently refusing to load a new zonefile tbh [15:02:19] yeah, this would've caught that case [15:02:46] but it's not going to catch whole classes of bugs in the git distribution scripts and templating, etc [15:03:10] (neither would injecting the hash as a record, really, although it probably narrows the scope of the possible bugs) [15:03:26] but what's the alternative that catches all those? [15:03:49] well, philosophically, no simple mechanism is going to catch all possible bugs [15:05:00] (except, perhaps, you could wipe out a whole wide swath of buggy components by explicitly healthchecking queries against every deployed record, etc) [15:05:16] (but then you might still have a bug in whatever's managing all that data and its healthchecks, I guess?) [15:06:06] sure, and if you supported AXFRs you could do AXFRs and diff them too, and you still wouldn't catch all bugs [15:06:29] (and even if you ignore complexities around DYN records etc.) [15:07:54] right [15:08:02] my POV is that "is the SOA serial matching between our N servers" or "is the TXT record with the git commit matching between our N servers" are super easy to write and can catch /some/ stuff, so why not :) [15:08:28] how do you automate SOA serial matching though? the present system still has edge cases. [15:08:51] the TXT record would work for that. [15:09:29] I haven't worked on it in many years now, but IIRC the edge cases you mentioned don't exist right now (but others may!) 
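The "super easy to write" cross-server check discussed above reduces to plain comparison logic once each server's answer (SOA serial, or the githash TXT payload) has been fetched; a minimal sketch of just that part, with the actual DNS queries (dig/dnspython against each authdns box) deliberately left out:

```python
from collections import Counter

def find_outliers(answers):
    """answers: mapping of server name -> the SOA serial (or TXT payload)
    that server returned for the zone. Returns the set of servers whose
    answer disagrees with the most common one."""
    if not answers:
        return set()
    majority, _count = Counter(answers.values()).most_common(1)[0]
    return {srv for srv, val in answers.items() if val != majority}
```

A monitoring check built on this would alert whenever the returned set is non-empty (and, for time-based serials, could additionally flag a value that looks implausibly old compared to the last known deploy).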
[15:09:41] because I think the serial is injected during the template->zone generation [15:09:55] which happens once and then its output is distributed [15:10:08] I thought each server pulled git and templated separately? [15:10:17] maybe? I don't remember :( [15:10:19] now I have to go look [15:10:50] yeah, each server pulls git and templates separately, locally [15:10:54] looking at it now too, I think you're actually right [15:11:21] but if the serial-mtime templating were replaced with TXT githash templating, that would close that loophole [15:11:26] indeed [15:11:27] also [15:11:45] you could generate the serial in authdns-update, then call authdns-local-update (and in turn authdns-gen-zones) with the serial in $1 [15:11:56] sure [15:11:57] that's an easy fix, despite all the bash there :P [15:12:14] really that code is up for being rewritten in a higher-level language [15:12:19] I regret not doing it better the first time [15:12:32] the reason to even push against it, is I think once we have working INCLUDES with zone name refs, this serial/githash mechanism would be the only reason to template at all, vs just deploying static files. [15:13:08] and it seems like a lot to template/rewrite the files just to inject the check [15:13:26] oh hadn't realized that at all [15:13:42] INCLUDEs with zone name refs being what? [15:14:03] well, I added back $INCLUDE in general, because we'll need it if nothing else for netbox stuff to not get into templating insanity, etc... [15:14:24] but I then looked at our possible current $INCLUDE use-cases, like the sub-parts of the wmnet zone [15:14:49] well, maybe wmnet is a bad example, that one's static [15:15:31] but anyways, if you're going to have softlinked zones (e.g. 
foo.org -> wikipedia.org), you need everything inside to be zone-relative, not absolute [15:15:42] IIRC (and again, haven't looked at it in 4-5 years, everything I say take it with a grain of salt) [15:15:46] and then if you do "$ORIGIN subzone", you're stuck there and can't back out [15:15:55] the most complex templating need was the language macro [15:16:01] which is why we need a templated {{ zonename }} [15:16:53] which was that we had huge language lists that applied to all kinds of projects, and multiple times (one for plain, one for mdot, and one for zero in the past?) [15:16:54] but then, the places where we do use {{ zonename }} today are mostly for convenience, we don't actually use them in langlist or other tricky cases [15:17:30] I guess you could generate that once in a relative sub-include and take the hit for adding each language twice [15:17:33] that's not a big hit [15:17:52] fair! [15:18:00] think we still have zero? [15:18:02] anyways [15:18:19] otoh [15:18:27] there's still a ton of repetition and manual stuff in our zonefiles [15:18:34] the point is, when I added back $INCLUDE, I also added macros @F and @Z for file-level and zone-level original origin, which allow relative names to jump in and out of scopes [15:18:35] that we should probably generate/template further [15:18:43] including the netbox stuff :) [15:19:00] @Z is basically {{ zonename }}, and @F is whatever the origin was when the current includefile started (same as @Z in the original file) [15:19:14] e.g. our NS or MX records are being repeated a hundred times [15:19:36] https://gerrit.wikimedia.org/r/#/c/operations/dns/+/223059/ [15:19:39] hey that's just 3 years old :P [15:19:56] so you can do things like "$ORIGIN foo.@F .... $ORIGIN bar.@F ...." 
and have foo and bar as peers, without explicitly specifying the zone name (symlinked zonefiles) [15:20:34] but basically the idea was that you could say {% extends "wikiproject" %} and then just use the defaults and override them only where necessary [15:20:45] yeah [15:20:49] that's the completely opposite direction [15:20:58] of templating the shit out of it :) [15:21:05] :) [15:21:37] I think I pushed that because I was annoyed I had to fix how many MX records or something [15:21:42] working on mail stuff at the time? [15:21:45] netbox will necessarily be generated by $something anyways, but it doesn't have to be the general jinja mechanism used at the authdns servers themselves [15:22:21] we could also move the templating into the git repo too, but it's ewww on a different level [15:22:25] nod [15:22:29] (do the templating on commit, I mean) [15:22:30] heh [15:22:46] I think the fact that each server does its own templating is a mistake in retrospect [15:22:57] kinda like redirects.dat, which I think we cleaned up to not do that way anymore [15:22:59] rather than do it once and distribute the output [15:23:05] we did! [15:23:38] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/357733/ [15:24:09] having one authserver create and distribute to all could be viewed as an extra SPOF though, vs them all pulling from a central git and updating locally. [15:24:33] what if the one you start from is faulty (bad update to a python module, or borked RAM or whatever) and pushes what could've been a local mistake to all? [15:24:45] there's a review step [15:25:12] so it's generate -> diff/check/confirm -> deploy [15:25:21] if we decide to keep it that is :) [15:26:46] unless the local fault happens post-confirm! 
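Going back to the @F/@Z origin macros described above, a hypothetical zonefile fragment putting them together (the macro names and semantics come straight from this discussion; the exact syntax should be checked against the gdnsd 3.x zonefile documentation before relying on it):

```text
; Parent zonefile, possibly symlinked as both wikipedia.org and foo.org,
; so nothing below may hard-code the zone name.
$INCLUDE langlist.inc          ; entered with origin = @Z (the zone apex)

; --- inside langlist.inc ---
; @F is whatever the origin was when this include file started:
$ORIGIN foo.@F
www   300 IN A 192.0.2.1       ; www.foo.<zone>
$ORIGIN bar.@F                 ; backs out: bar is a peer of foo, not bar.foo
www   300 IN A 192.0.2.2       ; www.bar.<zone>
```

The point being made in-channel: with plain "$ORIGIN subzone" you can only descend, whereas the @F/@Z anchors let relative names jump back out of a scope, which is what makes symlinked/softlinked zonefiles workable without templating.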
:) [15:27:01] heh [15:27:15] so yeah, it's all imperfect anyways, you're just trying to catch the most-obvious possible stupidity [15:27:29] always :) [15:27:50] but I'd argue that "dns server fails to reload zonefiles when it's told to" is a pretty rare stupidity [15:28:04] but a set of scripts pulling git clones and running templating engines, maybe. [15:28:06] certainly is with gdnsd :) [15:28:18] indeed [15:28:31] on a number of servers that may go up/down/disconnected too [15:28:37] a set of dns servers that just pulls a static git repo and tells the server to reload, maybe far less so [15:28:48] but yeah, there's the "offline dns server" problem too [15:28:51] /happens to be offline when an update is pushed [15:28:56] none of this is stateful in that case [15:29:26] and we don't have transactions/rollback capabilities in case one of the updates fail [15:29:29] (maybe the authdns systemd unit should do an authdns-local-update --force on start? or some one-shot unit on reboot can do it before the dns server starts, but not on restarts?) [15:31:02] yeah we could [15:31:13] all of these could stand to be better-engineered [15:31:16] when this was made systemd wasn't even in the horizon :P [15:31:55] yeah I think it's time for revamping all that [15:32:00] and get rid of all the bash scripts [15:32:11] we could write a mysql backend for gdnsd and have it pull from replicas using a schema with a serial number for transaction on the data! 
:) [15:32:16] especially if we're about to double the number of auth servers for instance [15:32:26] more than double [15:32:29] or do more and more complex things [15:32:54] the plan is to eventually go from the current 3 to 10 (+2 every time we add another cache dc) [15:33:17] tbh my sense is that we won't be able to get rid of a pre-deployment generation step so might just as well keep the jinja stuff (and do more with it) [15:33:21] but I don't care that much :) [15:33:26] sure [15:33:52] the discovery stuff do their own thing to generate fragments too, no? [15:34:12] ugh don't get me started on that, I hate the way the current discovery stuff is structured [15:34:16] haha [15:34:17] ok :) [15:34:30] (hence the ancient contentious commit that moved all the templating off the dns repo and over to the puppet repo) [15:34:49] and I think I've argued about this before, but I'd probably split into multiple classes of servers as well [15:35:06] especially as we grow the DNS infrastructure _and_ we decide to do smarter things about the stuff we serve [15:35:13] the core problem with the current discovery stuff is that for every new service, there's matching changes to the gdnsd config and the zonefiles that must go together, but the config is templated from puppet and the zonefiles from repos/dns [15:35:46] and honestly, as we grow the team [15:35:49] even I can't remember which order you commit/deploy them anymore to make a new one deploy sanely and pass CI checks [15:36:22] DC Ops need to be able to add/remove servers in mgmt and whatnot, no reason for this to be the same repository/config/server etc. with en.wp.org [15:36:25] we already have at least one such DNS split with labs running different software [15:37:25] and having to update a large amount of edge-servers all over the world that are potentially anycasted etc. 
to change an iDRAC's IP [15:37:40] if it weren't for the magic in discovery-dns being in .wmnet, it could be logical and clear to move wmnet and all the revdns stuff to a separate repo that's more dcops/netbox focused, and possibly a separate server implementation that's more general-purpose than gdnsd and does DDNS and other whacky things, but maybe not the geoip bits as well. [15:38:14] nod [15:38:27] and whatever the replacement is you got for wikimedia.org to split that from the public view [15:38:30] wikimedia.net? [15:38:32] yes [15:38:48] but it's immaterial, can always grab anything else we want to [15:38:51] something under .wiki for instance :P [15:38:58] you're such a troll :P [15:39:01] :D :D [15:39:35] we could always move or delegate discovery.wmnet anyways [15:39:41] indeed [15:39:52] to where though? third cluster? :) [15:39:59] or back to the first one? [15:40:04] arguably if wmnet and wikimedia.net and revdns go to this other dcops/netbox -level service.... [15:40:07] is the latter even going to work? I guess not? [15:40:28] .svc. is also something that doesn't belong with the individual servers [15:40:31] discovery.wmnet and svc.$dcname.wmnet and such should move elsewhere [15:40:32] yeah [15:40:35] but honestly I haven't thought about it all that much [15:40:51] .wmsvc or whatever [15:40:59] services.wiki! 
[15:41:01] :P :P :P [15:41:48] really .wiki is super nice, but only if we had bought the whole gtld ourselves [15:42:01] yeah for the language stuff indeed [15:42:03] (aside from my general dislike of the gtld scam) [15:42:32] we can always buy .wikipedia [15:42:50] but what I really hate, even accepting gtlds in general, is the idea that we'd deploy names within .wiki alongside the ones the .wiki owner is selling to others and effectively lend our legitimacy to them and cause brand confusion, etc [15:43:01] fair fair [15:43:05] and sorry, i was trolling [15:43:17] unnecessary diversion :P [15:43:49] yeah we're like 20 levels off to the side of where I started when I pasted the RFC funny. Which was "do I sneak in one last feature commit for gdnsd-3.x, which is automatic mtime-based serials?" [15:44:09] btw [15:44:17] with automatic mtime-based serials [15:44:45] and pre-generation of templates and something like rsync -a [15:44:50] the servers will be consistent [15:45:02] so that argument doesn't exist anymore either! [15:46:30] brb [15:47:12] well, assuming ntp works I guess [15:47:34] now I'd really have to think about that, maybe ntp doesn't matter if rsync sets the fs timestamps consistently anyways [15:48:06] anyways, I think I'll push it anyways. it's an easy optional feature, don't set serial=0 if you don't want to use it [15:48:27] and I'll use the 2-digit years so it gets 1-minute resolution, users in 2043 can be damned [16:02:46] and in unrelated wtf news today: https://bugs.chromium.org/p/chromium/issues/detail?id=881410 [16:03:21] (chrome decides when you visit "www.example.com", it should just show "example.com" in the URL bar, hilarity ensues) [16:04:48] note that at least some variants of this will strip m-dot too, for us :P [16:05:04] it's not clear to me which variants might be live in which canary/beta/whatever versions [16:05:07] fun! 
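The automatic mtime-based serial with 2-digit years decided on above (decimal YYMMDDHHMM, 1-minute resolution) boils down to a few lines; a sketch of just the arithmetic, outside of gdnsd itself:

```python
import os
import time

def mtime_serial(path):
    """Decimal YYMMDDHHMM serial derived from a zonefile's mtime (UTC).
    One-minute resolution; overflows the unsigned 32-bit serial space
    (max 4294967295) in January 2043."""
    t = time.gmtime(os.stat(path).st_mtime)
    return ((t.tm_year % 100) * 10**8 + t.tm_mon * 10**6 +
            t.tm_mday * 10**4 + t.tm_hour * 100 + t.tm_min)
```

For example, a zonefile last modified 2018-09-07 14:30 UTC yields 1809071430. This also illustrates the rsync point above: since the value depends only on the file's mtime, servers agree as long as the distribution mechanism (e.g. rsync -a) preserves timestamps, independent of each server's own clock.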
[16:05:29] (note it doesn't actually change the domain being visited, it just confuses the user visually) [16:06:50] 10netops, 10Operations, 10monitoring: Add virtual chassis port status alerting - https://phabricator.wikimedia.org/T201097 (10ayounsi) {F25691134} Putting the script here the time I send a Gerrit CR. It uses snimpy and the required MIBs can be obtained on https://apps.juniper.net/mib-explorer/index.jsp > m... [16:08:05] damn, the "www" hiding is really annoying. thanks for the bug link. also https://www.wired.com/story/google-wants-to-kill-the-url/ [16:10:03] bblack, it does the same for 'm' [16:10:14] So people visiting en.m.wikipedia.org are going to see en.wikipedia.org [16:10:28] oh you saw that [16:10:32] should've read further [16:11:06] they're doing all this as of Chrome 69 and it seems getting rid of the 'm' one in Chrome 70 [16:11:41] paravoid: btw, sorry ahead of time, but packaging updates for gdnsd-3.x may not be trivial. I'm probably going to package up a beta pre-release or whatever locally just to see what it takes to make it work for our servers, and then go from there. [16:12:28] (and then at some later time when you've made a real upstream debian package, we can backport to stretch) [16:12:40] ok [16:13:10] lmk ;) [16:24:17] is it because of systemd stuff or? [16:24:35] also, I think you had some configuration breaking changes in there right? [16:24:40] I wonder how I should handle these :) [16:27:28] also just saw that chromium bug [16:27:30] wtf [16:27:52] the best part is that they're not removing it just from the beginning [16:28:08] or just once [16:28:19] "subdomain.www.domain.com" displays as "subdomain.domain.com". [16:28:27] Here's another example where it goes very wrong. The site "www.m.www.m.example.com" should not show up as "example.com". [16:28:30] holy shit [16:28:33] seriously [16:28:48] and with a green SSL icon potentially as well I think? 
[16:29:54] 10Traffic, 10Operations, 10Patch-For-Review: certcentral: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (10Brandon) I don't really know what cmd or "DNS zone update" means... I barely use phabricator lol. Is this issue related to safari/iOS auto-... [16:31:24] 10Traffic, 10Operations, 10Patch-For-Review: certcentral: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (10Krenair) >>! In T203678#4566976, @Brandon wrote: > I don't really know what cmd or "DNS zone update" means... I barely use phabricator lol.... [16:34:59] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10Krenair) [16:35:09] 10Traffic, 10Operations, 10Patch-For-Review: certcentral: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (10Krenair) 05Open>03Resolved [16:45:13] 10netops, 10Operations, 10cloud-services-team, 10Patch-For-Review: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (10ayounsi) 05Open>03Resolved Thanks, this has been useful, especially running a packet capture on the working vs. non working flows.... 
[16:45:35] 10netops, 10Operations, 10cloud-services-team, 10Patch-For-Review: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (10ayounsi) a:05RobH>03ayounsi [17:30:29] managed to break git-buildpackage [17:30:38] alex@alex-laptop:~/Development/Wikimedia/Operations-Software-Certcentral (review/alex_monk/458554)$ gbp buildpackage --git-ignore-new [17:30:39] gbp:error: 0.1 is not a valid treeish [17:30:56] if I `git show upstream/0.1` that's okay [17:31:11] If I tag 0.1 that's okay too [17:31:20] oh right, I bet it's that gbp.conf missing the upstream/ bit [17:32:03] actually I think I'll just dump the upstream/ part of the tag name [19:32:51] paravoid: re gdnsd-3.x packaging: I've minimized the config fallout (unless they happen to use a couple of select params set to a couple of unlikely and now-illegal values). and there won't be serious zonefile compat issues (again, unless you're already doing something highly unlikely and probably dumb). I don't know that we can insulate users from whatever fallouts remain in those areas. [19:33:08] paravoid: (but it shouldn't be hard to update any example config to avoid them, if it ever used them) [19:34:09] paravoid: more worried about the fact that the CLI stuff changes in incompatible and dramatic ways. The systemd unit file will change dramatically, the way things are stopped/started/restarted/reloaded changes, the underlying daemon has different CLI flags, etc... [19:34:45] paravoid: and new to the scene (not sure how much it affects packaging) is using the rundir (e.g. /run/gdnsd/) for a control socket, and the new gdnsdctl binary as well. [19:36:08] but the longer-term upshot is non-major-version upgrades should be painless in the future, I hope (zero loss, even under systemd). 
[19:36:33] the new systemd template has: [19:36:33] ExecReload=/usr/local/bin/gdnsdctl -l replace [19:37:08] which you could also just execute directly from outside of systemd, either way it will seamlessly reload the whole thing under systemd (including binary changes from a version-bump) and not drop packets. [19:37:54] 10netops, 10Operations, 10Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) Cabling has been done out of order, but end result is there. (minus the 7m DAC). During the re-cabling, the fabric was very unstable: frequent disconnect... [19:53:42] 10netops, 10Operations, 10Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10BBlack) I think, it's hard to evaluate the stability of the intended, supported VCF design while in an intermediate state. It's also probably not reasonable to ex...