[02:33:47] Traffic, Operations, Performance-Team (Radar), Services (designing), and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (Imarlier) On a very random note, I wanted to say that I enjoyed this: {F27546380} Guess the subscriber list tr...
[08:45:51] Traffic, Operations, Performance-Team (Radar), Services (designing), and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (Gilles) @imarlier https://translate.google.com/#view=home&op=translate&sl=et&tl=en&text=krinkle
[12:17:27] Traffic, Operations: kartotherian TLS support - https://phabricator.wikimedia.org/T211970 (ema) p:Triage→Normal
[12:18:17] Traffic, Operations, Maps (Kartotherian): kartotherian TLS support - https://phabricator.wikimedia.org/T211970 (ema)
[13:07:55] ema: https://github.com/wikimedia/operations-debs-trafficserver/blob/master/proxy/http2/hpack-tests/story_16.json
[13:08:25] ^ I don't think it actually does anything we care about, but it was one of the few matches in wikimedia's repos for "bits.wikimedia.org", which apparently they used in some hpack test before
[13:10:21] This repo is supposedly useless as well, but it seems like an odd thing to still be publishing in our official wikimedia github: https://github.com/wikimedia/www.wikipedia.org
[13:10:26] (and also references bits)
[13:30:08] internal nxdomains, from a relatively short sample on authdns1001: https://phabricator.wikimedia.org/P7914
[13:30:25] (names we gave nxdomain responses for, to our own dns caches for our own servers' lookups)
[13:30:46] mostly it's cross-domain stuff, e.g. looking up a codfw.wmnet hostname in eqiad.wmnet, or any internal hostname in wikimedia.org
[13:30:56] a lot of which is going to boil down to /etc/resolv.conf searchlists
[13:31:19] I imagine a fair amount of it is that various configurations that could/should be using explicit full hostnames are using short ones and relying on resolv.conf searching
[13:31:42] fun, off the top of my head I think at least prometheus and icinga don't have fqdns in their config and would do a lot of lookups
[13:31:55] "a lot"
[13:33:23] it might have a minor perf benefit at least, to fix those on the configuration side
[13:34:02] it seems our normal resolv.conf setting for this is just a single "search foo", where foo is the same domainname the host appears to be in (so e.g. "eqiad.wmnet" for a private host or "wikimedia.org" for a public-subnet one)
[13:34:43] you can effectively turn off resolv.conf's guessing by getting rid of the search line and doing "domain .", but I imagine it would break something at this point.
[13:37:11] yeah I remember bumping into the search list needing all domains for prometheus because of the unqualified names
[13:37:41] it's probably not worth the 20 line ramble explaining exactly the "why" mechanisms, but TL;DR is that at some future point wikimedia.org had DNSSEC, that high volume and broad hostname range of pointless nxdomains will matter more, and might break things (not for the dns servers, but for the clients doing these pointless lookups)
[13:38:18] s/is that at some future point/is that if at some future point/
[13:39:17] and really within our own puppet-managed internal stuff, we always know the fqdn one way or another. We should just be explicit everywhere.
[13:40:09] (that's not to say that e.g. icinga labels need the qualification in the UI and alerts, but the hostnames being hit via DNS do)
[13:41:56] agreed, ditto for prometheus, there might be a way to have qualified names in config and unqualified ones for display/metric purposes
[13:42:35] I've always wondered why we have non-qualified names in icinga
[13:46:06] the medium-length explanation is that if we want to do DNSSEC efficiently and not be a reflection source for DoSing other sites, we'll probably dynamically sign nxdomain responses
[13:47:08] and thus to avoid being DoS'd ourselves by outsiders sending tons of lookups on random domains to force us to do signing operations at a high rate, the server would implement a local cache of recent signings to cover common/real nxdomain signed outputs (e.g. from misconfigs and recently-removed things)
[13:47:41] and then ratelimit cache misses and offer no response at all for excessive cache misses to defend.
[13:48:48] (and thus if icinga was doing a high rate of nxdomain lookups to wikimedia.org, it would quickly run into queries that time out with no response while searching through its domain searchlist, which would probably slow it to a crawl and break it)
[13:52:38] s/high rate of nxdomain lookups/high rate of nxdomain lookups across a broad set of invalid hostnames/
[13:54:02] in my short sample on authdns1001, our own internal nxdomain lookups are ~1.6% of all lookups against the authserver heh
[13:55:41] public external nxdomains are another ~1%, but they clump up on a few key cases
[13:56:05] bits.wikimedia.org (which has been dead for years now) is ~0.27% of all lookups on its own.
[13:57:13] 2 years and 4 months since we pulled that name from DNS, and that was after we made a solid effort at removing references from everywhere we could.
[14:32:47] bblack: I'll let the ats folks know (re: story_16.json)
[14:33:47] I don't think it actually generates lookups anyways, I think it's just used for internal unit testing as a random past example from the wild
[14:35:53] as a first step to add TLS to maps: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/479669/ https://puppet-compiler.wmflabs.org/compiler1002/13954/
[14:36:22] after this we'll need to add a :443 service to the LVSs and then we should be done I think!
[14:36:25] reviews welcome :)
[14:40:36] oh, I forgot ferm and monitoring, adding those too
[16:28:59] DNS cookies are working well: https://grafana.wikimedia.org/d/000000341/dns?panelId=13&fullscreen&orgId=1&from=now-24h&to=now
[16:29:22] init means the client sent a client-only cookie with no server cookie, to bootstrap with us
[16:29:33] ok means they sent a valid server cookie we sent them previously
[16:30:01] bad means they sent us a bad server cookie (possibly an outdated one we sent them over an hour before; it's ok for that to happen, and for legit cases it acts like the init case)
[16:30:47] my interpretation of the stats is that the few caches that use cookies are using them with us pretty well, and reusing them for many requests over their ~1-2h lifetime from init/refresh.
[16:31:51] err: missing context maybe: DNS Cookies is https://tools.ietf.org/html/rfc7873 . It's a ToFU mechanism to avoid blind injection/forgery of UDP DNS reqs/resps.
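
For reference, a minimal client-side sketch of the init/ok exchange described above. This is only a sketch, assuming dnspython >= 2.0 is available; the server address and query name are placeholders, and 10 is the EDNS COOKIE option code from RFC 7873:

    # Sketch of the RFC 7873 client-cookie bootstrap ("init") and reuse ("ok"),
    # assuming dnspython >= 2.0; not any of our actual server/client code.
    import os
    import dns.edns
    import dns.message
    import dns.query

    SERVER = "192.0.2.1"   # placeholder address, not a real WMF nameserver
    COOKIE = 10            # EDNS option code for COOKIE (RFC 7873)

    # "init": the client sends only its 8-byte client cookie.
    client_cookie = os.urandom(8)
    q = dns.message.make_query(
        "en.wikipedia.org", "A", use_edns=0,
        options=[dns.edns.GenericOption(COOKIE, client_cookie)])
    r = dns.query.udp(q, SERVER, timeout=2)

    # A cookie-aware server echoes the client cookie and appends its own
    # 8-32 byte server cookie; replaying that full blob later is the "ok" case.
    for opt in r.options:
        if opt.otype == COOKIE:
            full = opt.to_wire()   # client cookie + server cookie
            print("server cookie:", full[8:].hex())

A legit cache that keeps replaying the full cookie over its ~1-2h lifetime is what shows up as "ok" in the graph; an outdated server cookie lands in "bad" and gets treated like init.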
[16:32:08] (a paranoid cache could also do the init bootstrap over TCP, but few do apparently)
[17:41:06] netops, Operations, ops-eqiad: asw2-a-eqiad FPC7 faulty PEM0 - https://phabricator.wikimedia.org/T206972 (Cmjohnson) Open→Resolved I received the new PEM from juniper ... resolving this task
[20:19:07] netops, Operations: migrate netinsights from rhenium to sulfer - https://phabricator.wikimedia.org/T212011 (RobH) p:Triage→Normal
[20:25:41] netops, Operations: migrate netinsights from rhenium to sulfer - https://phabricator.wikimedia.org/T212011 (RobH)
[21:32:06] netops, Operations: migrate netinsights from rhenium to sulfer - https://phabricator.wikimedia.org/T212011 (RobH) a:RobH→faidon So the setup task noted that @faidon is familiar with the services on this box, assigning him for input on the best way to migrate.
[23:08:40] netops, Operations, Patch-For-Review: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (ayounsi) Open→Resolved This has been quiet since. No root cause identified though.
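
Going back to the resolv.conf searchlist thread earlier in the log (13:30-13:48): a toy Python sketch of glibc-style search expansion, assuming the default "options ndots:1"; the hostnames and search domains are just examples, and it only illustrates the order of names tried, each miss of which reaches the authservers as an NXDOMAIN like the ones in P7914:

    # Toy model of resolv.conf search-list expansion (glibc-style, assuming
    # the default ndots:1); example hostnames/search domains, not real config.
    def candidates(name, search=("eqiad.wmnet",), ndots=1):
        """Return the names tried, in order, for a lookup of `name`."""
        if name.endswith("."):        # trailing dot: fully qualified, no search
            return [name]
        as_is = name + "."
        searched = [f"{name}.{dom}." for dom in search]
        # Names with >= ndots dots are tried as-is first and only fall back to
        # the search list on failure; names with fewer dots go through the
        # search list first.
        return [as_is] + searched if name.count(".") >= ndots else searched + [as_is]

    # An unqualified name in an icinga/prometheus config on an eqiad host:
    print(candidates("prometheus1003"))
    # ['prometheus1003.eqiad.wmnet.', 'prometheus1003.']

    # A dotted name whose as-is lookup fails falls through to the search
    # domain, which is where cross-domain junk like foo.codfw.wmnet.eqiad.wmnet
    # comes from:
    print(candidates("decommed.codfw.wmnet"))
    # ['decommed.codfw.wmnet.', 'decommed.codfw.wmnet.eqiad.wmnet.']

Using fully-qualified names (or a trailing dot) in the configs skips the search step entirely, which is the "be explicit everywhere" point above.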