[10:09:41] moritzm, jbond42: what do you think if I do a full release of debmonitor?
[10:11:04] sgtm volans
[10:14:20] yeah, sounds good
[10:14:30] ack will do
[10:15:57] let's simply keep the clients at 0.2.9 since nothing changed there, the debs can be bumped once there's a change which affects them
[10:30:38] ahem, technically speaking there was a fix from last week from me
[10:30:51] https://phabricator.wikimedia.org/T282529
[10:31:01] :)
[10:31:07] on the client I mean
[10:33:46] ah, the corner case found by Arturo, right, totally forgot about it
[10:34:04] yep that one :)
[10:35:30] actually, rolling out the new version would also be a good opportunity to fix the legacy cases for the debmonitor system user
[10:36:11] older installs use 998:998
[10:36:57] but with the new scheme and the centrally managed adduser.conf (via puppet) and the equivalent sysusers.d config, they are now correctly created in the 100-499 range
[10:37:32] nice, anything that needs to land in debmonitor for that?
[10:37:36] in the debian/ dir maybe?
[10:37:56] it already uses a systemd sysuser
[10:38:38] but when we bump the package to 0.2.10 I can use the chance to drop the existing sysuser with cumin and roll out the deb (which then recreates the debmonitor user in the 100-499 range)
[10:39:16] ack, I'll let you know once the package is available
[10:39:22] debian/sysusers/debmonitor-client.conf
[10:39:25] ack, sounds good
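(For reference, a sysusers.d file like the debian/sysusers/debmonitor-client.conf mentioned above is essentially a one-liner. A minimal sketch, with an illustrative GECOS string; the site-specific enforcement of the 100-499 range lives in other config and is not shown here:)

    # debian/sysusers/debmonitor-client.conf (sketch)
    # "u" creates a system user and group; "-" lets systemd-sysusers
    # allocate the UID/GID dynamically instead of the legacy fixed 998:998
    u debmonitor - "debmonitor system user" -

(systemd-sysusers is typically invoked from the package's postinst, which is why dropping the old user and reinstalling the deb recreates it with a fresh ID, as described above.)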
[10:39:51] I think rectifying debmonitor should be the only thing left until we can close https://phabricator.wikimedia.org/T235162 for good
[10:40:09] nice!
[10:40:15] which was found during the puppetmaster buster migration two years ago :-)
[10:40:21] lol
[10:45:40] and no jessie build, yay
[10:52:07] can I finally drop 3.4 support? I think john already did but had to revert it
[10:53:10] yeah, with jessie gone, 3.4 is history as well
[10:53:47] stretch's 3.5 is our current base layer
[10:59:08] moritzm: volans: fyi there is still python3-build-jessie which gets built by docker-reporter-base-images.service (and uses debmonitor-client)
[10:59:29] is that still needed?
[10:59:38] i don't know but it's still there :)
[11:00:01] however i guess if you don't do a jessie release it should be fine
[11:00:12] I would not
[11:00:30] dropping 3.4 support and cleaning a lot of old cruft that was there just to support that
[11:00:46] thanks for reminding me of the docker image
[11:01:00] ack should be fine and not a massive deal if not, think we can fix forward if anything strange happens but think it is probably fine
[11:01:08] ack
[11:04:43] jbond42: no idea about that image, but it can surely go away
[11:07:39] not sure about the image either but i wondered if it's used in CI to build jessie packages that we may still need to build?
[11:08:36] i think the timer parses the docker registry directly to get a list of active/latest images so we would need to delete it from the registry
[11:16:30] it must be getting only packages from upstream as we've already dropped jessie's component in apt.w.o
[11:23:10] or it's just a stale image which can't be rebuilt anyway?
[11:26:46] btw I must be recalling wrong, because by my recollection the timer should already be broken. Don't we start the image, install debmonitor-client and then report the upgradable packages to debmonitor? The install should be already broken IIRC
[11:27:16] anyway, a few other patches coming your way with cleanups
[11:30:08] https://debmonitor.wikimedia.org/images/docker-registry.wikimedia.org/python3-build-jessie:0.0.3 shows upgradeable packages
[11:30:23] and was last updated 4 days, 11 hours ago
[11:30:33] volans: correct, debmonitor-client is installed by docker-reporter
[11:30:45] so after the apt.wikimedia.org repos for jessie were taken down
[11:31:13] and from where does it get it? :)
[11:31:24] so yes, if jessie is already removed then it must already be broken
[11:31:41] maybe they still get cached in the web proxies or so?
[11:31:46] I don't recall when we did the purge from reprepro
[11:31:53] anyway, if it's unused, let's simply remove it :-)
[11:32:22] no, it's currently failing https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=deneb&service=Check+systemd+state
[11:32:36] +1 to remove
[11:32:42] +2
[12:26:30] moritzm, jbond42: FYI I've updated https://wikitech.wikimedia.org/wiki/Debmonitor#Common_commands to reflect both the new cookbook to delete hosts and also to use the new path for the cfssl certs as opposed to the puppet ones.
[12:26:44] I've used $(hostname -f | tr '.' '_') as a trick to get it dynamically
[12:27:03] ack thanks
[12:31:46] looks good
[15:39:36] hello! (I never knew this channel existed! :)
[15:39:43] XioNoX, topranks: sukhe has some questions regarding how to configure netbox for wikidough
[15:39:47] I am trying to add https://gerrit.wikimedia.org/r/c/operations/dns/+/692625 but through Netbox instead
[15:39:49] hello sukhe :)
[15:40:05] in particular there is no prefix but an aggregate right now
[15:40:18] and also if the best role is anycast or VIP for an anycast VIP :)
[15:40:34] and then the next question after that, I started with https://netbox.wikimedia.org/ipam/ip-addresses/8539/ but should it be "VIP" or "anycast" for the role?
[15:40:51] thanks volans
[15:42:32] Based on the other allocations I suspect 'VIP' should be used rather than 'Anycast' for the role.
[15:42:49] But I am too green to say, let's see what XioNoX thinks :)
[15:43:01] topranks: thanks, updated!
[15:43:07] er ok, guess that was too soon :P
[15:43:31] leave it for now, I think that's correct so hopefully you don't need to change back!
[15:43:40] yep
[16:19:05] Probably stupid question from me, just digging around on turnilo...
[16:19:21] Why do we have so much inbound port 853 from Google? Assume it's DoT?
[16:19:32] DoT to authdns
[16:20:02] Do Google do that speculatively with all auth dns?
[16:20:36] Like 8.8.8.8 resolver tries to use DoT if it can when talking to auth NS?
[16:20:42] in the original email they sent, they said they were doing "TLS experimentations" with authdns :)
[16:21:04] they actually asked us if we want to receive TLS traffic from them or not!
[16:21:12] ok cool.
[16:22:05] I guess it puts a bigger load on our resolvers, doing a TLS handshake each time?
[16:24:04] bblack knows more about this than I do (obviously!) but IIRC he had to update some parameters to handle the load
[16:25:57] https://github.com/blblack/gdnsd/commit/3a65b7e41f4f01f92f8a30d0db2d452047fac5cb
[16:32:13] ok cool thanks for the info.
[16:36:25] topranks, sukhe, that IP is an Anycast VIP, good luck with Netbox :)
[16:37:03] more seriously, we don't do anything different in Netbox with one or the other, so VIP is fine
[16:37:07] XioNoX: We should create a "prefix" for 185.71.138.0/24 right?
[16:37:20] Currently the IP has no "parent prefix" as there is only an aggregate in Netbox for that range.
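(Adding the missing prefix is a quick operation in the Netbox UI; for reference, the equivalent call against the API with pynetbox would look roughly like the sketch below. The token is a placeholder, and the status/description values are assumptions, not what was actually entered:)

    import pynetbox

    # placeholder token; in practice this comes from a secret store
    nb = pynetbox.api("https://netbox.wikimedia.org", token="XXXX")

    # create the parent prefix under the existing aggregate so the
    # anycast VIP gets a proper parent instead of hanging directly
    # off the aggregate
    prefix = nb.ipam.prefixes.create(
        prefix="185.71.138.0/24",
        status="active",       # assumed
        description="Anycast", # assumed
    )
    print(prefix.id)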
[16:37:27] XioNoX: haha my commit was for operations/dns anyway, since I didn't know any better than to go with Netbox
[16:37:46] XioNoX: see https://phabricator.wikimedia.org/T252132#7098776
[16:37:53] it's not ideal
[16:37:59] topranks: yep, good idea for the prefix
[16:38:17] yes I was asking before why it doesn't have one, looked weird
[16:38:18] Ok I will add it, I've the page open here.
[16:38:45] * I was asking too
[16:38:55] topranks: it creates a long-lived tcp connection which can send multiple queries/responses along the same tcp stream, so you don't need to do the tls handshake for every query
[16:39:04] good resolvers should mostly stay connected
[16:39:31] and for busy resolvers there can even be a performance boost as you have fewer ip/udp headers to construct
[16:39:37] volans: ack for the org-global, manual is indeed better for now
[16:39:59] we need to decide what to do with those, I guess one file per first-level domain
[16:40:12] I would need to re-check the script and see what's needed there
[16:40:12] jbond42: yeah that makes sense, at least on the authdns side.
[16:40:17] volans: yep
[16:40:40] I think the issue is the global
[16:40:59] But I guess an individual resolver has to connect to many many auth NS constantly, it's not going to be a small set of them with lots of queries to each.
[16:41:00] somehow
[16:41:11] because we do create wikimedia.org-eqiad and the like
[16:41:22] for other domains
[16:41:58] but they are always with 3 levels
[16:42:10] the single records
[16:43:02] topranks: i have much less knowledge/experience on the cache side and have not seen any good papers/presentations on real-world loads yet (not even sure how many big auth servers offer TLS)
[16:43:15] but yes i reckon :)
[16:44:21] I'd not come across any before now, which is why the 853 stuff in netflow caught my attention.
[16:44:49] however there can be boosts for them too: if as a cache you know that these 1000 domains all exist on this one ns (which is very common), when you come to refresh your cache you can refresh all entries for every record on all 1000 domains over one tcp connection, which probably does help
[16:46:29] i.e. instead of sending perhaps 10000 udp packets each with ~64 bytes, you can send one 64k tcp stream, so potentially ~40 packets vs 10000
[16:46:50] Yeah makes total sense.
[16:47:05] but like i said, no real experience with caches
[16:47:37] Only fear would be it encourages centralization, i.e. running more and more domains on fewer and fewer auth NS.
[16:48:06] yes that has been the unfortunate trend with dns for both the auth and the cache :(
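(To make the handshake and packet-count discussion concrete, here is a rough dnspython sketch of DoT queries to an auth server on port 853. The target IP and server_hostname are placeholders; a well-behaved resolver would additionally keep the TLS connection open and reuse it rather than reconnecting per lookup:)

    import dns.message
    import dns.query

    AUTH_NS_IP = "198.51.100.1"  # placeholder; a real authdns IP goes here

    for name in ("wikipedia.org.", "wikidata.org."):
        q = dns.message.make_query(name, "NS")
        # one DNS-over-TLS exchange on port 853; server_hostname is used
        # for SNI and certificate verification
        r = dns.query.tls(q, AUTH_NS_IP, port=853, timeout=5,
                          server_hostname="ns0.wikimedia.org")
        print(name, r.rcode())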
[16:48:56] moritzm: FYI debmonitor-client packages are available on apt, I haven't tried to install it yet though
[16:50:44] I've tested it on cumin1001, all looks good so far
[16:59:21] FYI I'll wait until tomorrow before deploying debmonitor server
[17:30:49] since we decided to do manual for now, looking for a quick review of https://gerrit.wikimedia.org/r/c/operations/dns/+/692625. thanks!
[17:30:55] er wait, sorry, wrong one sigh
[17:31:18] oh right, correct, ignore ^ I thought I was in sre-private but it's sre-foundations lol
[17:32:03] irssi really needs to expand the channel name next to the chat area
[18:26:47] volans: excellent, I will roll this out in the next days
[20:03:32] xe-0/1/3 flapping like mad on cr2-esams
[20:03:45] This is the Lumen 10G to cr2-eqiad.
[20:04:55] interesting, I know there's some break-fix work happening on the codfw/ulsfo Lumen link
[20:05:03] seems unlikely to be related to that though
[20:07:18] indeed
[20:07:27] BGP has remained down despite the physical int flapping.
[20:07:31] So topology is stable
[20:09:35] I don't have an account on their web portal to open a ticket
[20:13:10] ok well that answers my question about what we'd normally do.
[20:13:19] If you don't have one I'm fairly sure I don't (
[20:13:25] one for the #todo list.
[20:13:40] Traffic seems to be routing via cr3-knams and then out via the GTT VPLS.
[20:13:41] https://librenms.wikimedia.org/device/device=66/tab=port/port=19412/
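(For the record, a typical first pass at triaging a flap like this from the Junos CLI looks something like the following; output elided, and exact log tags vary by platform and config:)

    show interfaces xe-0/1/3 | match "Last flapped"
    show interfaces xe-0/1/3 extensive | match error
    show log messages | match xe-0/1/3 | last 50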