[08:54:53] akosiaris: re: kubetcd monitoring, I see the targets are down according to prometheus here https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&from=now-1h&to=now
[08:55:10] and indeed prometheus1003:~$ curl kubetcd1004:2379/metrics
[08:55:17] returns an empty body
[08:55:30] https://
[08:55:38] maybe I need to switch the scheme
[08:56:00] ah! no, my mistake, the config has https
[08:56:13] but missing SAN, maybe it is that
[08:56:29] ah, it needs the fqdn perhaps?
[08:56:47] yeah, the fqdn works fine as well
[08:57:23] indeed
[08:58:16] subject: CN=_etcd-server-ssl._tcp.k8s3.eqiad.wmnet
[08:58:30] I guess we can reissue them with the hostname as well as the fqdns as SANs
[08:58:43] unless it's easy to use fqdns instead
[08:59:08] so yeah, either the hostname in the SAN or hostnames_only => false in the class_config
[08:59:31] I'll do the latter.. seems faster
[08:59:34] thanks for that pointer
[08:59:57] I think the former would be preferable, the latter changes the 'instance' label to be a fqdn as a side effect; if you don't mind that then yeah, it is faster
[09:02:00] I actually prefer the instance to be a fqdn in most cases
[09:02:15] but I haven't really created anything yet, so for now it's a moot point
[09:02:49] indeed
[11:43:06] heads-up, I'm gonna remove the remaining changeprop components from scb in a few minutes https://gerrit.wikimedia.org/r/c/operations/puppet/+/603534
[13:04:43] what impact should this have?
[13:26:01] <_joe_> none
[13:26:09] <_joe_> hnowlan: burn with fire
[13:29:25] all gone
[13:55:13] marostegui: d1 and d2 pdu swap has been completed.
[13:55:23] cmjohnson1: \o/ thanks
[13:55:41] Jeff_Green: I am going to be starting the fundraising rack now, C1-eqiad
[14:11:47] who can I talk to about url-downloader? dns git blame suggests Alex but that was from 2016 :)
[14:12:47] we should fix git blame. hardcode the output to say "it's elukey. it's always elukey"
[14:15:40] can volans call volans ?
[14:16:13] every time I try it's occupied
[14:16:41] volans: what do you need wrt url downloaders?
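Editor's note on the 08:55-08:59 kubetcd exchange above: the SAN mismatch can be confirmed from the prometheus host with standard tools. This is only a sketch; the FQDN is an assumption (the log shows just the short hostname kubetcd1004), and these are not the exact commands that were run.

```sh
# Show the certificate etcd presents and its Subject Alternative Names
# (FQDN assumed; only the short name "kubetcd1004" appears in the log).
echo | openssl s_client -connect kubetcd1004.eqiad.wmnet:2379 \
    -servername kubetcd1004.eqiad.wmnet 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'

# Scraping by FQDN passes TLS hostname verification when the FQDN is among the
# SANs, while the bare hostname does not; add --cacert for the internal CA if needed.
curl --silent https://kubetcd1004.eqiad.wmnet:2379/metrics | head
```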
[14:17:08] XioNoX: he'd find a nit for sure
[14:17:21] hahaha
[14:17:59] moritzm: so, we have url-downloader.eqiad.wikimedia.org (and similar for codfw) that are A records with the same IP as urldownloader[12]002
[14:18:08] and then we have url-downloader 1H IN CNAME url-downloader.eqiad
[14:18:34] I was wondering 1) if we need all those 3 records or just one is enough and 2) if they could all be CNAMEs
[14:21:13] depends on what is being used in the deployments, I guess url-downloader might have been a thing from the time before codfw was set up
[14:21:31] I think in Puppet we mostly configure the site-specific CNAMEs
[14:22:04] not sure about images
[14:23:11] I'm not sure why those are A records
[14:23:36] and I don't see a reason why a CNAME wouldn't work
[14:24:40] I'm missing the context for what they are used for, but from the name I couldn't think of a use case where a cname wouldn't work
[14:24:56] I can send a patch to convert them
[14:25:06] if there is someone willing to merge it :D
[14:28:25] we can simply give it a shot on eqiad first, it receives some minor level of requests as well
[14:29:37] ack, btw they are all 1H TTL, if the idea was to do failovers that's not great
[14:33:06] I think that's simply because no one ever bothered to change away from the 1H default, the url downloaders are practically maintenance-free apart from the odd OS upgrades
[14:35:53] ack
[14:36:38] moritzm: there is one catch with moving to cnames, it will be v6 too
[14:36:47] and maybe by default v6
[14:36:58] in case the listener is not configured for it
[14:37:19] the hosts have v6 defined in DNS fwiw
[14:39:35] just checked, squid also listens on v6
[14:40:21] great
[14:40:53] patch sent
[14:42:50] and maybe check ip6tables too
[14:44:02] volans: +1'd, I can merge this on Monday and keep an eye on it
[14:44:39] moritzm: sounds perfect, thanks a lot
[14:46:09] XioNoX: at first look ip6tables looks very similar to the v4 one
[14:46:36] great
[14:49:42] kormat: meta question - who's this elukey person that you people keep mentioning today?
[14:50:12] :D
[14:51:06] elukey: if you don't know, i'd keep it that way. :)
[16:20:03] wikibugs down?
[16:20:31] Going to give it a restart
[16:27:58] marostegui: I can restart it if you want or don't have access
[16:28:20] hauskatze: thanks - done already :)
[16:28:37] Looks like it's working though
[16:28:42] On some channels at least
[16:29:26] hauskatze: yeah, I restarted it 8 minutes ago :)
[17:04:56] volans: interested in hearing about weird sre.hosts.decommission behavior? (or if not you, someone?)
[17:05:39] andrewbogott: sure, shoot
[17:06:06] ugh, hard to backscroll in 'screen'
[17:06:08] https://www.irccloud.com/pastebin/4mcsuyr4/
[17:06:31] This could easily be user error, I'm not sure I've used this before
[17:07:37] that spacing makes it super hard to read
[17:08:01] what CLI did you run?
[17:08:15] sudo cookbook -d sre.hosts.decommission --force FORCE cloudvirt100[1-9].eqiad.wmnet -t T263151
[17:08:15] T263151: decommission cloudvirt100[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T263151
[17:09:31] what's --force FORCE ?!?!?
[17:09:45] without it, it will only do up to six hosts at once
[17:09:51] ah damn, that's supposed to be an action store_true
[17:09:54] I'll add it
[17:09:55] I can try doing six without it
[17:09:55] it does the decommission with a hammer
[17:10:00] but unrelated
[17:10:35] so you ran it in dry-run
[17:11:09] in dry-run we force the netbox-ro token to be selected
[17:11:14] because you know, dry-run
[17:11:38] Maybe it does a dry run by default?
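Editor's note on the 14:17-14:46 url-downloader thread above: the records in question can be inspected with dig before and after the conversion. The names are taken from the conversation; the exact zone contents are not reproduced here.

```sh
# Per-site name: currently an A record carrying urldownloader1002's IPv4 address.
dig +noall +answer url-downloader.eqiad.wikimedia.org A

# The existing alias on top of it ("url-downloader 1H IN CNAME url-downloader.eqiad").
dig +noall +answer url-downloader.wikimedia.org CNAME

# The "it will be v6 too" caveat: once the per-site name is a CNAME to the host
# record, an AAAA query follows the alias and returns the host's IPv6 address.
dig +noall +answer url-downloader.eqiad.wikimedia.org AAAA
```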
I definitely didn't specify dry-run on the cli
[17:11:45] yes, -d
[17:11:48] oh!
[17:11:49] ok
[17:12:06] want me to try again without?
[17:12:08] I can make it fail in a better way, it's very hard to make this one work with dry-run
[17:12:15] sure, go ahead, should work fine
[17:12:20] famous last words™
[17:13:14] seems happy so far
[17:13:38] probably no need to actually fix anything here
[17:13:45] the argument for sure
[17:13:56] I guess you could make the recipe say 'nope, I can't do a dry run' if asked
[17:16:36] I think I can make it work, let me try
[17:19:28] patch sent
[17:24:28] hm, now it's proposing to remove
[17:24:41] db1131 1H IN AAAA 2620:0:861:103:10:64:32:6
[17:24:52] from dns along with the cloudvirt bits
[17:25:23] andrewbogott: yeah, expected, go ahead
[17:25:31] ok. Just a race with other dns changes?
[17:25:46] I forgot to run the dns cookbook after fixing a weird situation in netbox
[17:25:52] that host got moved to a different row today
[17:25:56] ok
[17:27:08] andrewbogott: you're the lucky one finding weird stuff
[17:27:15] I'm failing to see in the task what failed for cloudvirt1005.eqiad.wmnet (FAIL)
[17:27:32] was the stdout more clear?
[17:27:36] https://www.irccloud.com/pastebin/uvA6Epaj/
[17:28:02] yeah, that's the same as the task
[17:28:07] I meant specific to 1005
[17:28:23] see https://phabricator.wikimedia.org/T263151#6471727
[17:28:41] failed actions are in bold, let me look at the logs
[17:31:59] andrewbogott: got it, Failed downtime host on Icinga (likely already removed)
[17:32:14] as long as it did everything else that doesn't worry me :)
[17:32:15] it shouldn't consider it a failure of the whole thing
[17:32:49] ok for me to move ahead with purging puppet/dns?
[17:33:18] yeah, I don't see any blocker
[17:34:12] included the fix/improvement in the existing CR
[17:34:17] 'k
[17:34:33] thank you! This is all pretty great even with confusing output :)
[17:35:54] :)
[18:11:08] trying to build the Etherpad package on deneb. Running into "Unmet build dependencies: nodejs (>= 10) npm (>= 5.8) libpq-dev". I know there would be "apt-get build-dep etherpad-lite" but also that since we use pbuilder/pdebuild we should be able to avoid cluttering the build host like that and it should "just work". It doesn't though for me.
[18:13:34] this is on kafka main nodes:
[18:13:34] Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, changeprop-admin is not a valid group name
[18:15:22] looks like it's from https://gerrit.wikimedia.org/r/c/operations/puppet/+/603534
[18:16:16] elukey: i'll fix that
[18:18:09] mutante: thanks!
[18:23:18] aww. there is a second error hidden behind that one. cpjobqueue-admin is not a valid group name
[18:35:13] elukey: i ran puppet on kafka-main via cumin (6 hosts) and they should all be recovered. leaving comments on that ticket to make sure it's also desired that these shell users can't get on kafka hosts anymore, only roots now
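Editor's note on the 18:11 etherpad-lite build failure above: the "Unmet build dependencies" wording is the one dpkg-checkbuilddeps emits, so one plausible (unverified) explanation is that it is the host-side dependency check pdebuild performs before handing the package to pbuilder, rather than anything inside the chroot. A hedged sketch of a possible workaround, assuming that is the case:

```sh
# From the package source directory on the build host:
# -d tells dpkg-buildpackage not to check build dependencies on the host,
# leaving dependency resolution to pbuilder inside the chroot.
pdebuild --debbuildopts "-d"

# The alternative mentioned in the log, which installs the build deps on the
# host and is exactly the clutter the author wanted to avoid:
# sudo apt-get build-dep etherpad-lite
```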