[00:35:29] If you are using the new bast3004 you will find a file called $user_bast3002.tar.gz in your home. It contains your home from bast3002. In case you are missing anything you can just unpack that, or delete it if not.
[07:43:06] I am handling s4 backup (db1102 and db1115 alerts)
[14:06:27] volans|off: (for later) - /var/log/wmf-auto-reimage/201910251336_bblack_10319_ganeti3003_esams_wmnet* (reboot step failed because of some host key issue, I imagine a race on redefining it or whatever)
[14:07:55] I think that race is related to some other oddities I've seen the past few days, which ultimately may come from some of the improvements to the mapped-ipv6 setup
[14:08:22] (in that I tend to see late ssh_known_hosts redefinitions of host keys from the temp-ipv6 to the mapped-ipv6?)
[14:10:52] maybe that will all fix itself when we get to the end state here (which I assume is that the mapped-ipv6 will be statically defined by the installer like ipv4)
[14:18:49] puppet-merge seems to have gotten faster lately
[14:19:37] I don't know why, but usually I flip over to some window and do other things while it iterates the masters, and check back a few times. lately as soon as I first check back it's already done
[14:19:52] (so I guess an alternate hypothesis is also that I've gotten slower)
[14:20:47] I noticed the same, I assumed it is due to a more powerful gerrit host
[14:21:27] bstorm_: thanks for review, here's the next one for ferm stuff:
[14:21:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/546189
[14:25:31] godog: so ganeti3003.esams.wmnet is a new authdns server. I assume that based on its role, prometheus will pull gdnsd stats from it and they should show up in the right places in https://grafana.wikimedia.org/d/000000341/dns?orgId=1&from=now-1h&to=now automagically, but nothing yet. Is there some normal delay in noticing a new host all the way out through grafana?
[14:25:57] (it does run the node_gdnsd prometheus exporter with a once-per-minute cron)
[14:32:23] bblack: I see ganeti3003 in the top panel now, it only started showing non-zero values just now though
[14:32:53] but yeah, it should happen automatically indeed
[14:36:14] ah ok, just takes a bit!
[14:36:39] or I guess, maybe it takes some non-zero values showing up, too :)
[14:37:29] hehe yeah, I clicked on ganeti3003 in the legend to isolate it
[14:37:59] looks like only "noerror" is there now
[15:02:48] ottomata: will take a look :)
[15:04:53] bblack: did you do the router changes?
[15:05:25] XioNoX: for ns2?
[15:05:31] yep
[15:05:32] XioNoX: (yes if so)
[15:05:38] ok cool!
[15:05:39] only on cr[23]-esams
[15:05:45] I dunno if you have such rules for ns2 elsewhere
[15:05:51] nothing else needed
[17:54:43] bblack: so from the logs it seems that there was a key mismatch (the key differs from the one for that IP; offending key for IP at line X, matching host key at line Y)
[17:55:28] right
[17:55:42] now the script should run puppet on the cumin host to get the new keys
[17:55:55] but unfortunately, to keep the noise low, I run it with -q :(
[17:55:57] but I think that could be because the old key was still there instead of being replaced by the new one, because the IPv6 was temporarily different (ephemeral IPv6 before the mapped one came back online)
[17:56:07] so I'm checking the changes on puppetboard
[17:56:20] I dunno
[17:56:50] the ephemeral being 2620:0:862:102:b226:28ff:fe6e:ddd0?
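(To make the key-mismatch recovery discussed above concrete: a minimal sketch of clearing the stale entries by hand and letting puppet redistribute the fresh keys. It assumes OpenSSH's ssh-keygen and the run-puppet-agent wrapper quoted later in this log; the /etc/ssh/ssh_known_hosts path is an assumption, and the host/IP values are just the ones from this conversation, used illustratively.)

```bash
#!/bin/bash
# Minimal sketch (not the actual reimage tooling): manually clear a
# stale host key left behind by a reimage, then re-run puppet so the
# new keys get redistributed. Assumes the stale entries live in the
# puppet-managed system-wide file; that path is an assumption.

HOST=ganeti3003.esams.wmnet
OLD_V6=2620:0:862:102:b226:28ff:fe6e:ddd0   # ephemeral IPv6 from the log above

# ssh-keygen -R removes all entries matching the given host or IP;
# -f points it at a specific known_hosts file.
sudo ssh-keygen -R "$HOST"   -f /etc/ssh/ssh_known_hosts
sudo ssh-keygen -R "$OLD_V6" -f /etc/ssh/ssh_known_hosts

# Re-run puppet (without -q, so changes stay visible) to rebuild the
# file with the reimaged host's freshly exported keys.
sudo run-puppet-agent
```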
[17:58:27] not sure
[17:58:41] bblack: search for ganeti3003 in those two runs:
[17:58:41] https://puppetboard.wikimedia.org/report/cumin1001.eqiad.wmnet/3008dd96139d6cfdbe56b4c6bd00b8abaed7c937
[17:58:45] and
[17:58:46] https://puppetboard.wikimedia.org/report/cumin1001.eqiad.wmnet/65d24ff26c23fbc2320566d8c6c518afed1e3c68
[17:59:05] the second one is the one *after* the first
[17:59:45] right
[18:00:13] although they both seem to be the regular every-30m runs, I'm still trying to match timestamps to see which one was the reimage-triggered one
[18:00:22] reimage started ~13:36, just before that first one
[18:00:47] and errored at 2019-10-25 14:00:23
[18:00:47] and probably was still picking up the old ipv6
[18:00:57] while the new one was only picked up at 05
[18:00:59] :05
[18:02:48] now, the logs are failing me, but in the output you should have got the hostname too; I'm betting that:
[18:02:51] 2019-10-25 14:00:22 [INFO] (bblack) wmf-auto-reimage::print_line: Puppet run completed
[18:03:03] should be the puppet run on cumin1001 to get the new keys
[18:03:10] that apparently never happened though
[18:04:40] and from the _cumin log I can only see:
[18:04:50] ----- OUTPUT of 'run-puppet-agent -q' ----- without anything after that
[18:05:03] but it was successful
[18:05:07] weird...
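(A rough sketch of how the -q-silenced step could be double-checked, per the exchange above: re-run the cumin-side puppet agent without -q and grep both its output and the reimage log for the key refresh. The grep patterns are guesses at what a verbose puppet run would print, not confirmed strings; the log glob is the one quoted earlier in the log.)

```bash
#!/bin/bash
# Rough sketch: check whether the puppet run on the cumin host
# actually picked up ganeti3003's new host keys.

# Re-run the agent verbosely (no -q) so resource changes are printed,
# and look for the known_hosts entries being rewritten. The patterns
# are assumptions about puppet's verbose output.
sudo run-puppet-agent | grep -iE 'ssh_known_hosts|ganeti3003'

# Cross-check the timeline: when did the reimage script log that the
# cumin-side puppet run completed? (glob as quoted in the log above)
grep -h 'Puppet run completed' \
    /var/log/wmf-auto-reimage/201910251336_bblack_10319_ganeti3003_esams_wmnet*
```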