[13:59:16] hm, was puppetmaster1001.eqiad.wmnet rebuilt?
[14:00:55] Ah, I see in the SAL it was
[14:01:55] andrewbogott: and will be again, sorry, i rebuilt before changing the pxe config
[14:02:17] * andrewbogott merges on 2001 instead
[17:01:03] hi, re the puppet upgrade: it was far from the seamless upgrade i had hoped for and i apologise for all the noise. however it looks like it has settled down now. I think the basis of the error was that i forgot to migrate the `puppetmaster_ca` parameter to puppetmaster2001 before the reimage. this parameter also controls the volatile endpoint, which was what caused all the problems. It seems when
[17:01:09] the front end is unable to contact the ...
[17:01:11] ... volatile endpoint everything dies. After this was fixed things should have converged, however it seemed that there were a load of puppet processes scattered around in a stuck state causing resource exhaustion. Further, hitting puppetmaster2002 with `/usr/local/sbin/run-puppet-agent --failed-only -q` meant things just could not catch up
[17:03:11] once things have settled i will try some more failovers, as i think codfw can handle the traffic if migrated in an orderly fashion. however my experience today suggests another backend in codfw would help take the load if mistakes like mine happen again
[17:03:31] s/migrated/failed over/
[17:04:21] for now i'm not gonna touch anything else, but please ping me if you see more issues
[20:51:21] jbond42: when you have a spare minute… T235218
[20:51:22] T235218: Catch cloud-puppetmasters up with production puppetmaster versions - https://phabricator.wikimedia.org/T235218
[20:54:39] andrewbogott: no problem, i'll try to get to it tomorrow. however, as you don't use puppetdb it should hopefully be just as simple as reimaging the servers with buster (famous last words)
[20:54:51] hope so :)
[20:55:03] btw when you get a sec could you take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/541213 (not urgent)
[20:58:45] jbond42: if you are still around, which puppet host is the master ATM? still 1001? I'm asking because I'm trying to push updates to volatile but it doesn't seem to work
[20:58:52] jbond42: yep
[21:00:01] ah no, never mind, looks like the master is 1001 indeed
[21:00:18] I'll follow up in a task
[21:01:46] godog: actually the volatile endpoint is on puppetmaster2001 at the moment (it's controlled by the puppet_ca_server) but should be safe to move back
[21:02:28] jbond42: thanks, ok I'll push my updates to puppetmaster2001, I used puppet.eqiad.wmnet out of habit but simple to change
[21:03:03] done, thanks!
[21:03:58] godog: i'm guessing this is the swift files in volatile, can you send me a pointer to some docs? i have to admit i'm a bit ignorant about swift in general and specifically what is in volatile
[21:06:30] jbond42: the stuff in puppet volatile is the swift 'rings', basically the consistent hashing assignments for what data goes on which servers
[21:07:00] I don't think there's a single doc that both gives an overview of the system and of how we have it set up though
[21:07:42] how does it get to the volatile dir on the masters? is it a git repo or something?
[21:08:22] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/swift-ring/+/refs/heads/master/Makefile
[21:08:25] :)
[21:08:35] very advanced technology hehe
[21:08:51] git repo that gets rsync'd by hand
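
A rough sketch of that flow, for context: the swift-ring repo's Makefile builds the ring files, which someone then copies into the active master's volatile directory by hand. The clone URL follows the standard Gerrit pattern; the destination host and paths below are illustrative assumptions, not the real layout.

    # Sketch only: build the swift rings from the git repo, then push them
    # to the active puppetmaster's volatile directory by hand.
    # Destination host and paths are assumptions for illustration.
    git clone https://gerrit.wikimedia.org/r/operations/software/swift-ring
    make -C swift-ring                 # build step per the Makefile linked above
    rsync -av --exclude=.git swift-ring/ \
        puppetmaster1001.eqiad.wmnet:/var/lib/puppet/volatile/swift/
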
[21:09:13] how often do the files change?
[21:09:47] overall, not very often; sometimes a few times in a day when there's a need to do maintenance
[21:10:00] but it isn't uncommon for it to go months between a change being made, either
[21:10:17] basically for adding new machines, decoms, or for extended repairs those files need to be changed
[21:10:58] i'm asking because the volatile endpoint caused a few issues today when it wasn't available. currently the inactive master proxies the volatile endpoint to the active puppetmaster and i'm wondering if it could just serve it from its local system (which is rsynced with the master every 15 mins)
[21:12:01] sounds like for swift data it would mean nodes in codfw may get the config up to 15 mins after nodes in eqiad, would that cause an issue?
[21:12:16] or vice versa
[21:12:31] that should be fine -- it's totally separate data for codfw vs eqiad and is likely to remain that way for a while
[21:12:36] also some amount of eventual consistency is fine
[21:13:27] ok thanks, that's good to know. haven't looked at all the other issues to know if this is viable, but good to get one checked off
[21:34:44] volatile is also used for the tftpboot images
[21:35:00] some delay in propagating those would certainly be an acceptable tradeoff
[21:35:23] they only change every few months when there's a new d-i image for Debian point releases
[21:43:56] cheers moritzm. the other stuff in there is GeoIP, which i think can also be 15 mins stale, and misc and squid, which i still need to investigate
[21:45:09] oh, misc seems to be GeoIP related as well
[21:46:37] * jbond42 wonders if the two 12k squid config files are still needed
[21:46:50] I'm pretty sure the squid directory is totally outdated and can be removed entirely
[21:47:08] squid/frontend.conf has historic hostnames
[21:47:25] from the times before Varnish
[21:48:05] cp1001-cp1020, we're now at cp1076 in prod :-)
[21:48:26] :)
[21:48:33] cool, that's what i was thinking
[21:49:24] this looks promising, i'll put a task together tomorrow with this info, cheers
[21:50:17] ack, sounds good
[21:57:34] whoops, was at lunch, but yeah, what cdanis said
[21:57:44] also super boring technology yes
[21:57:55] boring technology is the best kind tbh
[21:58:31] heheh yeah I agree
[21:59:10] now I can't unsee boring machines for tunnels
[21:59:45] heheh
[22:30:31] adding downtimes in Icinga is much easier than removing them. way too many clicks in the web UI if you change your mind and want to delete them. and then it seemed like i had permission issues but it removed them anyway. uh
[22:31:01] well, i got what i wanted.. just took way too long :)
[22:33:03] soon, alertmanager silences \o/
[22:36:48] ooh. ok!
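
A rough sketch of the local-serving idea floated at 21:10:58: each backend would keep its own copy of volatile at most ~15 minutes stale and serve that, instead of proxying to the active master. The source host, rsync module name, and paths below are assumptions for illustration, not the actual configuration.

    # /etc/cron.d/puppet-volatile-sync -- sketch only; hostname, module
    # name, and paths are assumptions, not the real setup.
    # Every 15 minutes, mirror volatile from the active master so the
    # local copy can be served directly if the master is unreachable.
    */15 * * * * root rsync -a --delete \
        rsync://puppetmaster1001.eqiad.wmnet/volatile/ /var/lib/puppet/volatile/
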