[00:33:52] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1116888 is one possible solution for T385530
[00:33:53] T385530: Unable to persistently set fs.inotify.max_user_instances and fs.inotify.max_user_watches - https://phabricator.wikimedia.org/T385530
[00:34:16] totally untested!
[00:36:04] ooh, that's very similar to what I just started to write. I'm unsure about how the hiera lookup will behave though, I think it's regarded as bad form to not use explicit lookups.
[00:36:08] But I'll try it first :)
[00:37:57] In the profiles I set the values directly without hiera
[00:38:51] yep, that part is definitely right
[00:38:58] in classes we typically don't use the 'lookup' function and instead let Puppet do its normal thing, we just don't like that normal thing in prod config because someone decided that
[00:39:31] * bd808 knows who and when but that's irrelevant to the code convention
[00:41:29] yeah, as I feared setting "base::sysctl:;inotify::max_user_instances: 511" in hiera doesn't seem to do anything
[00:41:46] so it works perfectly as a ready-made pair of settings but we maybe want to add a way to configure via hiera...
[00:41:50] unless I'm missing something
[00:41:58] if you have the ";" typo in there that's why
[00:42:20] ...
[00:42:38] my first commit message had a typo and your paste just did too
[00:42:50] yep, fixing
[00:43:25] ok, minus the typo it does exactly what you say it does
[00:43:32] I used hiera *a lot* in mediawiki-vagrant ;)
[00:43:50] and did so actually for over a year before it was adopted in prod
[00:44:47] The thing that is less useful about that method is that you can only apply one setting for the whole server via hiera no matter how many times the class is reused.
[00:45:00] but for this that should be perfectly fine
[00:50:18] yeah, I think it's just what we need. adding Janis for a rubber stamp
[01:04:29] * andrewbogott stopping for the night
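
A minimal sketch of the approach that ended up working above: a small parameterized class whose values are also reachable through Puppet's automatic class-parameter lookup in hiera. This is not the actual content of change 1116888; the class name and defaults are only inferred from the hiera key quoted at 00:41, and it assumes the repo's sysctl::parameters define:

    # Hypothetical sketch; class name and defaults inferred from the discussion above.
    class base::sysctl::inotify (
        Integer $max_user_instances = 128,
        Integer $max_user_watches   = 65536,
    ) {
        # Assumes the repo's sysctl::parameters define, which writes the given
        # key/value pairs into /etc/sysctl.d and applies them.
        sysctl::parameters { 'inotify':
            values => {
                'fs.inotify.max_user_instances' => $max_user_instances,
                'fs.inotify.max_user_watches'   => $max_user_watches,
            },
        }
    }

Because the values are class parameters, a hiera key such as base::sysctl::inotify::max_user_instances: 511 is bound by automatic parameter lookup with no explicit lookup() call, which is the behaviour confirmed at 00:43; the trade-off noted at 00:44 is that the hiera value applies once for the whole host, however many profiles pull the class in.
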
[08:22:30] arturo: great work on https://wikitech.wikimedia.org/wiki/News/2025_Cloud_VPS_VXLAN_IPv6_migration !
[08:51:30] XioNoX: thanks, hopefully we will restart work on that at the end of February, in a few weeks
[09:09:08] ---> T385553 seems concerning
[09:09:08] T385553: Cloud VPS puppet breakage on 2025-02-04 related to puppet-enc - https://phabricator.wikimedia.org/T385553
[09:12:57] is that the "java unattended-upgrade without a puppetserver restart" bug again?
[09:13:09] good theory, let me check
[09:15:25] taavi: that was it
[09:15:52] `sudo systemctl restart puppetserver.service` on the puppetserver fixes the problem
[09:39:51] taavi: wdyt? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117122
[09:43:41] does that prevent manually upgrading the package?
[09:52:31] taavi: I'm not sure. I know one or two ways to override any apt preferences by hand (like `install package=version`)
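
On the pinning question just above: presumably the intent of the patch is to keep unattended-upgrades from replacing puppetserver's JRE, while, as noted at 09:52, an explicit `install package=version` can still override apt preferences by hand. A minimal sketch of what such a pin could look like in Puppet; this is not the actual content of change 1117122, and the package name and version pattern are placeholders:

    # Hypothetical sketch; the real change may take a different form entirely.
    # Holds the JRE at its current line so unattended-upgrades does not swap it
    # out underneath the long-running puppetserver process.
    file { '/etc/apt/preferences.d/puppetserver-jre.pref':
        ensure  => file,
        owner   => 'root',
        group   => 'root',
        mode    => '0444',
        content => "Package: openjdk-17-jre-headless\nPin: version 17.0.*\nPin-Priority: 1001\n",
    }

A deliberate upgrade done by hand would then still want the puppetserver restart from 09:15 afterwards.
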
[12:44:37] andrewbogott: I think we are ready to go wrt. the cloudgw operation
[12:44:59] great! I assume there's nothing for me to do other than stand on the sidelines and cheer :)
[12:45:36] mmm
[12:45:42] there are 2 reimage operations to perform
[12:45:56] to change puppet roles from cloudgw -> insetup and insetup -> cloudgw
[12:46:04] I can do insetup -> cloudgw and you do the other?
[12:46:24] cloudgw1004 is changing from insetup to the cloudgw role
[12:46:32] cloudgw1002 is changing from the cloudgw role to insetup
[12:46:46] this patch, basically https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114998
[12:50:16] arturo: you want to reimage rather than just apply puppet and let it do its work?
[12:50:30] yes, I prefer a full reimage
[12:50:56] in neither of the two role transitions is puppet prepared for proper leftover cleanup
[12:51:04] ok!
[12:51:06] folks I don't think I can make the cloudgw call... but any issues just ping me
[12:51:14] ok, thanks topranks
[12:51:37] arturo: should we just decom 1002 rather than go to ->insetup?
[12:51:52] topranks: that's OK! I don't think we are doing the videocall anyway, the calendar invite was mostly a time slot reservation
[12:52:10] andrewbogott: I believe ->insetup is == decom
[12:52:27] I guess it's a step on the road, I usually skip it but that's fine.
[12:52:32] Any reason for me not to start that right now?
[12:52:38] do it!
[12:53:36] hm, this patch also touches 1001, will it be upset about 1004 not being ready yet?
[12:53:48] hopefully not
[12:53:53] they can survive without each other
[12:53:55] ...and stashbot dies, I hope we aren't having a toolforge outage now
[12:54:01] we aren't
[12:54:05] ok :)
[12:54:28] I'm checking with
[12:54:30] while true ; do sudo cookbook wmcs.openstack.network.tests --cluster-name eqiad1 ; done
[12:54:33] on cloudcumin1001
[12:57:04] and... just doublechecking, we're sure that cloudgw1001 is currently the active gw right? :D
[12:57:10] yes
[12:57:28] I failed them over by hand a few minutes ago
[12:57:28] ok, here we go
[13:00:29] ah
[13:00:40] I guess that might explain all the bots on tools disappearing from IRC
[13:00:54] Reedy: yeah
[13:01:20] our particular CG-NAT does not really play well with the IRC bots
[13:02:35] andrewbogott: I noticed I selected 'bookworm' for the new OS of cloudgw1004, whereas cloudgw1002. So if I don't cancel the reimage, this is not only replacing hardware but the OS as well. Wdyt?
[13:02:53] whereas cloudgw1002/1001 are bullseye
[13:02:56] *
[13:03:23] I don't think we should pair the two things together
[13:03:41] I don't think we should, but on the other hand it saves us another future outage if we skip ahead...
[13:03:43] and at the very least, we should exercise the cloudgw puppet role on codfw1dev first
[13:03:51] Oh, well, that's a good point.
[13:03:52] Hm
[13:03:57] So I guess bullseye for today :(
[13:04:09] ok, will re-start the reimage
[13:06:22] andrewbogott: the cookbook is asking whether this puppet role is already in puppet7
[13:06:42] let me check on 1001
[13:07:12] yes, it's on 7
[13:07:34] ok, thanks
[13:12:56] arturo: belt + suspenders: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117189
[13:14:25] andrewbogott: maybe leave the file hieradata/hosts/cloudgw1003.yaml for when we move it
[13:14:59] or, specify the setting on the role level, rather than per host?
[13:15:40] I think I'm generally confused about this, let me do some more tests
[13:21:00] arturo: it's already set in the role hiera so I'll just abandon my patch, all is well
[13:21:36] ok great
[13:29:14] so far the cloudgw1004 reimage is going fine. First puppet run started
[13:31:35] 1002 reimage is wrapping up
[13:37:59] 1004 first puppet run completed, now rebooting and another puppet run
[13:46:12] reimage finished
[13:46:16] I'm spot checking a few things
[13:46:24] then we will failover to the new box and see what happens?
[13:50:19] scary but, yes, I guess that's the plan :)
[13:55:14] andrewbogott: ok, then doing it now!
[13:55:27] ok!
[13:55:49] done
[13:56:31] network tests look green
[13:56:37] * andrewbogott trying to not say anything that's bad luck
[13:58:37] I guess we need to change the dashboards to show the new node...
[13:58:47] I'm doing so now
[13:59:38] done
[14:00:22] whoops I'll drop my pending changes
[14:00:23] cloudgw1004 is now sustaining almost 2Gbps traffic. I think we can declare the operation over, and a success :-P
[14:00:50] agreed, everything looks good to me.
[14:01:13] So we'll give this a couple of weeks and then do the other switchover.
[14:01:22] ok
[14:01:25] Which we don't need to schedule since it'll be a passive server being changed
[14:01:29] nice work!
[14:01:50] we would need to failover as well, for the sake of testing the new box, but yeah
[14:01:55] thanks for assisting
[14:01:58] true
[14:03:24] arturo: want to send the 'all clear' email?
[14:03:39] you do please, I'm about to lunch
[14:04:08] ok
[16:25:15] topranks: there was something weird happening wrt. DNS while we did the cloudgw migration, see T385600. Not impacting anything, but definitely a mystery
[16:25:16] T385600: Cloud DNS: investigate weird graph during cloudgw replacement operation - https://phabricator.wikimedia.org/T385600
[16:41:55] topranks: thanks for confirming :-) I'm a bit paranoid about DNS problems lately
[16:43:24] arturo: np
[16:43:42] we can change that if you want, configure bird on the hosts to send a community with the router which will affect routing
[16:44:07] basically tag the routes as high/low priority
[16:44:20] nah, I'm fine with how things are today