[05:48:33] morning!
[05:48:42] there is something weird going on with restbase1017
[05:49:23] ah ok https://phabricator.wikimedia.org/T222960, bootstrapping
[05:50:06] godog: --^ :)
[05:50:17] it seems that some manual work on fstab is needed
[05:50:33] lemme know if I can help, not sure how to do it :)
[05:50:42] (I mean, without causing a mess!)
[07:53:22] elukey: bootstrapping indeed, I'll take a look and let you know!
[07:53:30] <3
[10:16:03] I am going to perform a codfw master database switchover
[11:18:26] db2045 has a bad BBU and is likely to lag https://phabricator.wikimedia.org/T227862
[11:18:37] I have downtimed it for the weekend
[11:18:45] (it is mostly depooled and passive)
[11:19:12] but noting it here in case it appears on some other monitoring (mw, prometheus)
[14:25:09] godog: I'm getting a bunch of iowait alerts; any chance that's a side-effect of the diamond/prometheus patch you just merged?
[14:25:54] andrewbogott: yeah, 100% chance it is that; I'm checking whether the alerts are real or need tweaking
[14:26:05] thank you!
[14:26:10] I'll ack in the meantime
[14:28:36] andrewbogott: did that page? if so, I'm sorry :(
[14:29:39] I think it's a selective wmcs page; everyone's awake right now anyway, so no big deal :)
[14:35:30] I didn't get a page here
[14:39:42] yeah, me neither; part of the reason I asked is that I suspected it was wmcs-only
[14:42:32] yeah, it has contact groups set to wmcs
[14:46:21] ok, deployed; should be recovering soon
[14:51:41] godog: I got fewer recovery pages than I expected, but things look clear on the icinga dash
[14:52:25] (and, dang, the icinga 'All Unhandled Problems' dash is shorter than it's been in years; we are so close to the finish line!)
[14:53:07] hehe, the finish line is a lie!
[14:53:20] but yeah, I see all but labstore1004 recovered
[14:55:12] oh, I guess it dropped off the dash when I ack'd it
[14:55:30] so is 1004 still alerting due to a metrics change, or due to an actual load change?
[14:57:53] heh, load seems fine, so probably icinga
[14:58:28] yeah, iotop looks pretty reasonable
[14:59:40] and, it recovers
[15:00:37] (meeting)
[15:22:59] godog: cdanis: sorry about missing the meeting!
[15:23:04] i was here, but didn't realize it was happening
[15:23:15] my headphones were plugged in but i wasn't wearing them, so i didn't get any sound pings :(
[15:24:31] hehe, no worries ottomata
[15:25:41] np :)
[16:20:09] there are 2 patches on the puppet queue, is that intended? it is alerting
[16:21:30] one is jijiki's, which I think he is on; the other is from Jhedden
[16:21:38] yes
[16:21:42] I was about to ping jeh
[22:07:10] mutante: still having issues with wmf-reimage, or was it all fixed/explained already?
[22:14:33] I'm going off in a few, but feel free to leave a message and I can look at it ;)
[22:29:08] volans: yes and no. i got my task finished on the second attempt so it's not urgent, but it also did not finish completely cleanly. the issue was just in detecting that the last puppet run finished. so if i ignore the script output i could still get to the server and use it
[22:29:38] i could make a ticket, but nothing needs to be done now
[22:29:40] and thanks
[22:29:50] if it says it failed, it means that puppet didn't exit with the expected exit code
[22:29:58] it just timed out
[22:30:16] waiting to fetch the puppet state
[22:30:16] from which host did you run it?
[22:30:22] while it had actually finished
[22:30:25] cumin1001
[22:30:33] which OS did you install?
[22:30:38] stretch
[22:30:39] and the hostname too, please
[22:30:43] restbase1017
[22:31:07] the first attempt a day before was slightly different i think, because then i could not ssh to the machine
[22:32:02] * volans looking
[22:32:33] see, the cumin.out file shows that part about failing to fetch the PUPPET_STATE or so
[22:35:13] interesting
[22:36:50] if I run the same command it works fine; I'm wondering if something was not set up correctly at that time, but given that the timeout is long, if a second puppet run would have fixed it, it should have passed anyway within the timeout
[22:37:20] also, that's the check of the puppet run @reboot, after the first puppet run and reboot
[22:37:37] it's just to check that puppet runs successfully at reboot
[22:38:32] and that's actually correct
[22:38:45] puppet failed on that host until 00:00:10 of the 12th
[22:39:14] see https://puppetboard.wikimedia.org/node/restbase1017.eqiad.wmnet (either click Show All or Next until the end)
[22:40:31] mutante: ^^^
[22:42:11] and it seems to me that scap was failing
[22:42:43] volans: yes, the puppet failure actually makes sense; that is because urandom had to deploy first
[22:43:05] so that's normal; the reimage script checks that puppet runs successfully at reboot
[22:43:13] and was waiting for a successful puppet run
[22:43:19] that never happened within the timeout
[22:43:35] alright. so this is because it didn't follow the normal workflow of first using role(spare)
[22:43:49] it was a special case where an existing host had to move to another rack and IP
[22:43:49] a host shouldn't need manual intervention on reimage; the puppetization should be able to succeed
[22:43:54] so it already had a role
[22:44:06] still, that should work out of the box
[22:44:32] in theory all hosts should be able to be reimaged into their role and work at the first puppet run
[22:44:48] I know that's not the case for some (many?), but that should be the ideal status
[22:44:51] for all roles
[22:44:57] yea. unfortunately i think there's quite a few that don't
[22:45:04] or need 2 runs
[22:45:28] in this case i don't really know the manual steps, but i'll ask next time
[22:45:29] ideally we should track them and try to fix them
[22:45:36] ok
[22:46:14] what we can do for the reimage script is change "Still waiting for Puppet after" into "Still waiting for a successful Puppet run after"
[22:47:41] yea, that makes sense. was thinking something similar. it could just say "either not finished or failing"
[22:47:48] ack, and thank you for looking
[22:48:10] line 888 of modules/profile/files/cumin/wmf_auto_reimage_lib.py if you have a proposal ;)
[22:48:26] ok
[22:48:32] or i'll make a patch on monday
[22:48:50] alright!
[22:49:03] thanks for reporting it :)
[22:49:25] * volans going off for real now
[22:49:36] * mutante waves ..ttyl
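
For illustration, a minimal sketch of the wait-for-Puppet loop discussed above, using the clearer progress message volans and mutante settled on. This is not the actual code from modules/profile/files/cumin/wmf_auto_reimage_lib.py; check_puppet_state() is a hypothetical stand-in for however the script fetches the host's Puppet state, and the interval and timeout values are assumptions.

    import time

    POLL_INTERVAL = 30   # seconds between state checks (assumed value)
    TIMEOUT = 3600       # give up after this many seconds (assumed value)


    def check_puppet_state(host):
        """Hypothetical helper: return True only if the last Puppet run
        on `host` completed with the expected (successful) exit code."""
        raise NotImplementedError


    def wait_for_puppet(host):
        """Poll until a successful Puppet run is observed or we time out.

        A failing run and a still-in-progress run look identical from
        the caller's side (no successful state yet), so the message
        says so explicitly instead of the ambiguous old wording
        "Still waiting for Puppet after".
        """
        start = time.time()
        while time.time() - start < TIMEOUT:
            if check_puppet_state(host):
                print(f'Puppet run completed successfully on {host}')
                return True
            elapsed = int(time.time() - start)
            print(f'Still waiting for a successful Puppet run after {elapsed}s '
                  '(either not finished yet or failing)')
            time.sleep(POLL_INTERVAL)
        # Timing out here mirrors what mutante hit: Puppet kept failing
        # (the scap deploy had not happened yet) until past the window.
        return False

The substantive change is only the message text: distinguishing "not finished yet" from "failing" tells the operator whether to keep waiting or to go investigate the host, which is exactly the ambiguity that prompted the report.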