[07:19:38] I'll migrate the codfw install server to nftables in about 10 minutes, if anyone is currently running a reimage, let me know and I'll postpone
[07:25:23] moritzm: looks clear according to SAL
[07:29:21] ack, starting now
[07:48:29] reimages in codfw are good to go again
[09:52:51] XioNoX: bjensen
[09:53:07] I am rebooting a redis server, I do not expect much but headsup
[09:53:26] ack, thanks
[09:53:47] XioNoX: bjensen: starting rolling reboots of codfw kafka-main, no expected impact
[11:05:49] Moving on to eqiad kafka main
[14:37:57] fyi, I'll be rebooting deploy1003 at 1700Z (k8s infra window, so it shouldn't clash with deployments)
[17:00:59] rebooting deploy1003 in ~1min, now would be a good time to save your stuff :D
[18:45:54] Raine have you unlocked the keyholder per https://wikitech.wikimedia.org/wiki/Keyholder ? Looks like tchin tried to do a deploy about an hour ago and it failed
[18:46:43] inflatador: oh crap, my bad
[18:46:54] TIL
[18:47:56] Raine no worries, I've never done it myself either ;)
[18:48:16] hey Raine, so it's "sudo keyholder arm" and there used to be multiple deployment keys
[18:48:23] but then we unified most of them into one
[18:48:40] so it's asking for a passphrase but it should all the be that ONE phrase that is in pwstore
[18:49:07] like even if it's asking multiple times, it's most likely just repeat the same one
[18:49:11] Did it alert? I don't see anything in icinga
[18:49:52] I dont know, I am just saying this as general help because I remember it from migrating deployment servers in the past.
[18:49:54] done
[18:50:34] I didn't see an alert
[18:50:54] there used to be an alert for this, that's true
[18:51:02] "keyholder not armed", I remember it
[18:51:12] should it be armed on the inactive deploy host too?
[18:51:33] hmm.. probably not
[18:51:41] there is an '18 unarmed Keyholder key(s) on deploy2002:9100' alert on alertmanager
[18:52:20] but eqiad is primary
[18:52:32] is the alert only on codfw?
[18:54:33] the puppet code seems to be modules/keyholder/manifests/monitoring.pp - i don't see the "if active server" logic there.
[18:54:40] not entirely sure if feature or bug
[18:55:32] tbh I suspect the alert in eqiad just cleared before I had time to look
[18:55:40] unrelated to the above, interesting thing while running cookbook, did something change with the alias param?
[18:55:49] ValueError: Alias (ncredir-magru) does not match allowed aliases: A:ncredir, [...], A:ncredir-magru
[18:55:57] but --query A:ncredir works, but --alias ncredir doesn't work like it did
[18:56:19] but even with the alert existing, it probably gets buried with the ten million other active alerts around
[18:56:29] ^
[18:57:35] sukhe: I ran into that before, but haven't yet gotten the chance to get it deployed: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1269402
[18:57:36] I didn't see anything fire in #operations but that's just looking with my eyes
[18:57:49] seems like first is to define if it should be armed on inactive server or not.. and then handle monitoring based on that
[18:58:07] that it did not fire on active server feels like wrong either way
[18:58:12] the passive deployment host also needs to be rearmed after a reboot
[18:58:24] moritzm: ah, thank you
[18:58:24] ok, ack
[18:58:25] it was simply forgotten when it was rebooted earlier
[18:59:34] what taavi said then, it probably did alert but nobody was notified
[19:00:17] sukhe: feel free to merge if it unblocks you, otherwise I'll do it tomorrow
[19:00:32] moritzm: no worries, we did --query and that worked of course
[19:00:45] would you recall why we added A: in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1235794?
[19:00:59] * Raine doing deploy2002 too
[19:01:28] just making sure we are not missing anything
[19:01:42] though, it was only an issue with the ncredir one so we should be good
[19:02:42] tbh I'd expected those alerts to have made it to IRC, as they have severity: critical and there's no specific alertmanager routing rules for the team+severity pair in the config, but it seems to only have made it to the 'default' receiver (i.e. nothing)
[20:58:42] anyone have a quick minute to stamp https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289420
[21:01:47] thanks Keith!
[21:01:54] np!
[21:25:30] 😅 anyone for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289424 ?
[21:31:43] thanks Scott :)
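
Note on the --alias error quoted at 18:55:49: the message shows a bare alias (ncredir-magru) being compared against allowed values that carry an "A:" prefix (A:ncredir-magru). The sketch below is only an illustration of how that kind of mismatch can arise and one way to tolerate both spellings. It is not the cookbook code and not the content of either Gerrit change linked above; the function names and the ALLOWED set are hypothetical.

```python
# Hypothetical sketch: NOT the actual cookbook implementation.
# ALLOWED is an assumed subset; the real list is elided ("[...]") in the log.
ALLOWED = {"A:ncredir", "A:ncredir-magru"}

ALIAS_PREFIX = "A:"


def strict_check(alias: str, allowed: set) -> str:
    """Exact-match check: a bare alias fails if allowed values are prefixed."""
    if alias not in allowed:
        raise ValueError(
            f"Alias ({alias}) does not match allowed aliases: "
            + ", ".join(sorted(allowed))
        )
    return alias


def normalize_alias(alias: str) -> str:
    """Add the 'A:' prefix when the caller passed a bare alias name."""
    return alias if alias.startswith(ALIAS_PREFIX) else ALIAS_PREFIX + alias


def tolerant_check(alias: str, allowed: set) -> str:
    """Normalize first, then apply the same exact-match check."""
    return strict_check(normalize_alias(alias), allowed)
```

With this sketch, strict_check("ncredir-magru", ALLOWED) raises a ValueError of the same shape as the one in the log, while tolerant_check accepts both "ncredir-magru" and "A:ncredir-magru". Whether the real fix normalizes the input or changes the allowed list is not stated in the conversation.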