[13:52:03] topranks: thanks for looking at those graphs yesterday night. I agree 99% success rate is not concerning, I'm (slightly) worried about the errors in T382220 that seem to be correlated with the probe failures
[13:52:03] T382220: KernelError Server cloudgw1002 may have kernel errors - https://phabricator.wikimedia.org/T382220
[13:52:47] any idea on what could possibly cause them? is it worth rebooting the host like andrewbogott was suggesting yesterday?
[13:54:38] just found that another kernel error was logged last night, again around 00:30 UTC like the night before
[14:05:15] I mean a reboot isn't a bad idea
[14:05:23] I replied on task there, TL;DR I've no idea
[14:05:40] but also if we have no signs of other problems I'm not sure how worried we need to be about it
[14:08:00] the NIC firmware is ok on that host - or at least it's the normal one we use everywhere
[14:08:55] ok thanks. I'm fine with waiting until tomorrow to see if these errors keep reoccurring
[14:09:10] that would also give us a clearer idea whether reboot helps or not
[14:12:29] A reboot is no harm, we should just go ahead and do that, it won't hurt
[14:13:02] as taa.vi said yesterday it's all keepalived so it'll fail over ok and quickly so it can be done without a maintenance window
[14:13:20] that said it's better to manually failover before the reboot though I don't know how to do that tbh
[14:26:15] me neither, I found this https://www.virtualtothecore.com/manual-failover-of-keepalived/ but seems more complicated than I would like
[14:37:48] on routing gear the typical thing to do is just to change the priority and have pre-empt enabled, but I'm not really familiar with the setup here
[14:44:18] that might work, I see we have nopreempt in /etc/keepalived/keepalived.conf
[15:55:12] yeah so probably if that was instead "preempt" or the line wasn't there, then changing the priority would work
[15:55:40] snooping on arturo's bash history I think probably a simple "sudo systemctl stop keepalived" would do it even
[16:04:28] andrewbogott pinged a.rturo on telegram and he will probably pop by later this week, so the plan is to wait until then just for extra safety
[16:05:15] I tried changing the priority but it's set with puppet, I guess I could disable puppet, but maybe there's a simple way
[16:05:46] also I was confused because the logs show that there was a failover 2 nights ago, but then it failed back to the highest priority one, apparently ignoring the "nopreempt"
[16:07:10] tl;dr let's wait for a.rturo's feedback on this
[16:07:42] I'll also add some notes on wikitech for the future, if we find out the "correct" way to do this
[16:08:21] I sent a calendar invite + followup email scheduling a call 24 hours from right now.
[16:27:04] andrewbogott: thanks, I received the invite and my plan is to join you in the videocall, and have my terminal ready to act
[16:27:15] sounds great, thank you!
[16:27:25] if I recall correctly this is the second hardware failure in cloudgw1002
[16:27:52] this may indicate a more persistent HW problem? if so we may consider just replacing the server?
[16:28:38] possibly. I don't think we have much of an idea about what's happening. The task is https://phabricator.wikimedia.org/T382220
[16:29:30] that sounds like a hardware problem in the NIC
[16:32:47] yeah same problem
[16:32:49] T376589
[16:32:50] T376589: cloudgw1002: network interface problem - https://phabricator.wikimedia.org/T376589
[16:33:26] I guess this supports the idea of the replacement
[16:42:11] I suppose the nic is on the main board so can't be swapped
[17:02:00] dhinus: can I get a quick +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1105020 before you go?
[17:02:42] in a meeting, I'll try to have a look
[17:03:46] thx
[17:03:53] I'm in the meeting too :)
[17:36:57] your patch looks good but I wanted to look at it more carefully... I'm running an errand but I'll +1 when I'm back :)
[17:38:16] ok, it can definitely wait until tomorrow, don't miss dinner on my account
[21:33:33] bd808: at some point last week you mentioned something about wmf running a static site w/search, can you remind me what it was?
[21:37:16] andrewbogott: https://developer.wikimedia.org/
[21:37:57] nice. Is that fuse.js?
[21:39:29] https://lunrjs.com/ is the backing tech. We get it through the static site generator's search plugin. https://www.mkdocs.org/user-guide/configuration/#lang
[21:40:32] thanks, will read
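
Note on the keepalived failover discussion (14:37 to 15:55): the reasoning there, that lowering the priority only triggers a failover when preemption is enabled, while stopping keepalived on the active host forces one either way, can be sketched as a small decision function. This is a simplified model of the generic VRRP backup logic, not the cloudgw configuration: the function name, parameters and priority values are hypothetical, and it ignores timers and the special address-owner priority of 255.

```python
def backup_takes_over(backup_priority, master_priority, preempt, master_alive):
    """Return True if a VRRP backup would become master.

    Simplified, illustrative model: no timers, no address-owner case.
    All names and values here are assumptions, not taken from the hosts.
    """
    if not master_alive:
        # The master stopped advertising (e.g. keepalived was stopped on it):
        # the backup takes over regardless of the preempt setting.
        return True
    # The master is still advertising: only a preempting backup with a
    # strictly higher priority takes over.
    return preempt and backup_priority > master_priority


# Lowering the master's priority below the backup's, with nopreempt set:
print(backup_takes_over(100, 50, preempt=False, master_alive=True))
# Same priorities, preemption enabled:
print(backup_takes_over(100, 50, preempt=True, master_alive=True))
# keepalived stopped on the master, nopreempt still set:
print(backup_takes_over(100, 150, preempt=False, master_alive=False))
```

Under this model only the second and third cases lead to a failover, which matches the suggestion in the log that stopping the service may be the simplest option while "nopreempt" is in place. It does not explain the fail-back observed at 16:05:46.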