[07:25:34] I have restarted stashbot but it is not happy :)
[07:25:59] https://sal.toolforge.org/production is unavailable, may someone restart it please using the procedure at https://wikitech.wikimedia.org/wiki/Tool:SAL#Restarting_the_tool
[07:26:00] :)
[07:28:49] and the DNS issue happened again ~10 minutes ago
[07:28:49] 07:17:07 fatal: clone of 'https://gerrit.wikimedia.org/r/performance/excimer-ui-client' into submodule path '/src/lib/excimer-ui-client' failed
[07:30:07] though this time the graphs on https://grafana.wikimedia.org/d/000000240/wmcs-dns-eqiad1?orgId=1&from=now-1h&to=now show a bunch of things happened at that time
[07:30:10] maybe they restarted
[07:31:56] * dhinus paged Host checker.tools.wmflabs.org - PING CRITICAL - Packet loss = 100%
[07:34:51] there is another active alert: virt.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4)
[07:41:45] hi
[07:41:57] I'm around now
[07:43:18] the virt.cloudgw alert has resolved by itself
[07:43:55] I see that stashbot has also rejoined
[07:44:29] I have no idea what happened around 9:30
[07:46:24] ok
[07:47:06] I don't see anything relevant on the SAL
[07:48:27] or in the cloudgw logs
[07:49:28] maybe a network issue?
[07:49:46] the active cloudgw briefly flapped between the 2 nodes
[07:49:48] https://usercontent.irccloud-cdn.com/file/0Hve1frW/image.png
[07:51:22] https://www.irccloud.com/pastebin/SHgl7Vw6/
[07:51:29] this is wrong, and I don't know what that means
[07:53:43] oh, the kernel driver at cloudgw1002 failed
[07:53:48] https://www.irccloud.com/pastebin/WuYxvB6K/
[07:53:54] this machine needs a reboot
[07:54:30] nice catch!
[07:55:36] I'll reboot it to see if the error goes away
[07:55:52] ack
[08:01:50] hashar: I've restarted sal.toolforge.org
[08:04:26] dhinus: thanks!
[08:05:28] the checker.tools alert is no longer firing in icinga, but it was not auto-resolved in victorops.
[08:05:33] I've resolved it manually
[08:05:36] ok
[08:05:47] cloudgw1002 seems stable after the reboot, I'll let traffic flow
[08:06:05] and see if the error repeats with actual traffic
[08:08:11] full alert history from icinga; it only lasted a couple of minutes, I think they're all connected to the cloudgw issue: https://phabricator.wikimedia.org/P69472
[08:08:37] dhinus: yes, all connected
[08:10:36] getting an early start I see xd
[08:11:18] now this alert:
[08:11:19] FIRING: SystemdUnitDown: The service unit cadvisor.service is in failed status on host cloudgw1002
[08:11:25] why do we have cadvisor on cloudgw1002??
[08:13:53] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1078346
[08:17:08] according to wikitech, "As of Aug 2023 all production hosts run cadvisor"
[08:17:16] https://wikitech.wikimedia.org/wiki/Cadvisor
[08:17:50] oh, I see, as a way to query per-cgroup resource usage
[08:18:36] why is it failing?
[08:19:48] I don't fully understand, but here are what seem to be the relevant logs
[08:19:51] https://www.irccloud.com/pastebin/u72tQXCq/
[08:20:27] not being able to bind to the main IP address seems like a race condition
[08:20:32] with networking.service
[08:20:59] from a bit before
[08:21:06] https://www.irccloud.com/pastebin/gGpeuzeJ/
[08:21:42] Oct 07 07:59:37 cloudgw1002 confd[588]: 2024-10-07T07:59:37Z cloudgw1002 /usr/bin/confd[588]: FATAL Cannot get nodes from SRV records lookup _etcd-client-ssl._tcp.eqiad.wmnet on 10.3.0.1:53: dial udp 10.3.0.1:53: connect: network is unreachable
[08:22:24] kind of makes sense, the network was not up until after 8:00
[08:22:24] [Mon Oct 7 08:00:04 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[08:23:37] I see, it took a while to come up after the reboot
[08:23:57] and I guess at least cadvisor.service should declare
[08:23:59] After=network-online.target
[08:23:59] Wants=network-online.target
[08:24:32] probably, it seems to have given up trying to start before the network was up, yes
[08:26:40] oh, conntrack was killed
[08:26:46] https://www.irccloud.com/pastebin/GX8pYgfg/
[08:26:56] that might have caused the flip
[08:27:20] https://www.irccloud.com/pastebin/w5HKfeGb/
[08:28:04] that's the network driver, right?
[08:28:22] yes, we already detected that, see https://phabricator.wikimedia.org/T376589#10205754
[08:39:22] Interesting, I did not see the irc logs until later; the last one I saw before commenting with "getting an early start" was the graph of the flip :/
[08:45:31] I've restarted my laptop completely (just in case, I had a pending upgrade too), though I suspect it might have had no effect, so expect me to have missed some messages
[08:51:29] cloudgw1001 got a timeout on conntrack before the kernel issue (~4min)
[08:51:33] https://www.irccloud.com/pastebin/1C07XE8v/
[08:52:28] most likely it hung trying to connect to the other node
[08:53:00] the other node did not get the issue until 4 min later though
[08:54:29] having hardware problems... it could be anything. Maybe this is the cause of the DNS errors reported the other day
[08:55:33] we could do a test: can we switch the primary to the other node and see if the DNS issues persist?
[08:55:54] the active node is now cloudgw1001
[08:56:00] last week it was cloudgw1002
[08:56:06] (until today)
[08:57:08] true, it did not switch back
[09:00:12] we'll see then; the firmware seems to be ~4y old, might be worth seeing if it can be upgraded I guess
[09:00:43] I agree
[09:11:55] review here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1078350
[11:08:37] topranks: when you have a moment, please review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1077712 (non urgent)
[12:35:20] quick review here: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/91
[12:35:39] ^^^ first time a legit user project deletion is conducted via tofu-infra
[12:36:46] arturo: done
[12:36:53] thanks
[12:37:01] nice!
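(Editor's note: the fix discussed above at 08:23:57, making cadvisor.service wait for the network, is normally done with a systemd drop-in rather than by editing the packaged unit. A minimal sketch follows; the drop-in path and target names are standard systemd, but whether the actual puppet patch took this exact shape is an assumption.)

```ini
# /etc/systemd/system/cadvisor.service.d/wait-for-network.conf
# Hypothetical drop-in: delay cadvisor until the network is actually up.
[Unit]
# Wants= pulls network-online.target into the boot transaction;
# After= orders cadvisor to start only once that target is reached.
Wants=network-online.target
After=network-online.target

[Service]
# Optional: if the NIC link still comes up slowly (as in the bnxt_en log
# above), retry instead of staying in failed state.
Restart=on-failure
RestartSec=10s
```

Note that `network-online.target` only delays anything if a wait-online service (e.g. `systemd-networkd-wait-online.service` or its NetworkManager/ifupdown equivalent) is enabled; after adding the drop-in, `systemctl daemon-reload` is needed for it to take effect.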
[12:37:01] I wonder if it will fail because of connected resources?
[12:37:01] I wonder if we need anything else for domains to go away, etc.
[12:37:16] yeah
[12:37:21] let's try :)
[12:37:21] also, not sure how it worked before
[12:37:31] I know the wmfkeystone hook also kicks in for deletion, so we shall see
[12:37:44] there's a checklist on wikitech for things that need to be deleted when you delete a project
[12:37:52] ok, let me check that out
[12:38:01] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Projects_lifecycle#Deleting_a_project
[12:38:10] thanks
[12:39:09] wow, actually merging the tofu-infra patch is a tiny step in there :-P
[13:09:30] that's a good candidate for a cookbook ;)
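(Editor's note: for readers unfamiliar with the tofu-infra workflow above, a project deletion via OpenTofu amounts to removing the project's resource block from the config and applying the resulting destroy plan. The fragment below is an illustrative sketch only: the resource type is the standard OpenStack provider one, but the name and the actual tofu-infra module layout are assumptions.)

```hcl
# Hypothetical tofu-infra fragment. Before the merge request, the project
# existed as a resource block, something like:
#
#   resource "openstack_identity_project_v3" "myproject" {
#     name        = "myproject"
#     description = "a user project being retired"
#   }
#
# Deleting that block from the config and running:
#
#   tofu plan    # should report the project as "to destroy"
#   tofu apply
#
# asks Keystone to delete the project. As the 12:37 discussion notes,
# dependent resources (instances, DNS domains, web proxies, the
# wmfkeystone hook's side effects) are not handled by this step and
# still need the wikitech deletion checklist.
```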