[13:37:14] So I have added three nodes to our codfw cluster, and they mostly work; the problem is that the storage initializer of our isvcs can't talk to Thanos-Swift:
[13:37:19] botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=articlequality%2Fenwiki%2F20230824150035%2F&encoding-type=url"
[13:41:43] it is strange, in theory the thanos-swift IPs are listed among the ones that "escape" the istio iptables rules, since they are used by an init container
[13:41:55] (only newer versions of istio support the use case, IIUC)
[13:42:38] is there a way to see a failing pod in action?
[13:43:56] the annotation is traffic.sidecar.istio.io/excludeOutboundIPRanges: 10.2.2.54/32,10.2.1.54/32
[13:44:09] klausman: --^
[13:44:54] I can pastebin the log of a failed storage-init attempt, sec
[13:45:42] Also just spotted this: `socket.gaierror: [Errno -3] Temporary failure in name resolution`
[13:45:54] https://phabricator.wikimedia.org/P67297
[13:46:16] I mean, yes, it's always DNS, but why, in this case?
[13:48:31] any chance that it got scheduled on 2009?
[13:48:44] I noticed its calico container not running, tried to restart it
[13:49:05] The failures I saw first were on 2009, yes
[13:49:10] First host I uncordoned
[13:49:45] yeah so bird is not getting to a healthy state
[13:49:48] None of the ingress-gw and ingress-gw-services pods are running, either
[13:50:00] so the pods on 2009 are not able to contact coredns
[13:51:05] bird: Node_2620_0_860_111__1: Error: Bad peer AS: 64811
[13:51:20] The core routers don't allow BGP
[13:53:07] Doing these: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes#Adding_a_node
[13:53:30] (only for 2009 for now)
[13:54:15] sorry, I posted in dcops and not in here
[13:54:20] yes, the BGP flag is missing
[13:54:34] I saw :)
[13:54:42] Running homer diff
[14:30:38] Ok, enabled BGP on all three hosts, and used homer to update the switch config. 2009 works fine (one pod on it). The other two have ingress-gw/-services at 0/1 ready. Could that be due to no pods ever having been there?
[14:30:59] (the calico-node pods are 1/1 on all three)
[14:33:14] nope, they should be up and running
[14:34:28] {"level":"warning","time":"2024-08-14T14:30:58.355958Z","scope":"envoy config","msg":"StreamAggregatedResources gRPC config stream closed since 7073s ago: 14, connection error: desc = \"transport: Error while dialing dial tcp: lookup istiod.istio-system.svc: i/o timeout\""}
[14:34:33] Hmmm. One thing I noticed was that *cr*codfw* didn't have a diff for 2009, but 2009's TOR switch did
[14:34:55] So I committed that, and assumed it'd be the same for 2011 and 2010
[14:35:12] diffing *cr*codfw* now
[14:56:01] any luck?
[14:56:27] Nope. cr diff is empty. Tried rebooting ml-serve2010 to clear any bad state, also no luck
[14:56:55] in the istio gw logs I found `read udp 10.194.23.197:50591->10.194.0.3:53: i/o timeout`
[14:57:02] so that's DNS again
[14:58:03] Odd that 2009 works but 201x don't
[14:58:22] and you saw diffs for all three hosts on cr*-codfw?
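
A minimal triage sketch of the checks discussed above, assuming shell access to the cluster and a deploy/cumin host; the pod selector and homer device patterns are illustrative, not the exact commands that were run:

```
# Where did the failing isvc pods land? (the grep pattern is a guess, adjust as needed)
kubectl get pods -A -o wide | grep -i articlequality

# Is calico-node actually Running/Ready on the suspect host?
kubectl -n kube-system get pods -o wide | grep ml-serve2009

# After enabling the BGP flag for the host (per the add-a-node runbook), push the
# network config with homer; depending on the row, the peer is a core router or the TOR switch:
homer 'cr*codfw*' diff
homer 'lsw1-b7-codfw*' diff
```
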
[14:58:47] No, none for cr, only the TOR switches (lsw1-c2 etc)
[14:59:30] redoing those diffs, in case I forgot a commit
[15:00:04] no diff for the TOR switches of 2010/2011
[15:00:16] trying one thing
[15:00:23] (lsw1-c2-codfw and lsw1-d2-codfw)
[15:02:08] I deleted the 201[0,1] calico pods to regenerate them, since they may have been in a weird state
[15:02:15] and also deleted the istio gw pods in 0/1
[15:02:37] ack, I had deleted the istio stuff, but that didn't help
[15:03:05] the istio stuff?
[15:03:16] istio-ingressgateway/-services pods
[15:04:44] the only diff atm is that scheduling is disabled
[15:05:56] let me uncordon 2010, see what happens
[15:06:38] done, no change so far
[15:07:43] found nothing in the calico-node pod logs that sheds any light
[15:09:09] I am on cr2-codfw, and show bgp neighbor doesn't show anything for ml-serve2009+
[15:09:22] not even 2009?
[15:10:04] oh, right. that had no diff, so it would be lsw1-b7-codfw
[15:11:18] The docs for this k8s BGP stuff mentioned that on some rows the BGP session is with the TOR switch instead of the cr ones, so when I got no diff on cr after enabling BGP on 2009, I checked against lsw1-b7-codfw, got a credible diff, and committed that
[15:11:27] And then similar for 2010 and 2011
[15:12:08] makes sense, at this point we should verify the BGP configs for 2010
[15:12:47] let me pastebin the diff
[15:13:42] https://phabricator.wikimedia.org/P67313
[15:16:30] ok yes, now I am on the TOR and I see the BGP neighbor for 2010, but it is in state Active, not Established
[15:16:55] I am not sure what the semantic difference there is.
[15:17:14] XioNoX: o/ around?
[15:17:29] elukey: what's up?
[15:18:16] XioNoX: Tobias just added the BGP config for a new k8s node, ml-serve2010, on lsw1-c2-codfw via homer, but show bgp neighbor lists the session in "Active"
[15:18:33] klausman: active means actively *trying* to work :) established means all good
[15:18:40] I haven't seen the new L3 TORs in codfw yet, is there anything extra to do?
[15:18:56] elukey: there shouldn't be, in theory
[15:19:40] ack. I basically went by the wikitech instructions for new nodes, modulo pointing at the TOR switches instead of core when I got no diff for core.
[15:19:41] calico seems relatively happy on the host, maybe something is buried in its logs
[15:20:11] I grepped the calico logs with `grep -v INFO`, but got nothing that would explain why 2009 works and 2010 doesn't
[15:22:55] `ml-serve2010:~$ sudo netstat -nlpt | grep 179` is empty, nothing listening on the BGP port?
[15:24:26] XioNoX: I was restarting calico, it should be up now
[15:24:45] elukey: still no bird
[15:25:34] might be inside a… bird/bird6 are running
[15:25:47] oops, sentence fragments falling out of my brain
[15:26:04] bird/bird6 are running on 2010, lemme see their open ports
[15:26:45] ps shows that it runs with `bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg`
[15:26:53] but then `/etc/calico/confd/config/bird.cfg: No such file or directory`
[15:26:57] and now it's gone again
[15:28:41] I think they run in containers, so their fsroot is different from the base machine
[15:29:04] fair, any idea where to find its logs?
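
A quick sketch of the node-side checks that separate "bird is running but has no peers" from "bird is actually peering"; the container name is the one quoted in the session above, and the switch-side command is standard Junos rather than something quoted here:

```
# per the discussion above, bird typically only listens on tcp/179 once it has BGP peers configured
sudo netstat -nlpt | grep ':179'

# the peers bird knows about are in the confd-rendered config inside the calico-node container
sudo docker exec k8s_calico-node_calico-node-6bs4t_kube-system_60f6af95-be31-4b76-9fae-f6c3f2ce67f6_0 \
    cat /etc/calico/confd/config/bird.cfg

# on the TOR/core router (Junos): "Established" is healthy, "Active" means still trying
# > show bgp summary
```
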
[15:29:24] on the working 2001, there is no '/etc/calico/confd/config/bird.cfg' or '/etc/calico/confd/config/bird6.cfg' either
[15:29:27] logs: sec
[15:30:02] if you compare with 2009, bird is listening on port 179 there
[15:31:11] sudo docker logs k8s_calico-node_calico-node-6bs4t_kube-system_60f6af95-be31-4b76-9fae-f6c3f2ce67f6_0 I think
[15:32:40] klausman: it's possible that bird starts properly but doesn't have its BGP peers configured, and thus doesn't try to start its BGP daemon
[15:33:00] that's probably why the process is running but nothing is listening on 179
[15:33:22] it's also crashlooping or sth (or Luca is restarting it), since the PID changes
[15:34:54] I am not restarting it anymore
[15:35:23] ml-serve2010:~$ sudo docker exec k8s_calico-node_calico-node-6bs4t_kube-system_60f6af95-be31-4b76-9fae-f6c3f2ce67f6_0 cat /etc/calico/confd/config/bird.cfg
[15:35:25] strace of the bird process shows sendto/recvmsg happening on FD 4
[15:35:29] doesn't show the BGP peers
[15:37:49] y'know what I just realized? maybe somewhere in puppet (or similar) there is an ml-serve200* pattern, and thus 201* don't get the right bits
[15:38:55] klausman: or add stuff there: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/common-bgp.yaml
[15:39:08] * klausman stares at `/^ml-serve20(0[1-9]|1[01])\.codfw\./`
[15:40:05] XioNoX: ah wait, not all the lsw TORs are listed in there yet
[15:40:11] aha!
[15:40:34] 2009 is on b7
[15:40:35] yes, lsw-c2 is not there, I think, and neither is d2
[15:40:40] yeah
[15:40:49] okok, makes sense now
[15:41:09] shall I make a patch?
[15:41:17] klausman: be my guest
[15:44:03] mh, what are the correct IPs? Can I find them in netbox somewhere?
[15:44:03] klausman: the subnets are there: https://github.com/wikimedia/operations-puppet/blob/c1afafe325eacf47d2b74c1cd8cf8acf47ac585b/modules/network/data/data.yaml#L330, just need to use the first IP of the subnet
[15:44:08] ah :)
[15:44:51] there is no c8, as it's the fundraising rack
[15:45:21] we should add "update that file" to one or multiple runbooks, but I'm not sure which ones
[15:55:00] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1062728
[15:57:49] klausman: lgtm!
[15:59:04] How will it be rolled out?
[16:00:13] I presume helmfile -e XXX apply, but what's XXX?
[16:00:24] no idea on that :(
[16:00:41] Ah, *facepalm*
[16:00:53] klausman: check the CI's output, the calico config is changed, so probably admin_ng
[16:00:54] ml-serve-codfw, of course
[16:01:08] you can use the -l name=calico-something
[16:01:47] ack
[16:03:38] helmfile -e ml-serve-codfw -l name=calico diff shows a good diff, applying
[16:04:40] and all is 1/1
[16:05:35] ok, pods starting on all three new hosts
[16:06:02] and all running \o/
[16:06:24] XioNoX, elukey: thanks a bunch for your help!
[16:07:27] nice! time to log off for me then!
[16:11:14] enjoy your evening!
[16:15:53] nice :)
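
For reference, a sketch of the rollout and verification flow described at the end: the helmfile commands are the ones quoted in the chat (run against the cluster's admin_ng helmfile on the deployment host); whether calicoctl is available directly on the nodes is an assumption.

```
# from the deployment host, against the admin_ng helmfile for the cluster:
helmfile -e ml-serve-codfw -l name=calico diff
helmfile -e ml-serve-codfw -l name=calico apply

# then confirm the new peers reach Established, e.g. on one of the nodes
# (assumes calicoctl is installed there):
sudo calicoctl node status
```
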