[14:08:07] Hrm. What would prevent the istio ingress gw from looking up names?
[14:08:21] Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.67.0.3:53: read udp 10.67.22.128:48034->10.67.0.3:53: i/o timeout
[14:08:49] calico not running fine, so timeouts to coredns
[14:09:00] if this is a new node, did you check the BGP config etc..?
[14:09:49] yes
[14:10:18] BGP config is set (netbox), homer ran, ran puppet agent on all existing hosts
[14:10:26] Calico, in turn, can't get a cert
[14:10:46] or at least that's what I saw initially
[14:11:07] what is the problematic node?
[14:11:14] ml-serve1011
[14:12:18] It's currently cordoned, but that shouldn't be an issue, right?
[14:12:57] (I misremembered calico not getting a cert, that's the same ingressgateway, it can't get a cert because DNS doesn't work)
[14:15:42] on lsw1-f5-eqiad the BGP conn for 1011 is in "Active", not "Established"
[14:15:50] so calico is not working as it should
[14:16:13] is lsw1-f5-eqiad supported in deployment-charts?
[14:16:24] hang on
[14:16:26] ah no it is missing
[14:16:30] same problem as the last time
[14:16:37] helmfile.d/admin_ng/values/common-bgp.yaml
[14:16:40] yep, I remembered the moment you mentioned d-c
[14:28:49] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1069225
[14:29:10] The other two hosts I am adding are on asw switches, and so will speak BGP to a core router, AIUI
[14:40:30] reviewed, there is a small error to fix
[14:45:05] yep, the dangers of c&p
[14:50:46] klausman: fyi, I'm merging an admin_ng change as we speak, and will try to get that applied and out of your way asap :)
[14:53:51] ack, thx
[14:59:15] klausman: ah, I see you merged
[14:59:57] so, I'm now seeing your diffs - I'm seeing a couple of BGPPeer objects in the diffs now
[15:03:08] klausman: is it alright if I go ahead and pick these up in (wikikube) codfw / eqiad?
seeing 2 objects (v4/v6) per each of e5 and f5
[15:04:47] yes, that sounds goog
[15:04:49] good*
[15:09:11] klausman: cool, done - I'll let you apply those to the other clusters :)
[15:09:18] ack, will do
[15:57:57] klausman: I'm running homer on cr*eqiad* and it's removing a bunch of ml-serve neighbours
[15:58:26] ml-serve-ctrl100[1-2], ml-serve100[1-4] is that normal?
[15:59:19] hmm and a bunch of dse-k8s workers as well brouberol
[16:00:00] I'm not aware of why it would do that. We're not removing any worker :/
[16:00:58] https://phabricator.wikimedia.org/P68327
[16:02:26] they have bgp set to false in netbox afaict
[16:02:56] I'm on my way out, baby duty. inflatador, would you mind having a look please?
[16:02:59] I can run a script that will add it back
[16:03:05] claime: <3
[16:03:19] they're all supposed to have BGP right?
[16:03:50] brouberol claime :eyes
[16:06:37] [edit protocols bgp group Kubedse4]
[16:06:39] + multipath;
[16:06:41] is tripping me up
[16:07:29] claime we're not making any changes to the dse-k8s env, I can't explain that output
[16:08:07] oh, there was a ticket about adding IPv6 DNS records, but I can't imagine that it matters
[16:09:16] ok... I don't know what happened there, and for multiple servers in two environments
[16:10:22] my take is somehow they ended up with the BGP flag set to false for some reason I can't find in the changelog in netbox
[16:10:32] and homer wasn't run on that cr for a while
[16:11:19] That seems plausible. 100% of our k8s workers use BGP, right?
Regardless of cluster
[16:11:23] The corresponding ml hosts in codfw have the BGP flag set to true
[16:11:28] If they use calico yeah
[16:11:39] I'll run a script to put the flag back
[16:12:12] ACK, we're crazy but not crazy enough to one-off our CNI ;)
[16:13:34] argh my script doesn't work ><
[16:18:05] hmm it doesn't work for virtual machines for some reason
[16:19:13] ok these I can do manually, the workers are done
[16:21:26] yes, I mentioned this in #-foundations, I have no idea why it's removing machines
[16:21:46] I've flipped the flag to true for all of them
[16:21:54] I'm about to run homer again
[16:22:01] ty
[16:22:23] inflatador: basically anything that is running Calico should do BGP
[16:27:13] * inflatador is still waiting for rip-ng to catch on
[16:28:57] ok diff looks a lot more sane now
[16:29:27] https://phabricator.wikimedia.org/P68327
[16:30:01] ok to commit ?
[16:31:08] claime SGTM
[16:39:17] calicoctl shows BGP Established on dse-k8s-worker1009 and ml-serve1009
[16:44:46] \o/
[16:44:51] thanks for taking care of it
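
The rule stated at 16:22:23 (anything that is running Calico should do BGP) can be sketched as a small audit. This is only a minimal illustration, assuming a hypothetical inventory of dicts with `name`, `runs_calico`, and `bgp` keys. It is not the Netbox API and not the script mentioned in the log, and the flag values in the sample data are invented.

```python
from typing import Iterable


def hosts_missing_bgp(inventory: Iterable[dict]) -> list[str]:
    """Return sorted names of hosts that run Calico but have the BGP flag unset."""
    return sorted(
        host["name"]
        for host in inventory
        if host.get("runs_calico") and not host.get("bgp")
    )


# Hypothetical sample data: host names echo the log, flag values are made up.
SAMPLE_INVENTORY = [
    {"name": "ml-serve1001", "runs_calico": True, "bgp": False},
    {"name": "ml-serve1009", "runs_calico": True, "bgp": True},
    {"name": "dse-k8s-worker1009", "runs_calico": True, "bgp": True},
    {"name": "example-host1001", "runs_calico": False, "bgp": False},
]

if __name__ == "__main__":
    for name in hosts_missing_bgp(SAMPLE_INVENTORY):
        print(f"{name}: runs Calico but BGP flag is false")
```

Running a check like this before homer would flag hosts whose BGP flag drifted to false, rather than discovering it from an unexpected neighbour-removal diff.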