[06:47:21] 10Traffic, 10MediaWiki-General, 10SRE, 10Browser-Support-Apple-Safari, 10Patch-For-Review: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10Gilles) This started happening because Safari 14 is supposed to support...
[07:42:55] 10Traffic, 10Prod-Kubernetes, 10Pybal, 10SRE, and 2 others: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10JMeybohm) Very cool! >>! In T238909#7023429, @akosiaris wrote: > [] Look into switching to `"externalTrafficPolicy":"Local"` in...
[07:44:45] 10Traffic, 10Prod-Kubernetes, 10Pybal, 10SRE, and 2 others: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) >>! In T238909#7028764, @JMeybohm wrote: > Very cool! > >>>! In T238909#7023429, @akosiaris wrote: >> [] Look into sw...
[09:31:45] 10Traffic, 10SRE, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10jbond) >>! In T265904#6722401, @jbond wrote: >> note: I created a [[ https://tickets.puppetlabs.com/browse/FACT-2843 | bug against facter4 ]] which is related > > FYI this has been resolve...
[11:09:52] XioNoX: quick q: how much time should pass before routes advertised by a BGP node that has failed (simulated via ip link set down) are retracted by cr*? kubestage4 has a hold timer of 30 but in our tests with jayme we observed considerably more. Stopwatch says 2.5m
[11:10:12] mind you that ECMP is enabled, but that should not matter, should it?
[11:49:54] alternatively, should we go for bfd? calico doesn't support it but it looks like we could add support and upstream it
[12:05:22] 10Traffic, 10Commons, 10MediaWiki-Uploading, 10SRE, and 3 others: Various errors when trying to upload large files (Could not acquire lock, Service Temporarily Unavailable, 503 Backend fetch failed, 502 Next Hop Connection Failed) - https://phabricator.wikimedia.org/T280926 (10akosiaris) p:05Triage→03Me...
[12:32:31] akosiaris: yeah, 2.5min is not normal
[12:32:57] ecmp should not come into play either, indeed
[12:34:53] akosiaris: bfd is needed; of course it depends on what is likely to crash, but as we do multihop BGP, the router would not even detect the server going down (vs. if we were to peer with the TOR, which is not supported yet in our infra)
[12:36:21] but first we need to figure out that 2.5m issue :) Maybe there is an issue where cr1 thinks there is still a valid route through cr2 and the other way around
[12:37:24] it's pretty easy for me to reproduce btw. I can also explain what we did if that helps
[12:45:20] akosiaris: for sure, do you want to re-do it today?
[12:46:19] XioNoX: I've got an errand to run in 45 minutes. Will 45m be enough?
[12:47:36] akosiaris: it's more than 2.5min, so I hope :)
[12:47:43] lol
[12:48:06] akosiaris: let's try at least, maybe it's going to be a rabbit hole :)
[12:48:46] sure. Let me know when you are ready. I am already set
[12:50:14] akosiaris: which device are you testing with? peer IP, prefix? :)
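(For reference, the negotiated timers can be sanity-checked on the router directly; this is just the standard Junos operational command, using kubestage2001's peer IP from the exchange below as an assumed example, so treat it as a sketch rather than output from this session:)

    show bgp neighbor 10.192.0.195 | match "Holdtime|Last State|Peer"
    show route receive-protocol bgp 10.192.0.195 table inet.0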
[12:50:56] ab -s 3 -c 2 -n 5000 https://10.192.76.197/_info on deploy1002 and deploy2002 (I am lucky enough that they get hashed to the 2 different nodes of the cluster - kubestage1002 and kubestage2002)
[12:51:13] and at some point I just do ip link set down dev eno1 on kubestage2001
[12:51:24] simulating sudden and complete failure
[12:52:08] so the prefix is 10.192.76.197/32 advertised by kubestage2001 (10.192.0.195) to cr1/2-codfw?
[12:52:36] when the node dies, for the next ~2m and 30s ab/curl don't work (on 1 of the 2 deploy hosts; the other one is fine as it gets hashed to the other kubestage host)
[12:52:41] it's a /24 but yes
[12:52:51] 10.192.76.0/24 that is
[12:53:00] but I am only testing with that single IP for now
[12:53:06] akosiaris: can you stop the advertisement on one, so we don't work with ECMP
[12:53:51] not sure I can ...
[12:54:28] XioNoX: it would be easier to revert the change adding multipath on junos I think
[12:55:03] akosiaris: or kill the bgp session to the other node
[12:55:24] if we only roll back ecmp it will work as active/passive
[12:57:06] I don't follow
[12:57:38] what I was testing was the time it took to retract the failed node from the prefix so that all traffic flowed to the node that was ok
[12:57:51] that's the thing that took 2.5m to happen
[12:57:59] akosiaris: right now the routers know two paths to 10.192.76.0/24
[12:58:13] I want to remove one of the two
[12:58:44] so we can test the withdrawal time for a single node first
[12:59:08] ok, let me firewall bgp off on the 2nd node then (I don't think I have a good way of removing a single node and guaranteeing it will not suddenly be added again automatically)
[12:59:25] akosiaris: I can remove it from the router side if that's ok with you
[12:59:44] ah, that would indeed be easier, yes please do
[13:00:25] done, kubestage2002 is out of the picture
[13:00:51] ok, let me restart some pods on the 1st node then, because the pods on 2002 are now useless
[13:02:10] ok, ready
[13:02:44] akosiaris: is it normal that I can't mtr to the test IP from bast2002?
[13:03:09] yeah, it's not responding to pings
[13:03:23] akosiaris: let's try it like that for now
[13:03:34] there is nothing to respond to pings, it's something we might have to live with if we go this route
[13:03:57] anyway, ready to unplug kubestage2001 (.195) whenever you are
[13:04:43] akosiaris: alright, pull the plug
[13:05:08] ok, both ab runs failed
[13:05:23] cr2 still has the route, waiting
[13:05:23] I still see the routes
[13:06:23] 1m mark
[13:06:31] oops sorry ma.rk
[13:06:57] the router knows the peer is dead
[13:07:01] but still has the routes
[13:07:20] weird, isn't it? 2m mar.k
[13:07:31] and now it's done
[13:07:36] it no longer has the route
[13:07:54] somewhere shy of 2m and 10 secs I think
[13:07:59] akosiaris: ok, I collected debug
[13:08:06] I'll go through it, we can roll back
[13:08:15] cool, thanks!
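(For a next run, a rough client-side way to put a number on that gap instead of a stopwatch; a minimal sketch assuming the same test VIP and that curl is available on the deploy hosts:)

    # run on deploy1002/deploy2002 just before taking eno1 down on the kubestage node
    vip=https://10.192.76.197/_info
    while curl -skf -m 2 -o /dev/null "$vip"; do sleep 1; done   # wait for the first failure
    fail=$(date +%s)
    until curl -skf -m 2 -o /dev/null "$vip"; do sleep 1; done   # wait until the surviving node takes over
    echo "blackholed for roughly $(( $(date +%s) - fail ))s"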
[13:08:22] I am bringing the iface up again
[13:09:29] and I re-enabled the 2002 node sessions
[13:10:01] btw, that cluster is a testbed, it runs all our normal workloads but it's never pooled into any traffic nor does any dev use it
[13:10:13] so we can do whatever you want there for debugging
[13:10:25] akosiaris:
[13:10:28] NLRI we are holding stale routes for: inet-unicast
[13:10:28] Time until stale routes are deleted or become long-lived stale: 00:01:32
[13:10:28] Time until end-of-rib is assumed for stale routes: 00:04:32
[13:10:51] Stale prefixes: 2
[13:11:19] it might be because of gracefulrestart
[13:11:53] oh, and I saw that and then discarded it because I said "Oh, we are most definitely not doing graceful restart"
[13:12:03] if you take the interface down, are you sure it's the same as unplugging the cable? or does bgp have time to say "control plane is going down, please hold the routes for my data plane"?
[13:13:07] it's the same, it doesn't send an event anywhere as far as I know, so bgp never finds out
[13:13:29] but we can do the same test by shutting down the port on the switch
[13:25:47] akosiaris: so yeah, as a test you can disable graceful-shutdown for this group with: https://www.juniper.net/documentation/us/en/software/junos/bgp/topics/ref/statement/graceful-shutdown-edit-protocols-bgp.html
[13:26:37] akosiaris: or disable it on the server side altogether?
[13:34:39] akosiaris: so yeah, as soon as gracefulrestart is sent in the BGP negotiation, it will consider any "host down" as a graceful restart and wait longer. So you should remove gracefulrestart from the server side, at least until bfd is implemented. Depending on how BFD works on the server side: if it can be deployed to monitor the health of the control plane or if it's tied to the routing daemon.
[13:36:05] ah, so it's calico that is announcing in the initial packets that it supports graceful restart, and junos obliges
[13:36:36] yup
[13:37:45] interesting. We probably want that (we haven't tested the scenario of doing a graceful restart but we should)
[13:37:51] before removing it we need to figure out if it's needed or not, of course. Maybe calico restarts its BGP daemon regularly for other tasks
[13:38:34] yeah, chances are we actually want it, e.g. during routine node maintenance
[13:38:52] hmm, I see there is a task at https://github.com/projectcalico/calico/issues/2211 since 2018
[13:39:01] I'll read up once I am back
[13:39:06] bbl, ttyl
[13:50:17] 10Traffic, 10Prod-Kubernetes, 10Pybal, 10SRE, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10JMeybohm) During testing today, we had some sideline issues because calico-node was dying (as we brought down the network inter...
[15:44:32] 10Traffic, 10SRE, 10ops-eqiad: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10Cmjohnson) 05Open→03Resolved replaced cpu1 and cleared the idrac log, resolving, if the issue returns please re-open.
[15:55:15] hello folks, is it ok to repool cp1087? (hw maintenance done)
[16:18:10] elukey: yes, if it's up and all-green in icinga!
[17:02:04] bblack: looks like it is good, proceeding with the repool! :)
[17:02:24] 10Traffic, 10Prod-Kubernetes, 10Pybal, 10SRE, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) > [] Simulate node failures and record/evaluate recovery times We 've looked into this with @JMeybohm. We 've notic...
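(For reference, the router-side test knob discussed at 13:25/13:34 would look roughly like the below; "kubestage" is a hypothetical group name, and this only stops the routers from honouring the peers' graceful-restart capability for that group, so it is a sketch for a follow-up test, not the agreed fix:)

    # cr1/2-codfw, configuration mode; group name is a placeholder
    set protocols bgp group kubestage graceful-restart disable
    commit confirmed 5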
[17:03:13] 10Traffic, 10SRE, 10ops-eqiad: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10elukey) To keep archives happy: repooled after a chat with Brandon :)
[17:10:22] elukey: thank you!