[10:21:07] Etsy adopts Vitess for MySQL sharding, migrates in-house PHP ORM. https://www.etsy.com/codeascraft/migrating-etsyas-database-sharding-to-vitess?utm_source=Mastodon
[10:22:58] Took five years :)
[17:04:41] question for people that submit patches to puppet: the puppet CI's commit message validator is unhappy about the length of some of the lines in my commit message... but those lines are ones that contain URLs. :/ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1256301/3
[17:05:00] is there a way of telling it to e.g. ignore the lines with URLs, and/or something like that?
[17:05:51] (i could easily change the first URL to a https://w.wiki/ one, but the second would be hard to reduce the length of without removing the specific section of the page that it's linking to)
[17:08:49] w.wiki is what I have done in cases where I can. I think there are some cases of bit.ly or other URL shortening services but well, we know the problems with that
[17:09:55] Interestingly, "However, do not break URLs to make them 'fit', as this will make them un-clickable; keep them, even if they are longer." is on https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines
[17:15:26] yeah.
[17:15:48] and I was looking at my own commits to confirm, where I have done w.wiki or simply got lucky with the URL length
[17:15:48] had a hunch, tried it; didn't work
[17:15:52] so I do wonder what the solution is
[17:15:57] I wonder if it'd be feasible for the commit message validator's code to recognise links like that in commit messages, to avoid needing to find workarounds (and to adhere to that line in the commit message guidelines :p)? (This is starting to sound like a feature request...)
[17:16:00] it's interesting it's only complaining about 2 of the lines, not all 3
[17:16:00] (for now I've had an idea for my non-Wikimedia links - get rid of the https:// protocol from the start of them)
[17:16:21] ah
[17:16:25] cdanis: i believe that one of the links' lines doesn't exceed 100 chars
[17:17:00] https://gitlab.wikimedia.org/repos/ci-tools/commit-message-validator/-/blob/main/tests/data/GerritMessageValidator/really_long_url_ok.msg?ref_type=heads
[17:17:03] (obviously not ideal to remove the protocol that will probably make them clickable... but, well, I kinda want CI to pass)
[17:17:11] the rule is apparently they have to be on a line by themselves
[17:17:25] oh. so i broke it with my numbering, lol
[17:17:50] thanks for finding that cdanis :)
[17:18:34] that's good to know
[17:19:18] i'll probably just add a newline after [1] and before the actual link itself in that case
[17:20:55] sukhe: let me know if you get a minute
[17:21:04] topranks: hi
[17:22:10] sukhe: you know what, I think I just worked it out
[17:22:32] it was related to the doh thing and conntrack exhaustion on the esams ganeti hosts
[17:22:46] topranks: I was getting ready to blame the network
[17:23:01] why do that when you can just blame team traffic?
[17:23:09] always conspiring against us :)
[17:23:39] :P
[17:23:40] I updated https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines
[17:23:49] topranks: out of curiosity though, what happened?
[17:23:53] actually I didn't answer my question
[17:24:02] sukhe: do you know the background?
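As an aside, the rule the test fixture above implies can be sketched roughly like this (a hypothetical approximation for illustration, not the actual commit-message-validator code; `MAX_LINE`, `URL_ONLY`, and `line_ok` are invented names):

```python
# Rough sketch of the rule discussed above: a commit-message body line
# longer than the limit only passes if it consists solely of a URL,
# so URLs stay unbroken and clickable.
import re

MAX_LINE = 100  # assumed limit, per the "100 chars" remark above
URL_ONLY = re.compile(r"^<?(https?://\S+)>?$")

def line_ok(line: str) -> bool:
    """Return True if a commit-message body line passes the length check."""
    if len(line) <= MAX_LINE:
        return True
    # Over-length lines are exempt only when the whole line is a URL.
    return bool(URL_ONLY.match(line.strip()))
```

Under this reading, prefixing a long URL with a footnote marker like `[1] ` makes the line fail, which matches the "broke it with my numbering" observation below.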
[17:24:46] topranks: I read the chat in the morning, yep
[17:24:59] ok so you know the basics
[17:25:00] (and left some messages in -sec as well)
[17:25:23] due to some internet routing change a bunch of users who were getting routed to drmrs for wikidough started getting routed to esams
[17:25:26] including that moritz bumped up the conntrack limit for esams to adjust for the routed ganeti bit where it's forwarding traffic
[17:25:36] this is likely due to their ISPs, failures somewhere, bgp policy updates somewhere on the internet
[17:25:40] out of our control and we don't really care
[17:26:03] ok
[17:26:08] so - the major difference for us of the Wikidough VMs at each site is that in drmrs they are on our legacy ganeti setup
[17:26:16] topranks: question that I still need to check. is it a bunch of users from drmrs, or all users from drmrs?
[17:26:21] that uses a bridge device on the server and just forwards L2 frames from NIC to VM
[17:26:23] checking graphs
[17:26:34] this doesn't pass through L3 filters in nftables etc
[17:26:53] the routed ganeti is different - the packet is sent to the ganeti host by the switch (the ganeti host announces it in bgp)
[17:27:18] and then the ganeti host acts as a router: the packet goes through the forward nftables chain and gets sent to the VM
[17:27:31] yep, I read your backlog and it makes sense
[17:27:36] which means every flow gets a conntrack entry created
[17:27:52] the usage looks "normal", or at least what we've had in drmrs and it's not been an issue
[17:28:07] I'm a little confused now though, I need to dig in more
[17:28:21] both ganeti hosts in esams have over 300k conntracks now
[17:28:32] yet the doh VMs only have ~5k each
[17:28:58] when I said "i know what it is" was when I realised the doh VMs do normal recursive outbound on UDP 53, and assumed we must have a long timeout for udp on the hypervisor side
[17:29:11] but the timeouts seem the same... so I don't know how to explain this.
I'll dig a little more
[17:29:39] ah, yes, they do a full recursive lookup with a completely different pdns-rec instance
[17:30:05] as in it does not hit the anycast internal referrer (we kept that separate to avoid cache poisoning attacks, generally distinct from prod infra, etc)
[17:30:24] yeah that makes total sense
[17:32:05] thanks for looking into this. the beer counter is overflowing :)
[17:33:20] well we don't want it to fail again over the weekend
[17:33:39] so yeah, all the conntracks on the ganeti side are TCP 443, so presumably client doh connections
[17:33:55] it seemed like we have sufficient conntrack for now but worst case we can depool esams over the weekend
[17:33:58] that's totally fine IMO
[17:34:18] I'm not sure we can say that. we lifted a limit, it continues to rise
[17:34:34] I see
[17:34:36] every sign says we'll just hit that new limit at some point
[17:34:57] ok that makes sense then. so yeah, depooling is simply stopping bird on these hosts and downtiming, as you know
[17:35:06] so if we can't figure it out, we can simply depool and move on for now
[17:35:18] conntrack timeouts are basically the same on the VM and the hypervisor
[17:35:54] but why are there 300k conntracks for connections to doh3006 on ganeti3006, but only a few hundred on doh3006 itself?
[17:36:04] there is no conntracking on the VMs themselves as we disable it fwiw
[17:36:08] yeah we can also depool
[17:36:18] sukhe: ah ok that's the answer
[17:36:41] firewall::service { 'wikidough-doh':
[17:36:44]     notrack => true,
[17:36:46] ah, only from outside, it's making conntracks for connections to itself
[17:36:48] tcp 6 49 SYN_SENT src=185.71.138.138 dst=185.71.138.138 sport=46954 dport=443 [UNREPLIED] src=185.71.138.138 dst=185.71.138.138 sport=443 dport=46954 mark=0 use=1
[17:36:50] (also a conscious choice as you can imagine)
[17:36:58] which is why there are _some_ conntracks for me to count
[17:37:06] yep...
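The "all the conntracks on the ganeti side are TCP 443" observation comes from tallying entries like the `SYN_SENT` line quoted above. A hypothetical helper for that kind of tally (invented for illustration; not a tool from the chat) could parse `conntrack -L`-style output and count flows per destination port:

```python
# Illustrative only: count conntrack entries by destination port from
# lines in the conntrack -L / /proc-style format quoted above.
from collections import Counter
import re

DPORT = re.compile(r"\bdport=(\d+)\b")

def count_by_dport(lines):
    """Count flows per destination port, using the first dport= field
    of each entry (the original direction of the flow)."""
    counts = Counter()
    for line in lines:
        m = DPORT.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts
```

Running something like this against the hypervisor's table would show the dport=443 entries dominating, consistent with client DoH connections being tracked on the ganeti host rather than on the VM.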
[17:37:35] yeah the answer here is that the ganeti hosts should not be creating conntracks for VM traffic, they should route it through the forward chain statelessly, and the VM itself should filter things
[17:37:45] (same scenario as legacy ganeti with the bridge / L2 setup)
[17:38:09] yep. though, question for you, is this true for all VMs in general behind routed Ganeti or something more selective?
[17:38:20] so we are basically saying no conntrack on ganeti forwarding, if VMs want to do it, they can?
[17:38:27] yeah
[17:38:39] I mean you can configure the hypervisor to conntrack - and make it act like a firewall
[17:38:40] that matches the other setup then, but I am just checking to understand if you wanted something different here on purpose or we simply overlooked this specific bit
[17:38:40] if you want
[17:38:58] but we aren't really doing that. we are tracking connections but not using the conntrack table for anything, just because that is the default
[17:39:05] got it
[17:39:08] it was overlooked
[17:39:38] in general I think it's ok to have the same model as before yeah
[17:39:48] I'll discuss with Ar z\el on Monday, see if he agrees
[17:39:54] ok
[17:39:56] sukhe: you around for a little while yet?
[17:39:59] yes please
[17:40:32] let's give it a few hours, if the numbers keep rising we probably want to depool wikidough in esams
[17:40:41] and we can fix the conntrack stuff for routed ganeti next week
[17:40:44] it might level off, let's see
[17:40:52] ok please @ me and I can do that (you can as well but it will be late for you)
[18:10:34] dpogorzelski: It seems that ml-staging in codfw is trying to depool too many hosts at once during your sync
[18:11:21] The load balancer is complaining because it's hitting the minimum threshold - it's refusing to depool as many as is requested
[19:58:52] sukhe: those conntrack numbers just keep rising, I think it might be an idea to depool wikidough in esams over the weekend
[20:03:07] topranks: ok let's do it. I can in 10 mins.
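For reference, "no conntrack on ganeti forwarding" could look something like the following minimal nftables sketch (an assumption for illustration: the table/chain names and the 203.0.113.0/24 VM prefix are invented, and the actual Wikimedia puppetized ruleset may differ). The `notrack` statement in a `raw`-priority prerouting hook marks packets so no conntrack entry is created, while the forward chain still filters them statelessly:

```
# Hypothetical fragment: exempt traffic routed to VMs from connection
# tracking on the hypervisor, so each client flow does not consume a
# conntrack entry on the ganeti host.
table inet ganeti_notrack {
    chain pre_raw {
        type filter hook prerouting priority raw; policy accept;
        ip daddr 203.0.113.0/24 notrack   # example VM service prefix
    }
}
```

The VMs can then run their own conntrack-based filtering (or not, as with the `notrack => true` choice quoted earlier) independently of the hypervisor.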
[20:05:08] if you can that's great <3
[20:05:28] no rush at all btw, we're miles away from hitting the limit still
[20:25:40] topranks: all done, depooled in esams
[21:51:03] just checked there and everything looks healthy, increase in conntracks in esams stopped, wikidough working via drmrs for most of EU now instead
[23:01:31] sukhe: seems I had only checked ganeti3007, ganeti3006 is still increasing in conntracks
[23:03:16] I've shut bird on doh3006 to remove the route
[23:12:51] topranks: I should have done that already as puppet was disabled as well
[23:13:00] but thanks, maybe I missed something