[06:08:50] morning! I filed https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/518764/ to fold the cdh submodule in operations/puppet (first step)
[06:08:58] any review would be great :)
[06:29:19] looking
[06:39:39] thanks! I remember having used this procedure in the past (credits to Joe) but I don't recall if there were side effects
[06:40:11] IIRC the labs puppet master didn't like it, but it wasn't synced via puppet-merge yet
[06:40:41] so deployment-prep's puppet master might break, but it should be easy to solve.. This is the only use case that I have in mind
[06:41:41] yeah, we folded some non-cdh analytics submodules like a year ago or so, right?
[06:43:39] and the analytics cloud vps project probably
[06:43:40] yep, varnishkafka, kafkatee and jmxtrans
[06:44:25] now to complete the job we should do cdh, zookeeper and nginx
[06:44:31] I already opened a task for the first two
[06:46:37] there's also one for nginx
[06:46:44] but blocked by something I forgot
[06:50:34] at the time I created the patches to make it happen but it didn't gather many +1s :)
[06:54:22] actually, there doesn't seem to be one, but when discussion came up the blocker was IIRC https://phabricator.wikimedia.org/T164456
[07:09:55] interesting
[07:09:56] Could not find declared class ::cdh::hadoop at /etc/puppet/modules/profile/manifests/hadoop/common.pp:329:5 on node analytics1059.eqiad.wmnet
[07:10:27] I am wondering if this is due to '::' and the environment/production, or something else
[07:12:19] also pcc didn't tell me any of this
[07:18:14] nope, same error
[07:21:59] at this point it might be something either related to my patch, or to the fact that puppet 5 uses environments/production in a different way?
[07:22:58] ah no, I am stupid
[07:22:59] sigh
[07:23:16] there is no /modules/
[07:31:32] fixed it, it works now
[07:31:42] prepping the last patch to move cdh back to modules
[07:38:59] cdh module finally into operations/puppet!
[07:39:01] * elukey dances
[07:43:18] \o/
[07:44:43] elukey: thanks so much for removing a git submodule
[07:45:59] ah, the productivity boost of not having the whole team typing `git submodule update` several times a week
[07:46:00] volans: will also remove the zookeeper one, and possibly nginx if we have an agreement.. so submodules will be 0 :)
[07:46:02] thank you!!
[07:47:20] <3
[07:48:11] ema: I've a script to update the puppet repo :D
[07:51:34] not only a boost in productivity; until a few years ago even mariadb was a submodule and IIRC at various points it almost caused outages when people forgot to sync the submodule, until Jaime finally removed it
[07:55:25] yeah, like this dumb guy here https://github.com/wikimedia/puppet/commit/28bc35904b358fcb340abf3a514a59017e46c6dc
[07:56:18] whooo never git adds directories but only files because paranoid? that would be me...
[07:56:39] and so what happens to me is that I leave out files in my commits and wonder why they don't work :-/
[09:04:29] jijiki: T225284 thx for making my graphs look nice :)
[09:04:30] T225284: thumbor haproxy trying to send syslog on wrong port - https://phabricator.wikimedia.org/T225284
[09:08:09] XioNoX: :D
[10:53:44] akosiaris: do you have hints on the status of the docker packages in thirdparty/k8s for stretch (and buster)?
[10:55:14] arturo: the stretch part is currently used successfully in production. It's relatively old and we want to upgrade, but are currently blocked on another component and a related bug. Tracked in https://phabricator.wikimedia.org/T216236
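For reference, folding a submodule into the parent repo as discussed above boils down to a handful of git operations. This is only a generic sketch; the paths and commit messages are assumptions, not the contents of the actual Gerrit changes:

    # inside a checkout of operations/puppet, with the submodule assumed to live at modules/cdh
    git submodule deinit -f modules/cdh
    git rm -f modules/cdh                  # removes the gitlink and the .gitmodules entry
    rm -rf .git/modules/modules/cdh        # drop the cached submodule metadata (path may differ)
    git commit -m "Remove the cdh submodule"
    # re-import the module contents as regular files
    cp -r /tmp/cdh-checkout modules/cdh    # a separate checkout of the old submodule repo
    rm -rf modules/cdh/.git
    git add modules/cdh
    git commit -m "Fold cdh into operations/puppet as a regular module"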
[10:55:51] there are no buster packages yet, we haven't even evaluated moving a kubernetes node to buster
[10:56:00] and probably won't for some time. No time :-(
[10:56:08] ok, thanks for the update
[10:56:22] I will use stretch then. Briefly I had the idea of building the new cluster directly in buster
[10:56:38] (the new toolforge cluster, that is)
[10:59:29] this seems to be the version in stretch:
[10:59:32] Package: docker-engine
[10:59:32] Version: 1.12.6-0~debian-jessie
[10:59:52] (which is a bit confusing bc of the `jessie` string in there)
[11:16:09] akosiaris: I'm staring at this comment in profile::kubernetes::node
[11:16:13] # Funnily enough, this is not $kubelet_config cause we still need to support
[11:16:13] # labs which doesn't use/support this. TODO Fix this
[11:16:23] could you help me understand what's going on?
[11:16:34] specifically, what would we need to do?
[11:20:54] * akosiaris looking
[11:24:14] arturo: yeah, IIRC upgrading
[11:24:30] in production we use --kubeconfig
[11:24:46] to configure some aspects of the kubelet. IIRC in 1.4 (toolforge) this is not supported
[11:27:27] ok
[11:27:39] arturo: since you are here, I am fighting with iptables and I am a bit at a conundrum
[11:27:52] https://phabricator.wikimedia.org/T226237
[11:28:00] * arturo reading
[11:28:14] I am wondering why those (correctly generated) icmp redirect packets get dropped
[11:28:28] dropwatch and perf record tell me that it's nf_hook_slow()
[11:28:37] lemme paste that in the task as well
[11:31:34] added some more info
[11:31:59] it's quite clear to me that something in the mangle table (or something right after it) drops the packet, but I can't for the life of me figure out what
[11:33:33] since you don't have rules in the raw or mangle table... there is no actual rule dropping the packets
[11:33:48] I would take a look at conntrack and/or some sysctl configs
[11:34:21] also, could you update https://phabricator.wikimedia.org/P8652 with `iptables-save -c`?
[11:35:51] arturo: sure, done. You ain't gonna like it, it's huge
[11:36:20] I've already looked at conntrack -L / conntrack -E and they don't seem to be at fault
[11:36:40] well, yes, those generated rulesets are usually very ugly
[11:36:56] plus we are talking about locally generated ICMP packets, not sure what kind of connection tracking would apply there. I would expect nothing
[11:37:27] locally generated ICMP redirect packets? I would bet they are generated for a reason
[11:37:41] in response to some other client packet
[11:37:50] oh, I know the reason. They are perfectly validly generated
[11:38:15] it's that after the DNAT done by kubernetes the packet goes out the same interface it came in from
[11:38:26] and the source IP is in the same subnet as the kubernetes noe
[11:38:28] node*
[11:39:11] so it sends an ICMP redirect to the client to inform it that a better gw would be its own default gw. It's kind of an edge case for an icmp redirect but it makes sense due to the DNAT
[11:39:27] the solution exists already. We will disable icmp redirect generation on those hosts
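Disabling redirect generation is a plain kernel tunable; a minimal sketch of what that looks like on a node, assuming the eno1 interface mentioned later and a sysctl.d drop-in for persistence (in practice this would presumably be rolled out via puppet):

    # current values (send_redirects is active if either conf/all or the per-interface flag is set)
    sysctl net.ipv4.conf.all.send_redirects net.ipv4.conf.eno1.send_redirects
    # disable generation both globally and on the interface
    sudo sysctl -w net.ipv4.conf.all.send_redirects=0
    sudo sysctl -w net.ipv4.conf.eno1.send_redirects=0
    # persist across reboots
    echo 'net.ipv4.conf.all.send_redirects = 0' | sudo tee /etc/sysctl.d/70-no-send-redirects.conf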
[11:39:41] but that does not answer the question of why they are dropped
[11:39:45] I suggest you turn on `sysctl net.ipv4.conf.all.log_martians=1`
[11:39:55] ah, lemme try that
[11:41:12] hmm
[11:41:18] the moment I did that I have
[11:41:20] Jun 25 11:40:38 kubernetes2001 kernel: [3111904.687433] IPv4: host 10.192.0.48/if2 ignores redirects for 10.192.64.188 to 10.192.0.1
[11:41:29] there you go :-)
[11:41:29] I read that part of the code at some point
[11:41:47] 913 #ifdef CONFIG_IP_ROUTE_VERBOSE
[11:41:47] 914 if (log_martians &&
[11:41:47] 915 peer->rate_tokens == ip_rt_redirect_number)
[11:41:47] 916 net_warn_ratelimited("host %pI4/if%d ignores redirects for %pI4 to %pI4\n",
[11:41:47] 917 &ip_hdr(skb)->saddr, inet_iif(skb),
[11:41:48] 918 &ip_hdr(skb)->daddr, &gw);
[11:41:49] 919 #endif
[11:41:52] sorry, wrong paste
[11:43:08] https://github.com/torvalds/linux/blob/master/net/ipv4/route.c#L924
[11:43:10] there we go
[11:43:33] hm, so it rate limits?
[11:43:42] probably yes
[11:43:45] take a look at this https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
[11:43:59] specifically icmp_ratemask and icmp_ratemask
[11:44:06] sorry, icmp_ratelimit*
[11:44:24] and icmp_msgs_per_sec
[11:44:32] I already took a look at those. redirects are not in the mask
[11:45:05] and we haven't messed with the defaults anyway
[11:45:15] we definitely don't send 1k icmps per sec
[11:45:40] ok
[11:46:07] but lemme try to set icmp_ratelimit to 0
[11:46:21] we had a similar weird situation in our neutron setup due to the NAT setup: https://ral-arturo.org/2019/03/21/neutron-nat-martian.html
[11:46:27] but if it was ratelimiting, wouldn't I see something in the tcpdumps?
[11:46:43] just a sample of the total # of pkts, but still something
[11:46:52] tcpdump won't show the kernel doing ratelimiting I think
[11:47:13] well, the packets that are below the threshold will be shown though, right?
[11:47:31] does the kernel drop all ICMP redirect packets?
[11:48:03] I did generate some with icmpush locally and I could see them in tcpdump
[11:48:14] so no... it just drops those it generates itself I guess?
[11:49:07] setting net.ipv4.icmp_ratelimit=0 changed nothing
[11:51:26] reading the source code in the kernel from the link you pasted, it seems the issue is the client is ignoring the redirect, so the kernel inhibits itself from sending more (given they are ignored)
[11:54:19] well... it's not really sending them after all though
[11:54:39] clearly that part of the code is reached, but since the redirect never really exits the node, the client never receives it
[11:54:58] which just adds to the confusion, ofc
[11:55:36] did you run a tcpdump on the client looking for the icmp redirect packet?
[11:56:04] I mean, are you really sure the client does not recv the icmp redirect packet?
[11:56:44] yes. I am sure the packet never leaves the node and is never received by the client
[11:56:59] wait, the CR may be filtering it!
[11:57:51] CR?
[11:57:56] core router
[11:58:20] but I don't see anything relevant in the config at first glance
[11:58:36] the junipers? doubtful. We are talking about hosts in the exact same ethernet segment
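One way to settle the "does it ever hit the wire" question is to capture ICMP redirects on both ends at the same time; a sketch, with the interface names and client host being assumptions based on the addresses above:

    # on kubernetes2001: watch for outgoing ICMP redirects (type 5)
    sudo tcpdump -ni eno1 'icmp[icmptype] == icmp-redirect'
    # on the client (e.g. mw2222, whatever its interface is called): watch for redirects arriving from the node
    sudo tcpdump -ni eno1 'icmp[icmptype] == icmp-redirect and src host 10.192.0.11'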
[11:59:33] plus, I am pretty sure the packet never even reaches the ethernet cable
[11:59:35] they are not
[11:59:45] host 10.192.0.48/if2 ignores redirects for 10.192.64.188 to 10.192.0.1
[11:59:58] 10.192.0.48 is vlan 2017 (private1-a-codfw)
[12:00:10] the sending node however is 10.192.0.11/22
[12:00:38] and 10.192.64.188 is a kubernetes codfw pod IP
[12:00:50] ok, let me understand this better
[12:00:58] 10.192.64.188 is a pod IP indeed. The one the original packet gets DNATted to
[12:01:06] the packet flow is
[12:02:18] mw22XX => 10.2.1.42:31192 => LVS servers => kubernetes200X => DNAT randomly to a pod IP e.g. 10.192.64.188:8192
[12:02:55] right after the DNAT, the kubernetes node realizes that the packet is now to exit out the same interface the packet came in through
[12:03:06] and that's when it generates the icmp redirect
[12:03:32] the original TCP packet is delivered fine and everything works fine
[12:03:59] it's just that we have a (valid, again) ICMP redirect packet generation that gets discarded and I don't know why
[12:04:37] I would trust the dmesg message: it gets discarded because apparently the client is ignoring it
[12:05:10] all of them? we are talking about ~1 icmp redirect per sec
[12:05:45] per https://grafana.wikimedia.org/d/PRA2F67Zz/t226237?orgId=1 that is
[12:10:27] I'm out of ideas right now. We could schedule a full debug session, in which I actually jump to the involved hosts and try stuff
[12:11:24] I assume LVS is using direct return, right?
[12:11:33] yes
[12:12:58] arturo: sure, it's kubernetes2001, feel free. puppet is disabled, calico is disabled, kube-proxy is disabled
[12:13:29] and a quick way to view the stats is sudo sar -n EIP,EICMP 5
[12:13:45] under odisc/s you will see the discards
[12:14:01] and under oredir/s the outgoing icmp redirects
[12:14:15] those 2 rates match fully, always!
[12:14:31] I have many questions right now. why does the packet go out using the same interface, and why is that wrong or requiring a redirect at all?
[12:18:06] how do you reproduce the tcp flow?
[12:18:29] it goes out the same interface because of the DNAT. The original packet was for the LVS IP, but due to the kubernetes probabilistic load balancing to pods internally the destination address gets rewritten to the pod IP
[12:18:43] and it can be ANY pod in the fleet, including pods NOT present on the node
[12:19:03] this is what happens in this case
[12:19:06] what is the LVS IP BTW?
[12:19:54] the one that seems to reproduce most of the traffic is 10.2.1.42
[12:20:00] but it's not limited to that one
[12:20:18] it's just the easiest to observe the issue with
[12:20:35] ok
[12:21:57] note that this host runs 4.9.0-9-amd64
[12:22:22] we might as well try something newer at some point to at least rule out some fixed regression
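The "exits out the same interface it came in on" condition can be checked directly against the node's routing table, using the client and pod addresses from the dmesg line above (the eno1 interface name is an assumption):

    # on kubernetes2001: simulate forwarding the post-DNAT packet
    # (client 10.192.0.48, destination rewritten to pod IP 10.192.64.188, arriving on eno1)
    sudo ip route get 10.192.64.188 from 10.192.0.48 iif eno1
    # if the result goes via the default gw (10.192.0.1) back out eno1, the in- and out-interfaces match
    # and the source is on-link, which is exactly the situation where the kernel emits an ICMP redirect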
[12:24:20] do you have a way to trigger the redirect by hand?
[12:25:06] ie, a wget command or something?
[12:25:07] akosiaris@mw2222:~$ curl http://eventgate-analytics.svc.codfw.wmnet:31192/_info
[12:25:26] although this can be better
[12:25:28] hmm
[12:25:29] gimme a sec
[12:28:25] yup, done
[12:28:30] so, I've fully depooled that host
[12:29:03] so things like for i in `seq 1 1000` ; do curl http://kubernetes2001.codfw.wmnet:31192/_info; done
[12:29:35] from say mw2222 (same rack row) would generate those redirects
[12:30:47] only one ignored redirect message generated so far
[12:31:48] (dmesg message, that is)
[12:50:33] akosiaris: did you consider just disabling send_redirects for eno1 in kubernetes2001?
[12:51:15] in this case, the icmp redirect does not make sense. The client is not using kubernetes2001 as gateway
[12:51:28] arturo: yup. It would solve the issue. And I am gonna do that anyway at the end of this (it's the actual solution to the problem). But I can't help wanting to find out why the redirects get discarded
[12:51:42] it's more a quest for knowledge right now than anything else
[13:01:40] mm I agree, interesting situation
[13:02:37] I need to pay attention to my own stuff though :-P please ping me if you find anything else
[13:02:50] * arturo closes 10 browser tabs with kernel source code
[13:03:07] BTW akosiaris this is way better to navigate kernel source code: https://elixir.bootlin.com/linux/v4.9.74/source/net/ipv4/ip_output.c#L1185
[13:03:12] (rather than github)
[13:08:25] arturo: true, I just never remember it whenever I just want a paste
[13:17:46] I have a question
[13:17:53] how long ago was it that we used svn?
[13:18:44] last time I used SVN was in 2008 I think XD
[13:18:52] "Custom checks can be found on the private svn repository under ops/nagios-checks"
[13:35:46] <_joe_> cdanis: 2012 I think
[13:36:02] <_joe_> that's how far ops/puppet git history goes IIRC
[13:49:23] why is it that whenever I want to make documentation edits on wikitech, there's always some yak-shaving :)
[13:49:42] https://wikitech.wikimedia.org/w/index.php?title=Template:Gitweb&diff=prev&oldid=1830408
[13:51:38] cdanis: https://www.mediawiki.org/wiki/Special:Code/MediaWiki
[13:51:59] oh my
[13:55:00] :-) :-)
[14:05:56] if someone wants to kill a few minutes at a time, I am happy to do reviews for https://phabricator.wikimedia.org/T226508
[14:18:09] sorry all for that page. icinga puppet lagged behind the nodes.
[14:18:15] was not expecting a page from it
[14:35:44] how hard would it be to get icinga-wm to include some special annotation on criticals that paged?
[14:39:53] an IRC highlight-word would probably be faster and more reliable than SMS ;)
[14:42:31] <_joe_> cdanis: SMS has several advantages over IRC
[14:42:36] <_joe_> but we can have both
[14:42:42] I was certainly advocating both
[14:57:39] cdanis: for the common case pretty simple, for the general one more complex because icinga-wm notifies in different IRC channels and for some groups might be paging while not for SRE and vice-versa
[14:58:34] we have things to figure out there in the general case as-is ;)
[14:59:46] so basically the quick hack would be something like this:
[15:00:12] 1) add a new contact irc-sms or whatever, and add it to the sms contactgroup
[15:00:42] 2) define a new notify-host-by-irc-sms and same for service that use the special syntax and use that as notification command for the new contact
[15:01:24] 3) for the paging checks remove the admins group (that is added by default) when sms is set, to avoid duplicates
[15:02:13] this *should* work for all pages to the sms group. Then special cases for subgroups might need additional tweaks
[15:06:00] (the order would actually be 2-1-3, given that 1 is in the private repo and needs 2)
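A very rough sketch of what steps 1 and 2 of that quick hack could look like as Icinga object definitions, written here as a shell heredoc. The file path, the irc-page-echo helper and all directive values are hypothetical, not the real configuration (the actual contact/contactgroup definitions live in the private repo):

    # hypothetical sketch only -- not the actual WMF icinga configuration
    cat <<'EOF' >> ./icinga/objects/irc_sms.cfg
    define command {
        command_name  notify-service-by-irc-sms
        command_line  /usr/local/bin/irc-page-echo "PAGE: $NOTIFICATIONTYPE$ $SERVICEDESC$ on $HOSTNAME$ is $SERVICESTATE$"
    }
    define contact {
        contact_name                   irc-sms
        contactgroups                  sms
        host_notification_commands     notify-host-by-irc-sms
        service_notification_commands  notify-service-by-irc-sms
        # notification periods/options omitted in this sketch
    }
    EOF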
[15:27:23] akosiaris: apologies for pointing you to that T226237 rabbit hole
[15:27:23] T226237: Investigate outgoing discarded packets in the codfw kubernetes cluster - https://phabricator.wikimedia.org/T226237
[15:29:11] XioNoX: oh no, don't be. It's been a nice rabbit hole. I am learning in the process. I've found my way out of it relatively quickly, I just want to figure it out now
[15:29:25] but I can back out at any time
[15:29:57] "but I can back out at any time"
[15:30:03] that's what addicts say, no? :)
[15:32:35] if anyone is able to review a labs/private repo change it would be appreciated: https://gerrit.wikimedia.org/r/c/labs/private/+/519051/
[15:32:46] plus the script I used to prepare the patch https://gerrit.wikimedia.org/r/c/labs/private/+/519050/
[15:32:48] XioNoX: I think so.
[15:33:20] at least my addiction has a clear goal. Figure out why on earth those kernel generated icmp redirects get discarded
[15:34:14] (I'm currently doing 300km/h in that train, and wifi works decently)
[15:34:30] that sounds like a spanish train
[15:34:40] akosiaris: I'm following the task with a lot of attention
[15:37:13] XioNoX: fwiw, the way out is just disabling icmp redirect generation. Kernel tunable, really easy to do
[15:37:38] yeah, shouldn't be needed anyway
[15:37:53] yeah, it's wrong to have them enabled at the end of the day in this situation
[15:38:09] but still... pretty interesting why they are being discarded
[15:38:26] I'd like to try a kernel upgrade just to rule that out
[15:39:54] so the 2 questions are 1/ what is that SRC=10.192.0.11 DST=10.192.0.48 traffic, and 2/ why does the icmp disappear
[15:39:57] ?
[15:41:22] 1/ is easy to answer
[15:41:42] so, what happens is that clients (e.g. mw2222) try to talk to services on the kubernetes clusters
[15:41:49] and use LVS IPs for that
[15:42:02] up to here it's everything as you know it
[15:44:24] XioNoX: did you reach out to Job?
[15:44:41] but the TCP packets reach the kubernetes nodes, the kubernetes probabilistic load balancing takes over and rewrites (DNAT) the destination address from the LVS IP (e.g. 10.2.1.42) to one of the pod IPs. It does this right before checking whether it should route the packet or not. It must route the packet now, but there are 2 possible outcomes. Either a) the pod is local to the node or b) the pod resides on another node. In b) the packet now must come out the same interface it came in from. This triggers a lookup of the source IP, and if it is in the same subnet as the node then an ICMP redirect is triggered per the RFCs
[15:45:20] the gw of course of the icmp redirect being the default gw of the node, which is the junipers essentially
[15:45:47] jbond42: paravoid did
[15:45:51] 2/ I am still searching for the answer
[15:45:58] ahh ok
[15:46:42] akosiaris: ok, I understand
[15:47:50] akosiaris: I would assume that once the packet enters the k8s land, it would travel using the overlay network, i.e., not go back to the actual physical vlan using eno1 again
[15:48:13] but perhaps that depends on the setup (flannel, calico, etc...)
[16:00:00] arturo: there is not overlay network
[16:00:08] s/not/no/
[16:00:21] at least in production where we use calico. In toolforge IIRC flannel is used
[16:58:17] yes, and we aim to use calico as well in the next toolforge k8s version
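The DNAT step described above is visible on a node in kube-proxy's nat chains; a sketch of how to inspect it, with the service IP and NodePort taken from the flow above (the exact KUBE-SVC-* chain name is a placeholder to copy from the first command's output):

    # find the kube-proxy rules matching the LVS/service IP or the NodePort
    sudo iptables -t nat -S | grep -e 10.2.1.42 -e 31192
    # the matching KUBE-SVC-* chain fans out to per-endpoint KUBE-SEP-* chains with
    # "-m statistic --mode random --probability ..." rules; the KUBE-SEP-* chains hold the DNAT to the pod IPs
    sudo iptables -t nat -S KUBE-SVC-XXXXXXXXXXXXXXXX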
[18:45:43] when was the last time we had Solaris machines?
[18:46:07] I'm severely tempted to just erase about half the content on Icinga @ wikitech
[18:46:19] um
[18:46:41] when was the last time solaris existed as a going thing
[18:47:07] before 2013
[18:47:50] I'd have to really dig around to find a precise date...
[18:48:01] also a little curious as to when we last had a machine named 'spence'
[18:48:07] but not so curious that anyone should invest time in it
[18:48:43] pmtpa, so whenever that went out of service
[18:49:57] https://wikitech.wikimedia.org/wiki/Survey.wikimedia.org there's a lot of ancient history still on there, whew
[18:54:45] Toolserver had solaris servers pretty late IIRC
[18:57:05] but that was before wmf took over the project
[18:58:03] we never administered those
[18:59:04] there's a lot of ancient history on wikitech, and only some of it is tagged with {{outdated}} or in namespace Obsolete:
[19:01:13] unfortunately true
[19:01:29] bug 1 will reign forever
[19:06:35] I know this is not the first bug, but it is ironic: https://phabricator.wikimedia.org/T1
[19:07:07] hahaha that's awesome
[19:07:22] 2014. sigh
[19:07:30] do you remember the number of the first bug from bugzilla?
[19:07:46] there was a fixed offset, I think 20,000?
[19:08:04] kind of a missed opportunity that we didn't keep the numbering the same :(
[19:08:23] (and renumbered the pilot phab bugs instead, for example)
[19:08:25] I remember seeing it, people wanted to use it
[19:08:38] and then, not worth it / not practical
[19:09:27] oh, it still works: bugzilla.wikimedia.org/1
[19:09:42] this was the one apergos meant: https://phabricator.wikimedia.org/T2001
[19:09:56] that's the one
[19:10:00] the 1, rather ;-)
[19:10:29] the description is the best
[19:11:25] "Migrated from sf.net bugtracker"
[19:11:35] so that even predates bugzilla
[19:12:04] ah, 2,000, not 20,000, ok
[19:12:08] close enough :P
[19:12:12] lol
[19:12:19] order of magnitude, so ballpark :-D
[19:12:31] that should be documented better
[19:12:36] but T2001
[19:12:37] T2001: [DO NOT USE] Documentation is out of date, incomplete (tracking) [superseded by #Documentation] - https://phabricator.wikimedia.org/T2001
[19:15:49] and yet somehow it's still referenced :-)