[08:29:33] 10Traffic, 10DNS, 10Operations: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4005199 (10Peachey88)
[08:52:42] 10netops, 10Operations: cr1-eqsin faulty interfaces - https://phabricator.wikimedia.org/T187807#4005229 (10ayounsi) Unit shipped with https://www.expeditors.com/ Supposed to arrive in Singapore on the 1st, and clear custom 2 days later, for a final ETA of 03-Mar-2018 20:25:00 SGT. As this is a Saturday 8pm, t...
[10:33:53] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#4005400 (10Vgutierrez) [[ https://gerrit.wikimedia.org/r/414973 |Change 414973 ]] exposes the BGP session state over prometheus and ov...
[12:47:21] re: drop-vs-reject, I wonder why? maybe we could change it to a final rule that rejects for WMF networks and then drops for non-WMF?
[12:47:35] (to get better/quicker failures in-house in general)
[13:01:56] 10netops, 10Operations, 10monitoring, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#4005760 (10elukey) 05Open>03stalled >>! In T181036#3979339, @Nuria wrote: > Are we planing to use tranquility to move the he data into druid...
[13:37:13] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#4005869 (10BBlack)
[13:37:17] 10Traffic, 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#4005867 (10BBlack) 05Open>03declined Gave up on these machines!
[13:37:21] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#2779434 (10BBlack) 05Open>03declined Gave up on these machines!
[13:39:22] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4005877 (10BBlack)
[13:42:51] 10Traffic, 10Operations: Fix lvs1001-6 storage - https://phabricator.wikimedia.org/T136737#4005883 (10BBlack) 05Open>03Resolved a:03BBlack
[13:59:25] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4005921 (10BBlack) The hard part here is mapping out the necessary network ports correctly: * Each of the 4 servers is physically located in a different row (I'm assuming fo...
[14:11:48] ema: when you get a sec to stare at VCL stuff: https://gerrit.wikimedia.org/r/#/c/404158/
[14:14:41] bblack: looking
[14:24:43] root@lvs1006:~# facter -p net_driver
[14:24:44] {"eth0"=>"bnx2", "eth1"=>"bnx2", "eth2"=>"bnx2", "eth3"=>"bnx2", "eth4"=>"bnx2", "eth5"=>"bnx2"}
[14:24:47] ^ handy :)
[14:25:46] :)
[14:27:01] ema: have you started mucking with text@esams yet?
[14:27:15] bblack: nope, I was waiting for swat to finish
[14:27:22] how courteous :)
[14:27:29] hehe
[14:27:43] so the VCL patch seems to do what it promises
[14:27:57] is Krinkle's comment still valid though?
[14:28:05] > This would break the m.wikipedia.org and zero.wikipedia.org entry points because the (currently unreachable) Apache server config and the (equally unreachable) mobilelanding.php file have been broken for at least a year, possibly longer.
[14:28:28] I'm gonna take the plunge on the RPS changes (which should be no-ops, but I'm gonna puppet-disable all the lvs+cp first to be sure).
[14:28:45] no idea re: Krinkle's comment, I didn't see it
[14:28:57] ok I'll ping him then
[14:34:53] bblack: I'd swear I've either seen something like this before, or written something like this before
[14:34:57] but I can't seem to remember
[14:35:00] (net_driver)
[14:35:26] a nearly identical bit of code, in python instead of ruby, exists inside of interface-rps.py :)
[14:36:37] it's unfortunate that network driver interfaces are so non-standard
[14:36:50] it's hard to do anything tricky in this area without being driver/card sensitive, which makes config abstraction nuts
[14:45:04] ema: btw, with the zerofetch issues fixed, we shouldn't in-practice have vcl.discard crashes anymore regardless of the (still useful in the long term) vmod-netmapper upgrade
[14:45:19] ema: so on that front, proceed in whatever order makes life easier.
[14:47:02] bblack: nice
[14:48:06] so we'd currently have 3 upgrades pending: varnish v4->v5, netmapper, retpoline kernel
[14:48:13] yeah
[14:48:18] and numa :)
[14:48:25] and numa heh
[14:48:51] when all else fails, change 4 things at once and see what sticks to the wall! :)
[14:49:03] but yeah
[14:49:20] I was planning on upgrading varnish+netmapper together and leave the rest for later
[14:49:24] the numa one should be fairly painless to cycle through with runtime depools
[14:49:45] did you already do netmapper on the existing v5 hosts?
[14:50:18] only cp5004 and pu
[14:50:30] (well I guess it requires a restart though, maybe better to save it for package install just before the retpoline reboots)
[14:51:45] it=netmapper?
[14:53:00] yeah, it seems reasonable to upgrade it before the retpoline reboots on the whole fleet
[14:53:37] * ema likes to say fleet very much
[14:53:59] aye captain!
[14:54:01] :P
[14:54:11] o>
[14:54:25] so
[14:54:46] what I hate is people saying "field operations" for data center work
[14:54:46] with the vcl.discard crash out of the way (as it was situational and the situation is gone, regardless of the upgrade)
[14:55:30] is it reasonable for us to do cumin-based discards of all cold VCLs with some quick CLI invocation across sets of cps, to clear out the loads of vcl cruft that happens on all these depool->repool cycles around various transitions?
[14:55:54] (since it may be a while until we have a whole new script that works sanely deployed, to do it automagically)
[14:56:37] yes, we need to upgrade netmapper before that though
[14:57:01] or are we 100% on the safe side now because all the netmapper files are in the right place?
[14:57:03] well that was my point. technically we need to upgrade netmapper first, but in practice it now doesn't matter
[14:57:06] right.
[14:57:40] nothing ever deletes the netmapper files on any kind of future failure. the situation only arises when the host has never been able to fetch the files since installation.
[14:57:46] (and now eqsin fetching is fixed)
[14:59:05] +1 then yeah
[15:01:49] (although we should still be wary of other bugs while we get used to it)
[15:03:44] varnishadm vcl.list | awk '/cold/ { print $4 }' | xargs -I xx varnishadm vcl.discard xx
[15:03:57] ^ seems to work on one host I tested :)
[15:04:35] might be handy to do periodically when we know we're generating tons of stale VCL through pooling, once we're comfortable it's not causing issues.
[15:04:48] and what if the VCL id contains "cold"?!?1!
[15:04:55] then someone named things dumbly
[15:05:00] haha :)
[15:05:47] bblack: swat finished, I'll start the v4->v5 upgrades and leave netmapper as is for today then
[15:06:08] it will be 2 Hard Things with One Stone: https://martinfowler.com/bliki/TwoHardThings.html
[15:06:18] you name something poorly and the result is instant cache invalidation through crash :P
[15:07:59] [not really, because varnish refuses to discard the active VCL anyways, and defers discarding inactive warm ones]
[15:08:19] yeah nothing bad happens
[15:09:59] I suspect between the cold VCL problem and the numa_networking problem, a lot of minor memory-pressure-related and perfy issues will evaporate
[15:11:33] I wonder if we can graph IPI and other such numa-related stats, or if prometheus already has them somewhere
[15:11:44] heya FYI, luca and I plan to move upload varnishkafka to kafka jumbo in just a bit: https://gerrit.wikimedia.org/r/#/c/415016/1/hieradata/role/common/cache/upload.yaml
[15:15:15] bblack: somewhat unrelated, there's a way to get per-cgroup metrics in prometheus
[15:15:49] meaning that you can easily distinguish between eg: nginx vs varnish memory usage
[15:16:28] godog and I took a look into that a while back, it should just be a matter of installing a certain package IIRC?
[15:18:40] yeah cadvisor iirc, plus the prometheus configuration
[15:18:57] that one!
[15:19:00] * godog applying skepticism to "just a matter of"s
[15:19:05] :)
[15:19:27] the best one I heard from a vendor was "just a matter of programming"
[15:19:35] but it worked on my machine!
[15:19:53] hehe indeed
[15:23:55] bblack: `curl http://localhost:9100/metrics | grep numa` returns quite a few metrics
[15:24:47] from the meminfo_numa collector apparently
[15:31:07] yeah some of that's from vmstat, and some from /sys/devices/system/node/node*/numastat
[15:31:38] (or at least, those kinds of things tend to be available in those places)
[15:32:02] there's a different level of numa-related stats, but I suspect they can only be seen from e.g. "perf"
[15:32:20] (e.g. QPI utilization, IPI-like things, etc)
[15:32:32] I've been digging for some non-perf way to find those stats, but nothing so far
[15:39:42] anyways, rewinding to the numa_networking thing, because I think I didn't say this out loud, but I did in another conversation yesterday
[15:40:42] with modern kernels and the way our interface-rps stuff spreads IRQs and the nginx listeners on the cache boxes, etc...
[15:41:14] the current situation when you really dig into things is basically nothing like what we'd expect/hope
[15:41:39] the reality of the current situation is this:
[15:42:12] *) eth0 is physically attached to numa node 0. You can set IRQ affinities to go to numa node 1 CPUs, but it's then quite inefficient at getting them there (non-local IRQs).
[15:43:07] *) our standard bnx2x+interface-rps script distributes eth0 IRQs evenly over all physical CPU cores in both numa nodes. So half of them are jumping across numa nodes (cpu dies) over the QPI, inefficiently.
[15:44:06] *) then, coming up from the other (software) side: we have nginx configured to pin processes on every CPU core in the system using SO_REUSEPORT, at the time on the vague hope that wherever a network IRQ happens to land, surely linux would do the smart thing and send it to a cpu-local listening socket.
[15:45:18] *) but what linux actually does is this: after the IRQ arrives on CpuX (from the card hashing/splitting traffic to all the IRQs->cores), the kernel does its own unrelated hash(connection-identifiers) and uses this to pick a new random one of our nginx reuseport listeners to send the traffic to.
[15:45:38] with probability 1/n_cpus that it ends up where we want it to, and 50% odds of re-crossing the numa node barrier *again*
[15:46:45] (that last part I don't think was necessarily always true in older kernels, but it certainly is now)
[15:47:49] all of this is still an improvement on doing nothing (random cpu/irq balancing, which you end up with if there's no tuning and irqbalance enabled or whatever).
[15:47:52] but it's far from ideal.
[15:48:43] "numa_networking: on" hieradata doesn't fix the whole thing. the kernel is still going to re-hash the traffic pseudo-randomly from the cpu core it arrived on to a different cpu core for nginx software pickup.
[15:49:21] but it does confine the whole affair to numa node 0. it stops us configuring eth0 IRQs over on the other numa node, and stops us configuring nginx listening sockets over on the other numa node as well.
[15:52:42] preventing the re-hashing and making linux do what seems (from our naive non-kernel perspective, anyways) to be the most-obviously-correct thing and give the traffic to the cpu-local nginx socket is trickier: we don't have a deployable solution yet, and it will require patches per daemon we care about.
[15:53:04] so with numa_networking on we don't end up on different numa nodes as far as irq/socket action is concerned, just different cores?
[15:53:10] right
[15:53:26] and inter-core stuff on a single numa node isn't nearly as bad as all the cross-numa-node stuff
[15:53:30] with penalties in terms of cache misses instead of memory access?
[15:53:34] it's still a fail on cpu cache hits and stuff
[15:53:46] but at least it's not a cross-the-QPI numafail
[15:54:31] so, with the numafail out of the way, looking at the "try to stay in one cpu core" part of the problem:
[15:56:00] someone made this work back in late 2015: https://patchwork.ozlabs.org/patch/528071/ . This is the setsockopt(SO_INCOMING_CPU) solution. It allows the daemon to tell the kernel which sockets it has pinned to which CPUs (which you'd think the kernel would already know, indirectly :P)
[15:56:13] 10Wikimedia-Apache-configuration, 10Discovery, 10Zero, 10Mobile, and 2 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#4006422 (10Mholloway) Hi @DFoy, As described above in T69015#3761037, the Zero landing page previously accessible via...
[15:56:48] there's even an unmerged nginx patch to use it, too (which might need massaging): https://trac.nginx.org/nginx/attachment/ticket/1437/SO_INCOMING_CPU.patch
[15:57:18] SO_INCOMING_CPU had a couple of minor design issues, though:
[15:58:01] 1) It required the kernel to scan all the total global reuseport copies of the socket to pick one, which isn't cache-ideal either (but still not awful, and they did optimize this part of the problem as well as they could given the technique)
[15:58:52] 2) It can only handle the simplest case people think of with SO_REUSEPORT packet routing, simple CPU affinity. It doesn't cover some other cases that matter for other problems (like SO_REUSEPORT-based socket takeover without loss, etc)
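For context, the SO_INCOMING_CPU approach discussed above is just a plain setsockopt on each reuseport listener. A minimal sketch of the idea follows; it is illustrative only (not the nginx patch above), assumes a kernel/libc new enough to expose SO_INCOMING_CPU on the setsockopt side (roughly Linux 4.4+), and uses a made-up helper name. As noted further down, this path later regressed in practice, so it is mostly of historical interest:

    /* sketch: one SO_REUSEPORT listener per worker, pinned to `cpu`,
     * telling the kernel which CPU this socket "belongs" to.
     * error handling omitted for brevity. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    static int make_pinned_listener(int cpu, unsigned short port)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);   /* pin this worker to `cpu` */

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        /* the interesting part: tell the kernel that connections whose softirq
         * processing happened on `cpu` should land on this reuseport socket */
        setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));

        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons(port);
        bind(fd, (struct sockaddr *)&sa, sizeof(sa));
        listen(fd, 511);
        return fd;
    }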
[15:59:55] so someone else came along and invented SO_ATTACH_REUSEPORT_EBPF: https://patchwork.ozlabs.org/patch/560261/
[16:00:56] this allows you to attach a small BPF program to your set of reuseport sockets, which uses whatever logic you see fit (including simple cpu affinity) to map packets to sockets. It avoids the cache problems from (1) as well, so it's much more efficient anyways.
[16:01:39] but apparently when that patchset (and/or related followups, it's been hard to bisect it down) went into the kernel, they broke the simpler SO_INCOMING_CPU setsockopt so it doesn't work in practice.
[16:02:06] (but it's still documented, and apparently nobody's filed a regression or suggested deprecating it or whatever)
[16:02:56] so now, in practice, that easy option isn't even possible, and we're off in the realm of writing BPF code to get traffic to stay on the same CPU (which seems like such a sane default thing for the kernel to do in the first place, but I guess implementation details and corner cases...)
[16:06:01] that's about as far as I got down this rabbithole. Writing BPF programs to map this, and making such techniques sanely-patchable into nginx, is pretty non-trivial stuff.
[16:06:55] but in any case, for now "numa_networking: on" is already implemented and tested on our end, and solves the greater part of the overall problem.
[16:18:13] cloudflare wrote about similar stuff a few months ago in: https://blog.cloudflare.com/perfect-locality-and-three-epic-systemtap-scripts/
[16:19:08] but they seem to be turning off HT and not caring about NUMA, too
[16:19:26] (thus, they can actually trivially map reuseport socket# to cpu#)
[16:19:53] which makes for a very simple CBPF program:
[16:19:55] https://github.com/cloudflare/cloudflare-blog/blob/master/2017-11-perfect-locality/setcbpf.stp#L18
[16:20:04] (also they set it from systemtap instead of patching nginx)
[16:21:23] if you want to handle numa and HT, the mapping is never that simple. in the general case it requires EBPF and a "map" structure.
[16:21:58] another annoying bit is that while the kernel has this concept of a "reuseport socket group", and each identical reuseport socket has a number within the group (which is what the BPF scripts assign sockets based on)...
[16:22:11] (pybal) willikins:pybal vgutierrez$ git diff --stat origin/master origin/1.14 pybal/bgpfailover.py
[16:22:14] pybal/bgpfailover.py | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------------------------------------------
[16:22:18] 1 file changed, 52 insertions(+), 99 deletions(-)
[16:22:21] (pybal) willikins:pybal vgutierrez$ git diff --stat origin/master origin/1.14 pybal/ipvs.py
[16:22:24] pybal/ipvs.py | 9 +++------
[16:22:26] 1 file changed, 3 insertions(+), 6 deletions(-)
[16:22:29] hmmm what am I missing here?
[16:22:35] it's difficult (but not impossible) to be sure you even know which of your sockets are which reuseport#'s in the group.
[16:23:14] there's no way to query the numbering of the sockets from the kernel.
[16:23:38] the rule is they're numbered in the order they're created, and then if you close any of them and it makes a hole, the last one in the list moves into the hole, changing numbers
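Back on the cloudflare CBPF trick linked a bit further up: the same two-instruction "return the current CPU number" program can also be attached by the daemon itself via SO_ATTACH_REUSEPORT_CBPF, rather than from systemtap. A rough sketch under the same assumptions cloudflare makes (HT off, single NUMA node, so reuseport socket #N lines up with CPU #N); the function name is made up and error handling is omitted:

    /* attach a classic-BPF "socket picker" to a reuseport group:
     * A = current CPU number; return A => deliver to reuseport socket #A.
     * `reuseport_fd` is assumed to be an already-bound SO_REUSEPORT listener;
     * the program only needs to be attached to one member, it then steers
     * incoming connections for the whole group. */
    #include <linux/filter.h>
    #include <sys/socket.h>

    static int attach_cpu_steering(int reuseport_fd)
    {
        struct sock_filter code[] = {
            { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
            { BPF_RET | BPF_A,          0, 0, 0 },
        };
        struct sock_fprog prog = {
            .len    = sizeof(code) / sizeof(code[0]),
            .filter = code,
        };
        return setsockopt(reuseport_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                          &prog, sizeof(prog));
    }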
[16:24:10] considering that reuseport file descriptors can be passed between processes, participate in takeover between daemon restarts, get closed/re-opened as you recycle a process due to max_requests type of parameters, etc...
[16:24:30] actually tracking that you know what order the reuseport numbers are in isn't exactly trivial either.
[16:26:17] (also in the general case you can't know when you've "closed" the file descriptor from the kernel's pov; if you got it from another process via dup/fork/SCM_RIGHTS, when you close your copy and the old process still has its own copy, it's not yet closed and the numbering hasn't changed yet...
[16:26:22] )
[16:28:00] there are clearly design problems in this area on the kernel side looking for creative solutions :P
[16:29:24] how hard should it be to tell the kernel "Look, I've got ethernet IRQs mapped to CPUs 0, 2, 4, 6. I have 4x reuseport copies of a listening socket held by 4x distinct processes, which are pinned to cpus 0, 2, 4, and 6, respectively. Do The Right Thing."
[16:32:11] SO_INCOMING_CPU seems like at least a better interface for userland. it's relatively trivial to explicitly tell the kernel the above things through it.
[16:32:42] maybe a new kernel patch could revive the functionality of the interface, in a way that doesn't disturb the fancy BPF stuff
[16:33:23] the bpf lookup code basically does: if(bpf_installed && bpf_has_valid_answer) { route_via_bpf } else { hash_it_arbitrarily }.
[16:33:37] maybe a new patch could do:
[16:33:37] bblack: ok to reboot cp1008?
[16:34:01] if(bpf_installed && bpf_has_valid_answer) { route_via_bpf } else if(SO_INCOMING_CPU is set) { use_that_info } else { hash_it_arbitrarily }.
[16:34:11] ema: yes
[16:35:08] "if(SO_INCOMING_CPU is set) { use_that_info }" isn't trivial to do efficiently of course, but the right data could be put in the right places to make it so.
[16:37:30] vgutierrez: did you mean why is 1.14 differing from master in general?
[16:37:55] my fault...
[16:38:05] I assumed that master was only a few commits ahead of origin/1.14
[16:38:09] dumb me :)
[16:40:20] * vgutierrez feels very layer 8 right now *sigh*
[16:40:58] it's a good thing! There's no beer at layer 7
[16:41:03] git log --graph --decorate --oneline --all
[16:41:16] (if you have local checkouts of all the relevant branches)
[16:43:26] pybal's graph has "issues" even recently heh
[16:43:46] when looking at the 1.14-vs-master from their divergence at "2b428a5 Add metric pybal_service_depool_threshold" onwards
[16:44:31] maybe I fail to understand the branch/release model
[16:45:54] I mean, obviously 1.14 is cherry-picks to make stable releases
[16:46:06] nooow it works
[16:46:08] * vgutierrez flips table
[16:46:14] pybal_bgp_bgp 1.0
[16:46:15] pybal_bgp_session_established{asn="64496",peer="10.192.16.140"} 1.0
[16:46:16] pybal_bgp_session_state{asn="64496",local_peer="10.192.16.139",remote_peer="10.192.16.140",side="active",state="ESTABLISHED"} 1.0
[16:46:17] pybal_bgp_session_state{asn="64496",local_peer="10.192.16.139",remote_peer="10.192.16.140",side="active",state="OPENCONFIRM"} 0.0
[16:46:20] pybal_bgp_session_state{asn="64496",local_peer="10.192.16.139",remote_peer="10.192.16.140",side="active",state="OPENSENT"} 0.0
[16:46:24] but master also has 1.14.x version numbers in commit msgs that were cherry-picked in whichever direction, but with differing history
[16:46:39] ema: so.. pybal_bgp_bgp is signaling that BGP is configured
[16:46:57] and all the other stuff that we already discussed :D
[16:47:15] vgutierrez: :)
[16:47:36] pybal_bgp_enabled instead of pybal_bgp_bgp, maybe?
[16:47:50] whatever you need <3
[16:48:01] the code will reach nirvana when pybal_bgp_bgp_bgp is implemented
[16:48:27] well.. we already have pybal.bgp.bgp.BGP
[16:48:38] as a python class
[16:48:41] heh
[16:51:46] ema: you can ignore lvs1007-12
[16:52:05] I should just reinstall those to spares now and unconfigure them elsewhere in puppet to reduce confusion
[16:52:10] (well, the ones that are alive)
[16:52:31] bblack: I'm using it to test my new shiny cumin script :)
[16:52:39] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#4006626 (10Vgutierrez) After some very meaningful CR by @ema now 414973 looks like this: ``` pybal_bgp_enabled 1.0 pybal_bgp_session_e...
[16:53:20] heh
[16:56:27] ema: hit me hard on 414973 :)
[16:56:38] let's see if we can deploy the icinga test this week
[16:56:44] s/test/check/g
[17:02:41] bblack: installing the latest libvmod-netmapper as well while upgrading varnish (no alternative, that's the version we have in experimental, heh)
[17:02:48] it works fine, including vcl.discard
[17:09:17] vgutierrez: looks good!
[17:22:55] new shiny cumin script, tested on cp1008/lvs1010: https://gerrit.wikimedia.org/r/#/c/415047/
[19:08:21] re: pybal branches, I've been following the policy of merging bugfixes onto master (when they apply to master, which has always been the case so far) and then cherry-picking them to 1.14
[19:10:59] with `git log master...1.14 --cherry-pick` I don't see anything that shouldn't be there I think
[19:23:31] yeah I think the only thing confusing me is the 1.14.x version numbers in the commitmsgs in master
[19:23:39] if I ignore those, it all makes sense :)
[19:24:31] I'm poking at the vcl reload-related things a bit today, I don't know if I'll get anywhere useful yet
[19:25:04] cool!
[19:25:19] text@esams: 1 host to go
[20:18:06] {{done}}
[20:18:13] I'm off cya!
[20:22:20] cya!
[22:27:43] 10Traffic, 10DNS, 10Operations: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4008362 (10Dzahn) a:03Dzahn
[23:25:06] 10Traffic, 10Operations: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#4008615 (10BBlack)
[23:25:08] 10Traffic, 10Operations, 10Patch-For-Review: VCL discards crash varnish frontend child process - https://phabricator.wikimedia.org/T188089#4008613 (10BBlack) 05Open>03Resolved a:03BBlack