[09:13:39] 10Traffic, 10netops, 06Operations: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3285703 (10ayounsi) a:03ayounsi [10:22:01] bblack: "Every class and classful qdisc requires a unique identifier within the traffic control structure." [10:22:08] (from http://linux-ip.net/articles/Traffic-Control-HOWTO/components.html#c-handle) [10:22:36] tc-fq is classless, so that would perhaps explain why we don't see unique handles there but 0: [10:30:49] so for the record a poor-man way to set e.g. limit=42 would be: [10:30:51] tc -s qdisc show dev eth0 | awk 'NR > 1 && /qdisc/ { print "handle " $3 " parent " $5 }' | while read h ; do tc qdisc replace dev eth0 $h fq limit 42; done [10:32:22] after doing that though, resetting with `tc qdisc del dev eth0 root` doesn't work: [10:32:26] RTNETLINK answers: No such file or directory [10:32:56] while `tc qdisc replace dev eth0 root mq` does [10:46:50] maybe a clean way to do that in interface-rps could be using netlink in the script [10:47:13] I think your loop probably ended up replacing the root with fq? I donno [10:47:36] NR > 1 should take care of skipping root [10:49:06] ok [10:49:33] well anyways, rather than "replace", the little shellscript from bbr-dev starts out with deleting the root and re-creates everything manually [10:50:20] it relies on a priori knowledge of the count of hardware transmit queues in $NBQ, but interface-rps already knows that info [10:50:28] right [10:51:30] so probably we could give interface-rps a string argument specifying a custom per-tx-queue qdisc+params, like: -s "fq limit 20000 flow_limit 200" or whatever [10:51:47] if the arg is present, it can use the tx queue count and do the same as that shell loop [10:54:06] but would that mean deleting root and recreating everything at every puppet run? [10:54:20] no, I don't think we re-run interface-rps on every run do we? [10:55:35] yeah we don't [10:55:54] it's suppose to add it as an interface-up command, and then trigger it during the puppet run only when augeas first adds it [10:56:19] ok [10:56:20] (which is also kinda wrong, it should refresh on script changes too :P) [10:57:18] (also, the $rss_pattern stuff indicating whether RSS is in use is dated as well... it now auto-detects the RSS pattern if not supplied, and I think we still supply it in most cases, but it would auto-detect all of those cases anyways) [10:57:38] so maybe some pre-refactoring of this junk to clean up historical cruft is in order heh [10:58:14] whatever we end up puppetizing for fq parameters should perhaps also override manual tweaks? [10:58:48] it would if we start with deletion [10:58:55] I think [11:00:38] relatedly: the BBR puppetization as it stands today requires manually flipping the qdisc to fq prior to turning on BBR from puppet (but works fine post-reboot) [11:01:33] maybe we can have it do that automatically for the common case using $interface_primary, but also have an option to disable the BBR class messing with queues if it's been handled elsewhere (e.g. interface-rps), or something like that. [11:01:50] s/queues/qdiscs/ [11:03:48] re: setting fq parameters in interface-rps, I'm not sure it's the best approach. interface-rps is something to run when the iface is brought up and never again in the common case AFAIU [11:04:08] whereas fq tuning is something we might want to do at runtime [11:06:17] fq tuning at runtime should probably be rare, we'd settle on the right values and puppetize them? 
I'd think both the fq setup and RPS setup belong running from iface-up, and also from puppet on change [11:06:23] they should both be idempotent anyways [11:06:37] but right now the RPS setup doesn't even fix itself on script update heh [11:07:31] oh yeah ok if we make the RPS setup pick up changes on puppet run (changes to the script itself or fq params) then it sounds good :) [11:08:11] the only thing I worry about a little, is the current RPS stuff is idempotent (it just sets sysctls to values they're already set to) [11:08:42] but if we just port in google's shellscript loop for mq+fq, it's going to delete->recreate the qdisc stuff [11:09:06] which is idempotent in its results, but might be disruptive to traffic on a deploy of an updated rps script that didn't change that part [11:09:42] but then again, we're only likely to update the script for more fq tuning anyways, the rest has been stable for a long while [11:10:32] it would add a lot of complexity to try to compare the current "tc qdisc show" and re-run that part only if the output looks "wrong" [11:15:03] heh of course in the current model, the fq params would be CLI args to the interface-rps invocation, which wouldn't be an easy thing to trigger on for refreshing [11:15:06] hmmm [11:16:13] yeah I donno what the right model is anymore [11:16:19] :) [11:16:29] unless we write another script that checks if params x y z are what we've currently got in tc qdisc show (or netlink equivalent) [11:16:46] but there's clearly some historically-outstanding fixups to interface-rps and its puppetization anyways, let me get those out of the way first [11:16:59] ok [11:38:09] heh, now I know why I didn't fix the interface-rps command back when rss pattern auto-detection went in [11:38:31] augeas would end up creating the command twice, because it checks uniqueness on the whole command. so deleting an argument is problematic. [11:38:35] in interface::up_command [11:38:52] (ditto for adding a new arg for fq) [11:40:55] but it's only the LVSes that have the manual pattern argument, and I can fix it up afterwards with sed [11:41:35] but if we're going the route of interface-rps handling setting mq sub-queues, should probably move to a config file sort of solution right off... [11:42:17] or we can leave interface-rps mostly as it is, and do a separate script for the qdisc stuff [11:42:44] but it would also need a configfile and augeas into the interface up commands, etc [11:42:57] and knowledge of the tx queue count, which interface-rps already has code to find [11:43:15] blerg :P [11:44:13] we could turn interface-rps into a python module for reuse, most of it seems like it could be useful in other scripts too rx/tx queue count, finding out non-HT core count and such [11:44:50] s/too/too:/ :) [11:45:52] * ema needs food, bbl [12:49:34] 10Traffic, 10netops, 06Operations: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3286077 (10ayounsi) >>! In T165614#3271670, @BBlack wrote: > 1. Probably the reason for a lack of neighbors is that some (most?) of the switches don't blanket-enable LLDP for all ports. They explicitly list ce...
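For reference, a rough sketch of the delete-and-recreate approach discussed above (this is not the actual bbr-dev script; the interface name, the queue-count detection and the fq parameter values are illustrative assumptions):

```
# hypothetical sketch: rebuild an mq root with one fq child per hardware tx queue
IFACE=eth0
NBQ=$(ls -d /sys/class/net/$IFACE/queues/tx-* | wc -l)
tc qdisc del dev $IFACE root 2>/dev/null || true
tc qdisc add dev $IFACE root handle 1: mq
for i in $(seq 1 "$NBQ"); do
    # mq class minor numbers are hexadecimal: 1:1 .. 1:N
    tc qdisc add dev $IFACE parent 1:$(printf '%x' "$i") fq limit 10000 flow_limit 200
done
```

Whether run from an ifup hook or by hand, the loop is destructive (existing qdiscs and their counters are torn down), which is exactly the deploy-time disruption concern raised above.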
[12:53:33] ema: seems the bgp/ip unit tests don't run by default [12:53:47] * mark has no idea how that shit works and doesn't feel like figuring it out now [13:47:17] 10netops, 06Operations, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3286251 (10ayounsi) I went ahead and ignored those hosts for this specific alert (using T133852#3251556 ) Please reopen the task if needed t... [13:47:27] 10netops, 06Operations, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3286252 (10ayounsi) 05Open>03Resolved [13:50:03] 10netops, 06Operations: JSNMP flood of errors across multiple switches - https://phabricator.wikimedia.org/T83898#3286267 (10ayounsi) 05Open>03Resolved a:03ayounsi That's not happening anymore. [13:52:04] mark: looks like they do run https://integration.wikimedia.org/ci/job/tox-jessie/18396/consoleText [13:52:12] ah, maybe not in coveralls [13:52:19] oh ok [13:52:27] the command you put in the README didn't work either [13:52:34] i just submitted patch sets for some more [13:52:41] was useful for py3 conversion [13:52:46] nice :) [13:59:57] unfortunately it's kind of ugly to make them work for both py2 and py3 [14:00:17] the bytes alias for str in python 2 is not really the same as in py3 :P [14:05:12] mark: out of curiosity, how did you catch the v6 padding bug? [14:05:27] with the unit testing [14:05:30] :) [14:05:36] ha! [14:06:15] they all LGTM, I was about to comment that you should add a test to verify that the bug is fixed, but that was added in the next patchset [14:06:22] yeah [14:06:32] really i rebased it so the fix would go first [14:06:39] and there would be no tests failing [14:10:35] ema: I have a little more cleanup coming, but I've already run the python script with test fq config and it works, running through puppet compiler now [14:11:35] bblack: great! I'm going through the patches now [14:11:57] the last cleanup bit is the irqbalance crap, I'll probably amend it into the early cleanup patches somewhere [14:12:22] (we still have irqbalance::disable, but I think no systems ship it or install it anymore. I'm not even sure why that class doesn't fail, as it references a non-existent service to disable it...) [14:13:18] we have a single system running irqbalance IIRC [14:13:32] well as long as it's not also running interface::rps :) [14:13:42] I think outside of caches+lvs, the only RPS users is some labs dns thing [14:13:57] irqbalance was updated in one of the last jessie point releases and when I upgraded it, certainly no lvs/cp host :-) [14:14:14] 10netops, 06Operations: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288#3286326 (10ayounsi) 05Open>03Resolved From Init7: ``` Update: 2017.05.17 09:00 (CEST) First link to AMS-IX has been enabled. Update: 2017.05.18 09:40 (CEST) Second link to AMS-IX has been enab... [14:14:56] what I really don't get right now, is on the LVS hosts there's definitely no irqbalance package installed or service existing, yet irqbalance::disable is applied on them and doesn't fail, and contains: [14:15:05] service { 'irqbalance': [14:15:05] ensure => stopped, [14:15:05] enable => false, [14:15:05] } [14:15:22] I would've thought that would fail in puppet. maybe it's smart enough to not care about a missing service if stopped+disabled? 
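As a side note on the irqbalance puzzle above, a quick hand-check to confirm the package, unit and process really are absent on a host (illustrative; plain Debian tooling, nothing site-specific):

```
dpkg -l irqbalance 2>/dev/null | grep '^ii' || echo "irqbalance package not installed"
systemctl status irqbalance --no-pager 2>&1 | head -n 3
pgrep -a irqbalance || echo "no irqbalance process running"
```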
[14:16:35] (or dumb enough, if this combination of factors leads to it just checking the process list for "irqbalance" and ensuring some initscript is rm'd) [14:16:51] :) [14:16:53] > By setting the ensure service property to stopped (or false) puppet will check for the presence of the service on each run and stop it when ever it encounters it running. [14:17:07] so I guess if it's not running puppet is happy? [14:18:37] bblack: so the main reason to add a config file to interface-rps is to make augeas happy right? [14:18:52] well it's two things: [14:19:26] 1) Avoid running into augeas duplicating the command on future parameter changes (again... I'll have to manually fixup LVS hosts with cumin+sed or something after this deploy) [14:19:53] 2) Giving us a way to do a subscribe to re-run interface-rps.py on parameter changes [14:20:10] oh right [14:20:31] it still has the fault that it's going to destroy and re-create the qdisc setup on any parameter change, but the only parameters we would reasonably change at this point are qdisc parameters, so that problem can wait until it's a real problem [14:20:45] sgtm [14:21:41] so basically I want to push through deploying those 5 patches, then go back to my test hosts and manually turn on fq+bbr and watch how limit/flow_limit look, and I'll manually edit the interface-rps config to tune and find the right values, then go puppetize those into the config file, etc... [14:22:59] I'm really not all that sure about the whole fq-tuning thing. what reasonable queue lengths are for our situation, or whether a reasonably small amount of overall or per-flow drops are acceptable/normal, etc... [14:23:10] I don't want to end up creating some unecessary buffer bloat on the send side by over-tuning [14:23:28] (but I think the bbr pacing would avoid it actually bloating... I think) [14:24:03] note BTW that as godog mentioned a few days ago there's gonna be no fq required for BBR in newer kernels [14:24:10] probably some googling on mq+fq+10GbE (or at least fq+10GbE) tuning is in order [14:24:41] heh yeah [14:25:03] we could dump all of this complicated bullshit and just wait for a new kernel too, it's an option [14:25:25] I never looked at the patch, did they replace it with some kind of internal fq, or just the pacing part without fairness? [14:25:30] and I guess that using the more recent kernel on cache hosts isn't an option right? [14:25:35] moritzm would probably shoot us [14:26:01] https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=218af599fa635b107cfe10acf3249c4dfe5e4123 [14:26:16] we could/should separately ask ourselves the question: is fair queueing something we wanted anyways, regardless of bbr? [14:26:31] I know at one point I had a ticket open about investigating setting up some better-than-default scheduling [14:27:10] queueing fairly sounds appealing (not letting heavy users starve other users' flows in our transmit queues) [14:27:28] but then again if you assume we're not actually saturating anything on our end, which we shouldn't be, blind pfifo_fast is more efficient [14:27:30] the beauty of the current 4.9 kernel is that we're actually able to use the Debian shipped kernels instead of rolling our own, I'd like to keep that until we migrate to the next LTS early next year (4.14 or so) :-) [14:27:35] (which is the default) [14:31:12] ema: so in the patch-notes comparison... 
it doesn't sound like the server host perf diff is much to worry about, it's a small tradeoff in context-switches vs interrupts, and a very tiny difference in sys% cpu [14:31:23] 10netops, 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3286368 (10faidon) OK, I see the prompt in the console: ``` CentOS release 6.9 (Final) Kernel 2.6.32-696.1.1.el6.x86_64 on an x86_64 us-qas-as14907.anchors.atlas.ripe.net login: ``` We don't have the... [14:31:56] ema: so the tradeoff for us is avoiding the complexity/risks/changes of having to turn on fq to get bbr. so I guess our decision there hinges on whether fq was a good idea for us independently of BBR or not [14:32:41] ema: (but then also if we choose the non-fq path, we're apparently waiting several months for a new kernel, or making a lot of work for moritzm) [14:35:47] yeah [14:36:28] bblack: fq seems interesting regardless of the congestion algo to me [14:40:54] I'm still mentally on the fence about it. I was accepting it mostly because bbr needed it. [14:41:11] the idea of fair queueing seems nice, but the potential downsides are: [14:41:30] 1) We're doing more work to transmit packets instead of just shoving them out the door efficiently [14:42:35] 2) Currently we don't significantly drop any outbound packets locally (they all fit in the hardware buffers at our full rates, etc), yet fq at default tuning is dropping packets at two different levels. Is there a sane way to tune it? Should those packets be dropped? [14:43:04] I think 2 raises a lot of questions I don't have solid answers for, I just don't know this stuff well enough [14:44:22] it's entirely possible that fq at default tuning is a net-negative for user perf due to the drops, and that tuning it correctly for the high-end case (e.g. an upload cache, especially one that needs to handle a large faction of global load, like eqiad with esams depooled) leaves very bloaty buffers for the low-end case, and so we have to get into tweaking them on a case-by-case basis, etc... [14:45:37] + all this BS above about trying to puppetize the fq enable + tuning (and then eventually also handling the BBR + non-interface-rps case, for other hosts that want to try BBR but aren't the caches) [14:46:25] those should be easy though right? No mq, just root qdisc replace [14:46:46] yeah I mean puppetizing that properly is messy to handle both cases in some bbr puppet-class [14:46:50] but yeah it's work [14:47:24] to switch between having the bbr class turn on fq vs assuming interface-rps did it, and ensuring either way that it happens before the congestion control sysctl is modified at runtime [14:49:38] to put another spin on it: [14:51:02] all fq can control the fairness of is the server's own transmit queues. if under the default/stupid pfifo_fast we're not dropping packets anyways today, then there's really no functional reason to worry about one flow starving another. We could worry about stuff further downstream from the server host itself on the network, but BBR is taking care of buffering/fairness issues there for us. 
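For the simple non-mq case mentioned above ("just root qdisc replace"), the manual flip on a test host would look roughly like this (a sketch; eth0 and the revert values are assumptions, and as noted the qdisc change has to happen before the congestion-control sysctl is switched):

```
# hand-testing fq + BBR where per-tx-queue setup isn't needed
tc qdisc replace dev eth0 root fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
# backing out: restore the congestion control and the previous root qdisc,
# e.g. cubic and mq/pfifo_fast depending on what the host had before
#   sysctl -w net.ipv4.tcp_congestion_control=cubic
#   tc qdisc replace dev eth0 root mq
```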
[14:52:11] and if we don't want to drop packets in the normal case, apparently we have to tune fq's limits up higher than default until we stop dropping packets artificially, at which point once again fq isn't really doing much for us anymore [14:52:45] the counter to that spin is: we might have corner cases where it saves us, because some abusive user actually does occasionally try to pull something from us at gigabits/sec and starves other users, but it's rare? [14:59:38] with per-flow stats we can find that out I guess [14:59:52] per-queue [15:00:38] meeting, bbiab [15:00:42] ok [15:00:59] man tc-fq also claims this: [15:01:01] TCP pacing is good for flows having idle times, as the congestion window permits TCP [15:01:04] stack to queue a possibly large number of packets. This removes the 'slow start [15:01:07] after idle' choice, badly hitting large BDP flows and applications delivering chunks [15:01:10] of data such as video streams. [15:06:10] mark: meanwhile I've added this on top of your tests https://gerrit.wikimedia.org/r/#/c/355229/ [15:06:34] yes I was looking at that as well [15:06:35] thanks [15:06:42] will review later [15:22:52] ema: do we track dropped now? [15:22:58] (without per-queue, but per-if or whatever) [15:23:36] but e.g. eth0 on cp1074 says this for tx: TX packets:128518571491 errors:0 dropped:0 overruns:0 carrier:0 [15:24:05] 26 days, 119 billion packets, no drops [15:24:13] err 128 billion, whatever [15:26:16] mq+pfifo_fast logs queue-level drops as well, checking that with cumin [15:27:08] so out of 100 cache nodes, 46 have zero drops across all the queues [15:27:25] and for one of the worst examples of one that had non-zero drops: [15:27:32] (1) cp3036.esams.wmnet [15:27:36] ----- OUTPUT of 'tc -s qdisc show... -v "dropped 0,"' ----- Sent 380639650772341 bytes 1588659862 pkt (dropped 2827, overlimits 0 requeues 884596674) [15:27:42] Sent148859336800603 bytes 364369180 pkt (dropped 449, overlimits 0 requeues 38059365) Sent 14707674871012 bytes 2710231782 pkt (dropped 138, overlimits 0 requeues 34635592) Sent 17414685009921 bytes 446445297 pkt (dropped 620, overlimits 0 requeues 40694174) [15:27:47] Sent 1515578666978 bytes 3057587499 pkt (dropped 1434,4 overlimits 0 requeues 32643896) Sent 24332788985953 bytes 1546007635 pkt (dropped 186, overlimits 0 requeues 43486754) [15:27:54] basically 5/16 queues there are showing non-zero per-queue drops [15:29:13] which according to the packet counters means cp3036 has dropped 0.0001% of all outbound packets over it's ~40 days of uptime [15:29:15] bblack: we don't, the qdisc patch has been merged upstream today. If godog agrees we need to build new node_exporter packages and deploy them :) [15:29:45] so I'd say tx drops aren't a pragmatic issue for us presently [15:30:00] ema: yep WFM, I can assist with it [15:31:45] bblack: and what happened yesterday during fq testing? Have you seen drops going up significantly or was it mostly requeue/overlimits? [15:32:36] so this is an example, 1/N of the fq subqueues on cp1074 with live traffic after a fairly short period of time (<1hr I think): [15:32:39] qdisc fq 0: parent 8001:9 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 [15:32:42] Sent 24069852229 bytes 17997768 pkt (dropped 2997, overlimits 0 requeues 2) [15:32:45] backlog 2908b 2p requeues 2 [15:32:48] 10Traffic, 10netops, 06Operations: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3286536 (10ayounsi) LLDP added to all the interfaces in asw-ulsfo/eqiad. 
Already configured in codfw and esams. Which solves the first part of the issue above (minus the devices where lldp crashes). About the... [15:32:48] 2047 flows (2046 inactive, 1 throttled), next packet delay 51566855 ns [15:32:51] 510511 gc, 3 highprio, 3544030 throttled, 2997 flows_plimit [15:32:53] 446 too long pkts, 0 alloc errors [15:33:37] so it was dropping, in that short term, ~0.016% of outbound packets, due to flows_plimit mostly it seems [15:34:37] it's not much, but it's a couple more orders of magnitude than the worst case we see in esams in the long term with pfifo_fast [15:35:19] (some of the fqs were better than that, some were worse, it varies per queue) [15:35:27] is flow_plimit the same as flow_limit? [15:35:47] yeah [15:36:31] in the upload case, that's what's notable. the top-level "dropped" for an fq seems to match flows_plimit [15:37:05] even cp1065 (text) did it a little, but not much. most queues were dropped=0, but a few like: [15:37:08] Sent 10056273232 bytes 9404327 pkt (dropped 37, overlimits 0 requeues 6) [15:37:15] backlog 7297b 4p requeues 6 [15:37:15] 2296 flows (2294 inactive, 2 throttled), next packet delay 2848760 ns [15:37:18] 727356 gc, 2 highprio, 1379949 throttled, 37 flows_plimit [15:37:20] 788 too long pkts, 0 alloc errors [15:37:54] so, it seems like we don't hit the "limit 10000" parameter, but we do hit the "flow_limit 100" parameter [15:37:57] ? [15:38:22] so my understanding is that when limit reaches 10k packets are dropped, when flow_limit reaches 100 they get requeued [15:38:40] no, flow_limit drops too [15:38:49] oh [15:38:50] (which is why the overall queue dropped == the flows_plimit count) [15:39:16] if (unlikely(f->qlen >= q->flow_plimit && f != &q->internal)) { [15:39:21] q->stat_flows_plimit++; [15:39:22] return qdisc_drop(skb, sch, to_free); [15:39:22] ^ in kernel's sch_fq.c code [15:41:37] interesting, sch_fq doesn't requeue, sch_mq does [15:42:23] I don't think sch_mq does either, it just serves the purpose of managing the split hardware queues? [15:42:53] if you look at mq_dump() where it reports its stats, it just sums the stats of the fqs within it [15:42:57] yeah [15:43:25] sch_generic then [15:43:34] oh it subclasses it? [15:44:22] lol, worthy of a long paste :) [15:44:25] under all circumstances. It is difficult to invent anything faster or [15:44:28] cheaper. [15:44:31] */ [15:44:33] static int noop_enqueue(struct sk_buff *skb, struct Qdisc *qdisc, [15:44:36] struct sk_buff **to_free) [15:44:38] { [15:44:41] __qdisc_drop(skb, to_free); [15:44:43] /* "NOOP" scheduler: the best scheduler, recommended for all interfaces [15:44:47] return NET_XMIT_CN; [15:44:49] } [15:44:51] bah it missed the first line of the comment [15:45:28] /* "NOOP" scheduler: the best scheduler, recommended for all interfaces [15:45:38] so, my thoughts on sch_fq and the flows_plimit drops we're seeing: [15:46:50] They should happen when our side transmits a large burst of packets in a single tcp flow towards a client. Given they were happening with BBR turned on, we can reasonably assume BBR had a decent bandwidth estimate for the client, so it's not a blind and unreasonable burst... [15:46:57] ema: heh, of course python3 doesn't have basestr... [15:47:18] so these must be clients with a high enough BDP to actually accept that kind of burst [15:47:19] but yeah shooting for compatibility is futile anyway [15:48:11] if we want to allow those high-BDP clients to use up the pipe without dropping packets under fq, we'd have to raise flow_limit (to what?
we'll have to experiment until flows_plimit value drops off to some acceptably-small value, but we'll probably never reach zero in the long term) [15:48:42] but if we raise flow_limit, we may have to raise the overall "limit" too, to avoid dropping at that level? [15:48:52] (which isn't happening so far, AFAICS) [15:51:06] we could start by checking which one of limit and flow_limit is responsible for the drops with systemtap [15:51:18] (after bumping flow_limit) [15:51:20] well we know flow_limit is in the stats above [15:51:31] so the other curious thing [15:51:47] well nevermind, I answered my own question [15:51:52] what was it? :) [15:51:58] but moving on, there's ~2K flows per queue in eqiad [15:52:27] so the 10K limit budget is being shared by 2K flows = really only about 5 packets each if they were all trying to queue up [15:53:02] yet the point-in-time backlog in each queue at the time I ran the command was in the single-digits and probably representative [15:53:12] so I think normally, most of the time, the queueing is pretty minimal [15:53:21] it's just these occasional burst to high-BDP clients [15:53:35] (which end up dropping only a tiny percentage of overall packets) [15:53:50] so maybe we can get away with raising flow_limit a little for them and not raising limit, but stats would tell eventually [15:55:16] flow_limit 100 means only something like ~128KB of app data in flight in that TCP flow in our outbound buffer, it's not much [15:55:58] there's no rational way to translate that into a BDP limit, but it is effectively some kind of BDP limit when combined with "the rate at which the card dequeues to the wire" [15:59:52] oh my question earlier was I was wondering if there was a bug, because coincidentally in the paste of cp1074 queue stats, every fq had exactly either 2047 or 2048 flows at that moment [15:59:58] I thought maybe we had hit some limit [16:00:10] but cp1065 was showing ~2.2K/queue, so I guess it was just chance [16:01:36] there's also the "maxrate" param, which sets a pacing cap. that's an option too [16:01:58] but it sounds like if you set that param (default unlimited), it just ignores pacing inputs that are >maxrate [16:02:05] that might confuse BBR [16:03:40] off-topic slightly, but another thing I noticed moving between cp1074 + cp4021 is the bnx2x queue limits are different, and I don't know why, but it's probably a hardware difference [16:04:32] cp4021+cp1074 both have in modprobe config: options bnx2x num_queues=24 [16:04:44] cp1074 actually gets 24 hardware queues from that, but cp4021 only gets 15 [16:05:32] what's the reason for overriding num_queues? The default should be the number of CPUs [16:05:38] maybe because it's a slightly different 10GbE card, or it could be due to the change from R630 -> R430, maybe different PCI lane count or whatever which imposes a limitation [16:07:35] ema: I think earlier versions of the driver may not have automatically set to #CPUs, and possibly later versions that did were ignorant of hyperthreading (and so would create 2x IRQs/queues for 1x physical processor core) [16:08:01] both the hosts above have 24 physical cores and 48 logical (HT) [16:08:13] ha, the param description doesn't sound like English: "Set number of queues (default is as a number of CPUs)" [16:10:04] (and interface-rps is smart enough to only map the queue IRQs to 1/2 of the logical cores that belong to a physical core) [16:10:18] godog kindly built a new version of node_exporter. Should we deploy it on cp1074? 
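A throwaway way to summarize those per-queue numbers on a single host, until the qdisc collector is deployed everywhere (illustrative; it assumes eth0 and the `tc -s` output layout shown in the pastes above):

```
# sum sent packets and drops across all fq sub-queues and print the drop rate
tc -s qdisc show dev eth0 | awk '
    /^qdisc fq / { infq = 1; next }
    infq && / Sent / { pkts += $4; drops += $7; infq = 0 }
    END { printf "pkts=%d dropped=%d (%.6f%%)\n", pkts, drops, pkts ? 100 * drops / pkts : 0 }'
```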
[16:10:37] yeah you can, it will give us graphs [16:11:21] you'll need to tweak the list of enabled collectors from puppet though IIRC, since qdisc isn't enabled by default but should be easy enough [16:11:27] in the meantime, let me push through all the interface-rps cleanup / fq stuff I guess. it doesn't actually turn on any fq params yet, just enables support [16:11:36] or stop puppet and change the list manually [16:11:54] godog: yeah, I was going to do the latter first [16:12:35] and then we can manually (well, manually with /etc/interface-rps.d/eth0 + re-run the script) tune fq a bit on cp1074 and see if we can reduce the drops sanely [16:12:44] with bbr on again of course [16:13:08] and then circle back to whether we puppetize that and try to turn it on everywhere, or step back and avoid fq [16:26:32] godog: with my awesome grafana-fu I came up with this https://grafana-admin.wikimedia.org/dashboard/db/qdisc-stats?orgId=1 [16:26:58] how do you add a dropdown to choose the host(s) again? [16:30:47] hehe hitting the cog then templating [16:31:11] easier to copy the setting from an existing dashboard like varnish-machine-stats tho [16:34:18] ok it's stepping through puppet agent + sed cleanup on lvses now (tested already) [16:34:47] once we're past that in a few minutes, will start in on cp1074 fq+bbr testing again, but with tuned flow_limit and such [16:35:04] nice [16:35:49] bblack: I've updated node_exporter on cp1074, the gotcha is that the qdisc collector isn't enabled in puppet so I've added it by hand to /etc/default/prometheus-node-exporter with puppet disabled [16:37:04] ok [16:42:01] well the lvs run is taking forever because I stupidly used "-b 1" forgetting that puppet agent is so slow. moving on to cp1074 while that finishes up... [16:49:01] bblack: I've gotta go now, let me know where to pick up the work tomorrow :) [16:49:05] o/ [16:49:14] ok [16:53:45] the graph is picking up all zeros, but I can see it manually in tc output anyways [16:54:27] right now I'm going to let cp1074 go for a while (30+ minutes), using bbr+fq with default fq params, just for another comparison point [16:54:34] then maybe trying bumping flow_limit and see how it goes [17:05:43] 10netops, 06Operations, 10ops-codfw: ores200[1-9] switch port configuration - https://phabricator.wikimedia.org/T166156#3286863 (10Papaul) [17:17:37] oh I was wrong about the graph, it's just the units/times differences was throwing me off [17:17:42] it is logging the drops, in x/sec [17:18:19] so while we had some signifiant bumpies of requeues in the few minutes of data before fq, we had no dropped [17:19:18] it seemed like with just fq and no bbr there were fewer drops (but still a handful), they increased a little under BBR so far [17:19:30] (I guess that's to be expected all things considered) [17:19:48] so far they don't look as bad as yesterday (but meh different day/time), but they're still there [17:39:17] ok trying 200 to see if the drops drop off [17:40:04] !log fq on cp1074 reset to flow_limit 200 (resets counters) [17:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:21] also interesting, there was a small-but-notable rate of 503s on cache_upload that started about 10 minutes after turning on BBR, and ended a few minutes after bumping the flow_limit to 200 [17:56:20] it hit all sites, and technically it could be unrelated [17:56:36] but I tend to think that perhaps the fast flows hitting the plimit are the inter-cache flows in these cases [17:56:48] as opposed to some 
users with insanely good connections [17:58:41] bblack@oxygen:~$ grep 'upload\.wiki' /srv/log/webrequest/5xx.json |egrep '2017-05-23T(16:5[0-9]|17:)'|grep -w 503|wc -l [17:58:44] 16207 [17:58:46] and indeed, that seems to be the case [17:58:49] bblack@oxygen:~$ grep 'upload\.wiki' /srv/log/webrequest/5xx.json |egrep '2017-05-23T(16:5[0-9]|17:)'|grep -w 503|grep cp1074|wc -l [17:58:52] 15890 [17:59:05] so it's a pretty strong hypothesis that the flow_plimit stuff is all about inter-cache flows, not users [18:24:45] so I think, given that eqiad's at the backend of all things currently, it has the most inter-cache to worry about. if 200 works there, it should work everywhere [18:24:54] next mystery, I'm back on the magic 2048 thing [18:25:25] cp1074 has 24 hardware queues, and thus now 24x fq queues (that are active. there's another 48x that are other CoS bands we don't actually use, so those queues stay empty and that's ok) [18:26:15] the number of flows currently track in each of the 24 queues ranges from 2046 - 2051 [18:26:25] most of them being 2047 or 2048 [18:26:47] it's just unsettling to see an obvious power-of-two number there, when we expect something a little more natural... [18:27:11] also 24x2048 = 49152, which sounds eerily a lot like "limit on ephemeral ports" or something in that ballpark [18:28:18] also, it's the ethernet card flow-hashing magic that puts them in these queues (hashing src+dst ip+port), and you'd expect that to not work out quite so flawlessly where the flow split's range of min/max is <1% [18:29:07] random hypothesis not fully chased down: [18:30:33] 1) something to do with fq's "buckets" param. buckets defaults to 1024. it's the count of buckets in a hashtable for looking up flows. each bucket's supposed to contain a redblack tree for resolving collisions. maybe there's a bug here nobody else hits much, or maybe an efficiency issue that ends up creating a soft limit for us here? maybe raising the value changes something? [18:31:32] 2) ephemeral port and/or time_wait socket limitations something something something [18:32:06] 3) Somehow the fair queueing itself, combined with the queue limitations, is artificially capping how many successful flows can be in flight at once [18:32:41] 4) maybe the stat is just flat out wrong. it's based on some statistical input that's flawed with 1k hash buckets and as many conns as we have in flight [18:33:24] I donno, have some looking to do here... [18:34:39] what makes it seem especially strange is I observed this same near-2048x24 behavior last time too on the same host [18:34:53] but countering the strangeness of it all, cp1065 had more like 2.2k [18:43:22] !log resetting cp1074 queues again: "fq flow_limit 200 buckets 4096" [18:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:06] ok that made a big difference [18:46:27] so with the default "buckets 1024", all the queues eventually got to: [18:46:30] 2047 flows (2044 inactive, 3 throttled) [18:46:39] (or something +/- a little from that) [18:46:57] at "buckets 4096" it ramped in until they limited at: [18:47:00] 8192 flows (8189 inactive, 3 throttled) [18:47:35] so it seems like somewhere near buckets*2 flows, there's a cap on how many flows it will even track [18:48:24] that most of them are "inactive" is interesting. I guess that's fairly normal. even as they were growing through the lower ranges on the way to ~8K, inactive count tracked fairly closely with the total flow count [18:50:47] time to look at sch_fq.c again... 
[19:00:30] if (q->flows >= (2U << q->fq_trees_log) && [19:00:30] q->inactive_flows > q->flows/2) [19:00:32] fq_gc(q, root, sk); [19:01:07] which means: [19:01:57] if (flows > 2*buckets && inactive_flows > flows/2) { kill a flow; } [19:04:20] * dequeue() : serves flows in Round Robin [19:04:20] * Note : When a flow becomes empty, we do not immediately remove it from [19:04:24] * rb trees, for performance reasons (its expected to send additional packets, [19:04:27] * or SLAB cache will reuse socket for another flow) [19:04:44] so "inactive" just means the queue for a given socket is currently empty, at which point there's no real active accounting for it, I guess [19:06:42] but surely if we're rotating them out like this too fast, we're losing track of the rate information for pacing, you'd think [19:07:31] !log resetting cp1074 queues again: "fq flow_limit 200 buckets 10240" [19:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:24] yeah I think it fills up any buckets limit you give it, and it's "normal" for most to be inactive rather than throttled [19:17:10] oh and duh it's rounded up to ^2, so my "10240" actually became "16384" [19:17:37] it's taking them quite a while to fill that, should be ~32K limit and they're up to ~20K now [19:18:11] but it would seem they'll eventually fill any count of buckets within reason [19:18:23] the flow (and inactive flow) counts never go down, only up [19:18:47] they must just count as inactive until they're reused by a fresher conn [19:27:37] actual establish-connections count peaks out in esams upload at ~50K or so per node [19:29:03] if we assume a worst-case, which is doubling up esams conn count (depooled DCs + a little growth, at peak time), doubling again for 12 nodes (esams) -> 6 nodes (ulsfo), that's ~200K established conns [19:29:32] and we assume the 15-queue-limited bnx2x cards, that's ~13K connections per queue at that point [19:30:20] it seems reasonable to try to size things to be in that ballpark. So 8K buckets = ~16K "inactive" (unthrottled) conns kept afloat [19:31:05] 10Traffic, 06Operations, 10fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3287257 (10Jgreen) @BBlack ok I upgraded nginx and *ssl, and civicrm and the other frack-hosted sites should be fixed to include the HSTS header... [19:31:09] I think I'd feel fairly confident then that "buckets 8192" is enough overkill that we're not recycling inactive connections so fast that we're losing the ability to accurate pace their rate. [19:31:58] (not that I even understand half of how this works, for all I know losing established conns off the inactive list never harms the ability to control rate, but better safe since I don't know, and it's just some wasted mem) [19:34:59] still not seen any dropped packets since raising flow_limit to 200 [20:34:43] 10netops, 06Operations, 10ops-codfw: ores200[1-9] switch port configuration - https://phabricator.wikimedia.org/T166156#3287401 (10RobH) 05Open>03Resolved a:03RobH Done! 
[20:42:57] 10netops, 06Operations, 10fundraising-tech-ops, 10ops-eqiad: rack/setup/wire/deploy msw2-c1-eqiad - https://phabricator.wikimedia.org/T166171#3287418 (10RobH) [21:05:14] back on the whole fq -vs- internal pacing (4.13) stuff, note the commentary change in the patch for it: [21:05:17] - * NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing enabled, [21:05:21] - * since pacing is integral to the BBR design and implementation. [21:05:24] - * BBR without pacing would not function properly, and may incur unnecessary [21:05:26] - * high packet loss rates. [21:05:29] + * NOTE: BBR might be used with the fq qdisc ("man tc-fq") with pacing enabled, [21:05:32] + * otherwise TCP stack falls back to an internal pacing using one high [21:05:35] + * resolution timer per TCP socket and may use more resources. [21:05:52] hrtimers + "more resources" doesn't sound pleasant [21:06:35] so far navtiming data is too fuzzy to tell (it seems to always be), but maybe after a day or three we can draw out some min/max/avg data compares vs a week ago [21:22:58] checking flows_plimit / dropped via cumin, with things enabled everywhere we do still occasionally see it even at flow_limit 200 [21:23:27] roughly 1/5 cache hosts show flows_plimit drops at all, and when they do it's usually only 1/N queues that they happened on [21:23:35] meaning it's probably limited to a few specific clients that hashed there [21:24:43] I think I'll chase this a little further and try 300, see if we can at least make it much rarer [22:25:48] at 300, zero drops in the first hour or so
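One last illustrative check for a change like that: confirm every fq sub-queue is actually running the intended parameters (buckets is reported rounded up to a power of two), by collapsing the per-queue config lines:

```
# count how many sub-queues share each parameter set (parent handle stripped)
tc qdisc show dev eth0 | grep '^qdisc fq ' | sed 's/parent [^ ]* //' | sort | uniq -c
```

If the replace hit every queue uniformly, this prints a single line with a count equal to the number of hardware tx queues.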