[13:07:01] Traffic, Analytics, Operations, Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2495413 (BBlack) @Nuria - Thanks, sounds awesome :)
[14:14:10] In the new nginx there seems to be a big change - "Change: now the accept_mutex directive is turned off by default."
[14:14:37] ah they use EPOLLEXCLUSIVE
[14:15:17] and there is also "Bugfix: socket leak when using HTTP/2."
[14:21:44] elukey: we had accept_mutex off in our config anyways
[14:26:01] didn't know that!
[14:26:33] well theoretically with the new version we'll avoid waking up all the processes for epoll events, right?
[14:27:25] it seems good though, both httpd and nginx finally dropped the accept-mutex by default
[14:34:03] elukey: I haven't looked into the details of anything new. but historically accept_mutex:on made things more CPU-efficient (because only one worker process was waking up to accept(), based on rotating which one holds the mutex), but turning it off makes things more efficient/low-latency for clients (because we have many processes all waiting together, so we can rapidly accept many connections without
[14:34:09] waiting for a mutex handoff to rotate around between cpus/procs for every request).
[14:35:36] at a certain higher expected level of incoming connections, and with CPU power provisioned in excess of real load, turning off the mutex and wasting a little CPU makes nginx more responsive to lots of parallelism in incoming connections.
[14:36:28] I hadn't thought about the pros of waking up all the processes, it makes sense
[14:37:02] thanks :)
[14:38:13] elukey: there's a lot of subtle tuning in our cache clusters, too. and to be honest, not all of it has really been thoroughly investigated as to whether and how efficient it all is and whether we could do it even better.
[14:38:42] for lack of research time, a lot of things are "this *looks* like a good efficient idea for us, so let's turn it on and see if things improve or at least don't regress" and then we move on :)
[14:39:25] a lot of things we guess about, but haven't spent a lot of time validating how it really works at the low level. it probably changes between kernel upgrades sometimes too, and in the end it's only worth so much of our time to chase minor efficiency issues unless they're a real problem.
[14:40:39] the RSS/RPS stuff comes to mind, and has some integration with nginx accept_mutex and how many nginx procs we run and how we bind them to CPUs
[14:40:55] sure I agree 100%, I am only super curious about these low level details :)
[14:41:42] (being curious is completely different from actually getting anything out of it, but that's another topic)
[14:41:59] RSS/RPS is tuned such that for all the incoming packets on the network card, they're hashed (e.g. on the IP tuple of srcip:srcport+dstip:dstport) so the same TCP connection would land in the same hashslot, and the hashslots are divided up to route each hashslot's flow of packets over a different hardware interrupt
[14:42:24] and we define one such hardware interrupt per physical (not HT) CPU core
[14:42:52] (HT core count is always double physical core count)
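A rough sketch of the flow-hashing idea described above, just to make it concrete. This is not the NIC's actual hardware hash (real cards typically use a Toeplitz hash); the queue count and the CRC32 hash here are illustrative assumptions only. The point is that hashing the connection 4-tuple keeps every packet of a given TCP connection on one RX queue, and therefore on one IRQ and one CPU core:

    # Illustrative sketch, not the real NIC hash: shows how hashing the
    # srcip:srcport+dstip:dstport tuple pins a flow to one RX queue / IRQ / CPU.
    import zlib

    NUM_QUEUES = 16  # assumption: one RX queue per physical core on a 16-core box

    def rx_queue_for(src_ip, src_port, dst_ip, dst_port):
        key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
        return zlib.crc32(key) % NUM_QUEUES

    # every packet of a given TCP connection hashes to the same queue,
    # so its interrupts always land on the same CPU core
    print(rx_queue_for("198.51.100.7", 54321, "208.80.154.224", 443))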
[14:43:33] and then on the nginx end where accept_mutex is off, we spawn 2x processes per physical CPU core, and that pair is pinned to only run on a certain pair of logical cpu cores (the HT pair corresponding to one physical core).
[14:43:40] so they can both run on either of the logical cores
[14:44:18] and then we basically hope that linux does smart-ish things in these cases
[14:44:59] that it tends to wake an nginx acceptor on the same core the interrupt for the traffic landed on, since there are available unblocked listeners on all CPUs, you'd think some algorithm somewhere would prefer to wake a local listener rather than one on another core.
[14:45:38] in which case we avoid bouncing too much stuff between cores, and mostly the traffic is divided up for good cpu core parallelism instead of things being randomly scheduled and bounced around between cores/caches.
[14:46:05] the hashing to interrupts happens down in the network card
[14:47:42] wow
[14:47:49] on a cache node you can see some of this via: "grep eth0 /proc/interrupts"
[14:48:22] you'll see a diagonal line in the output, it's showing all the hardware interrupt lines (IRQs) eth0 has, and how each one only hits a certain CPU core
[14:48:37] (the columns are cpu cores)
[14:48:51] niceeee
[14:48:56] root@cp1065:~# cat /etc/modprobe.d/rps.conf
[14:48:56] # This file is managed by Puppet!
[14:48:56] options bnx2x num_queues=16
[14:49:19] ^ we puppetize num_queues based on the physical core count. that tells bnx2x (the ethernet driver) to configure the card for one irq for each cpu core
[14:50:19] and then this script does the mapping of things:
[14:50:53] https://github.com/wikimedia/operations-puppet/blob/production/modules/interface/files/interface-rps.py
[14:51:09] it maps the IRQs to CPUs with all the linux kernel stuff
[14:51:30] https://www.kernel.org/doc/Documentation/networking/scaling.txt is linked in the top comments of that script, and explains the kernel stuff
[14:53:10] and then starting here is the part of our nginx config that pins 2x nginx processes to 2x virtual cpu cores (HT), which are "siblings", meaning they share a physical core. but the two are free to both bounce between those logical sibling cores:
[14:53:14] https://github.com/wikimedia/operations-puppet/blob/production/modules/tlsproxy/templates/nginx.conf.erb#L10
[14:53:58] so on cp1065 (16 physical cores, 32 logical), the output of that template fragment looks like this in /etc/nginx/nginx.conf:
[14:54:01] worker_processes 32;
[14:54:03] worker_cpu_affinity 00000000000000010000000000000001 00000000000000010000000000000001 00000000000000100000000000000010 00000000000000100000000000000010 00000000000001000000000000000100 00000000000001000000000000000100 00000000000010000000000000001000 00000000000010000000000000001000 00000000000100000000000000010000 00000000000100000000000000010000 00000000001000000000000000100000 0000000000100000
[14:54:10] 0000000000100000 00000000010000000000000001000000 00000000010000000000000001000000 00000000100000000000000010000000 00000000100000000000000010000000 00000001000000000000000100000000 00000001000000000000000100000000 00000010000000000000001000000000 00000010000000000000001000000000 00000100000000000000010000000000 00000100000000000000010000000000 00001000000000000000100000000000 0000100000000000000
[14:54:15] 0100000000000 00010000000000000001000000000000 00010000000000000001000000000000 00100000000000000010000000000000 00100000000000000010000000000000 01000000000000000100000000000000 01000000000000000100000000000000 10000000000000001000000000000000 10000000000000001000000000000000;
[14:54:27] 32 total procs. the first two are mapped to logical cpus 0+16 (which are both physical cpu 0), etc...
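To make that mask blob a bit less opaque, here is a small Python sketch that reproduces the same worker_cpu_affinity layout for a box shaped like cp1065 (16 physical / 32 logical cores, where logical N and N+16 are the HT siblings of physical core N, and the rightmost bit of each mask is CPU 0). It is an illustration of the pattern, not the actual ERB template logic:

    # Sketch of the affinity pattern above (not the real nginx.conf.erb code):
    # two worker processes per physical core, each pinned to the HT sibling
    # pair (logical CPU <core> and <core> + phys_cores).

    def worker_cpu_affinity(phys_cores):
        logical = phys_cores * 2
        masks = []
        for core in range(phys_cores):
            bits = ['0'] * logical
            bits[logical - 1 - core] = '1'                 # logical CPU <core>
            bits[logical - 1 - (core + phys_cores)] = '1'  # its HT sibling
            mask = ''.join(bits)
            masks += [mask, mask]  # 2 nginx workers share this sibling pair
        return masks

    masks = worker_cpu_affinity(16)
    print(f"worker_processes {len(masks)};")
    print("worker_cpu_affinity " + " ".join(masks) + ";")

For 16 physical cores this prints 32 masks, the first two with bits 0 and 16 set, the next two with bits 1 and 17 set, and so on, matching the cp1065 config above.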
[14:55:40] this is really great, thanks for the links! I'll try to make a graph (maybe with draw.io) out of this
[14:57:51] cool :)
[14:59:00] as I was prefacing, I've never dug into it too deeply. it would be awesome if we found time to do so. We might find this is inefficient and we could tweak it better. all of the above is sorta "this looks right, and it looks like we're giving smart code in the card, driver, kernel, and nginx the opportunity to do good things" but we really don't know that it's optimally good.
[14:59:29] but our cache nodes do seem to operate pretty efficiently in practice
[15:00:44] yes, this could be either already super tuned, or there could be something that triggers an extra gear in the whole machinery
[15:01:03] right
[15:01:55] maybe we're missing one more setting to turn on to make it work optimally. or maybe we really need 4 processes per physical core and that keeps some bursts of incoming traffic on a given IRQ from spilling over to another core unnecessarily.
[15:02:07] or who knows what. it would take some low-level digging on live traffic to find out.
[15:02:38] maybe it's all completely wrong and everything would be more efficient if we left everything at defaults and spawned 200 randomly-scheduled nginx workers! :)
[15:02:45] ahahaha
[15:02:51] (but I doubt it, some thought did go into it)
[15:03:24] without the spread-IRQs part, we did have interrupts all excessively scheduled on a single CPU, sometimes saturating a core's capacity, etc.
[15:03:57] but most of this tuning was looked at more deeply initially on the LVS servers (which also have all the parts of this except the nginx part). then it was kinda copied over to the caches and the nginx cpu pinning tacked on top.
[15:08:47] makes sense, I am visualizing it as packet highways
[15:24:00] netops, Operations, ops-eqiad: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2495881 (faidon) We installed the new FPC on cr2-eqiad today — it's now up and online, all of its 32 10G ports.
[16:16:25] Traffic, Operations: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#2496020 (ema)
[16:16:40] Traffic, Operations: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#2496032 (ema) p:Triage>Normal
[16:19:49] bblack, Snorri_: I've opened T141373 after chatting with bblack about it. Feel free to fix the task description if I misunderstood something
[16:19:49] T141373: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373
[17:17:35] ema: I'll look through my data. Maybe I can find something regarding the first. I'm pretty sure it works as intended (having Age be transitive)
[20:18:45] ema: bblack: hello. The latest funny thing I came up with is to build the .deb package from Jenkins when a patch is proposed in Gerrit :D
[20:18:56] currently building something for varnish4.git repo https://integration.wikimedia.org/ci/job/debian-glue/252/console
[20:57:04] better (only 3 piuparts issues) https://integration.wikimedia.org/ci/job/debian-glue/253/ which might be because it is not built against jessie-wikimedia but just jessie
[21:15:44] nice :)
[21:47:06] Traffic, Analytics, Operations, Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2497082 (ellery) @BBlack, @Nuria In order to run a randomized controlled experiment, you need to ensure that users are randomly assigned to treatment conditions at the start...
[21:51:03] so I'm still kinda digging through this odd TLS stats mystery that's cropped up recently
[21:51:26] since ~ July 20th or so (+/- a day as it ramped in), we've seen a notable shift in cipher stats
[21:52:23] it's some of the mid-grade stuff (not the best/latest TLS stuff, not ancient/crappy either), which had been stable or slowly declining over time, and has suddenly started ticking back up
[21:52:43] the things ticking upwards are:
[21:53:25] TLSv1.1 (before it was a little under 1% vs dominant TLSv1.2 or 1.0, now it's ~8%)
[21:54:00] ecdhe-ecdsa-aes128-sha comes up from ~7% to ~12% ish
[21:54:19] which we historically associate with: IE 7-10 on Vista+, Safari 5-6, Android 4.0-4.3, Java 7
[21:55:06] also ecdhe-rsa-aes128-sha has popped up a little bit
[21:55:17] (but is much smaller to begin with. still, similar relative effect)
[21:56:41] there's the whole thing with people in china smashing iphones around that date. if they all switched to android 4.x devices from iOS 9 devices.... ? but it seems crazy that enough iphones were killed in that to show up so clearly on our graphs.
[21:57:26] sampling some raw varnishlog data (for very short windows and totally statistically invalid): of current TLSv1.1 reqs, a whole lot of them are coming from UAs that are similar to (with minor Windows version variations):
[21:57:33] Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36
[21:57:52] which is a release of Chrome for Windows that came out back in Mar 2015, and is now deprecated/dead on Chrome's constant march of upgrade progress
[21:58:55] it's all kind of mysterious. what changed in the world of https user agents ~July 20?
[22:00:05] I'm trying to imagine what would cause large volumes of Windows users to suddenly install a Chrome version that's over a year out of date.
[22:01:15] also, as far back as Chrome 30, Chrome has supposedly supported TLSv1.2
[22:01:30] so I don't get why these supposed Chrome/41's would be on 1.1
[22:04:24] could be a fake UA of course
[22:20:58] can you look at the Cookie header too, to see if the GeoIP cookie points to a particular location?
[22:23:16] could be a state actor using a MITM protocol downgrade attack, for example
[22:24:30] yeah maybe
[22:25:21] I've been looking longer at UAs on one text cache each in ulsfo+eqiad+esams, and it's always some variant of Windows + Chrome/41.0.2272.76 that's topping the TLSv1.1 stats
[22:27:29] how are you looking at requests? out of curiosity I ran `varnishlog -n frontend -q 'ReqHeader ~ TLSv1.1'` on cp1051 but didn't see any requests. The same invocation shows log records if I change TLSv1.1 to TLSv1.2
[22:28:01] sampling cookies now
[22:28:27] 1051 is cache_misc
[22:28:35] I'm looking at cache_text
[22:28:45] oh, right
[22:30:07] anyways, I'm taking another 10-minute sample on one text node in esams+eqiad+ulsfo with Cookie: data
[22:32:12] my guess would be that some bug in that version of Chrome is interacting poorly with some idiosyncrasy of how our servers reply, which causes Chrome to downgrade
[22:33:18] https://bugs.chromium.org/p/chromium/issues/detail?id=468076 looks related
[22:34:10] but why did the stats anomaly take off several days ago? there's no reason people would suddenly be installing new copies of Chrome/41
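As a rough illustration of the kind of UA sampling described above: a hypothetical throwaway helper (not anything in puppet) that tallies User-Agent headers from varnishlog output piped in on stdin. It assumes Varnish 4's plain-text varnishlog output, where request-header records contain "ReqHeader ... User-Agent: ...":

    # Hypothetical throwaway helper: run something like
    #   varnishlog -n frontend -q 'ReqHeader ~ TLSv1.1' | python3 tally_uas.py
    # and it prints the most common User-Agent values seen in the sample.
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        # varnishlog request-header records look roughly like:
        #   -   ReqHeader      User-Agent: Mozilla/5.0 (...)
        if 'ReqHeader' in line and 'User-Agent:' in line:
            ua = line.split('User-Agent:', 1)[1].strip()
            counts[ua] += 1

    for ua, n in counts.most_common(20):
        print(f'{n:6d}  {ua}')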
[22:34:30] unless it was something on our end
[22:35:44] I did merge the AES256 change roughly in that timeframe, but I don't think it lines up with when the stats change takes off
[22:37:21] maybe it won't line up exactly because it only manifests when a new session is negotiated?
[22:37:36] (pretty speculative, i admit)
[22:38:57] ooh check out comment #6: https://bugs.chromium.org/p/chromium/issues/detail?id=466977#c6
[22:41:32] yeah
[22:42:01] our AES256 change is nowhere near that stuff. but still, it's possible the change has triggered some very subtle buggy interaction with handshaking
[22:42:22] I can try reverting and see what happens over the next day or two, it's worth a shot.
[22:51:37] the more I think about it, the more the AES256 thing makes sense timing-wise. it looked off to me because I didn't force the change around; it would've been slow to take real effect even on our end (never mind session resumption).
[22:52:27] but the number of clients actually using the removed (ancient non-forward-secret) AES256 ciphers is tiny, and no chrome would've negotiated for that, not even 41.
[22:53:03] so it's not a direct effect. it's something else: either the total size of the cipher list, or perhaps chrome does some advanced probing to guess server capabilities and mitigate bugs, and removing those ciphers trips some logic in that.
[22:53:50] chrome/41, anyways
[22:54:20] the geoip stats didn't show anything really shocking
[22:54:50] a little bit of a case for this being RU-specific, but I think I'm just being fooled by other bias (in the underlying populations of countries+browsers+OS's, etc)
[22:55:54] anyways, I'm forcing updates around for the revert of the AES256 change, should be able to tell if it's the trigger fairly quickly
[22:59:12] I could force ssl sessions to reset, too
[22:59:39] but I'll wait and see if I can see a change without it, or at least one starting up
[23:09:46] Traffic, Analytics, Operations, Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2497291 (BBlack) >>! In T135762#2497082, @ellery wrote: > As far as I can tell, the proposed method also violates the more important property that users need to be randomly a...
[23:16:13] I can't really see the effect I'm looking for in the short term, I don't think. trying a reset of the session caches...
[23:59:11] meh, the AES256 change doesn't seem to be it. still, I'll leave it reverted a few days to be sure.
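A side note on this kind of before/after cipher-list debugging: one quick way to see what a deliberately limited client would negotiate against a server is Python's standard ssl module. A minimal sketch follows; the hostname and the restricted cipher string are illustrative assumptions only, and are not what Chrome/41 actually offers:

    # Minimal sketch: connect with a restricted cipher list and report what the
    # server negotiates; useful for comparing behaviour before and after a
    # server-side cipher-list change. Host and cipher string are examples only.
    import socket
    import ssl

    def probe(host, ciphers=None):
        try:
            ctx = ssl.create_default_context()
            if ciphers is not None:
                # OpenSSL cipher-list syntax; restricting this mimics a limited client
                ctx.set_ciphers(ciphers)
            with socket.create_connection((host, 443), timeout=5) as sock:
                with ctx.wrap_socket(sock, server_hostname=host) as tls:
                    return tls.version(), tls.cipher()
        except (ssl.SSLError, OSError) as exc:
            # a failure here just means the server rejects this combination
            return 'handshake failed', str(exc)

    # default client vs. one limited to an older CBC cipher
    print(probe('en.wikipedia.org'))
    print(probe('en.wikipedia.org', ciphers='ECDHE-ECDSA-AES128-SHA'))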