[00:07:32] <wikibugs>	 Traffic, netops, Cloud-VPS, Operations, cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3749641 (bd808)
[09:29:25] <wikibugs>	 Traffic, netops, Cloud-VPS, Operations, cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3749582 (akosiaris) Just as a note, if it proves it's not possible to do this in our openstack, vagrant is also a valid...
[12:59:28] <moritzm>	 https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/
[13:17:24] <ema>	 nice!
[13:17:41] <ema>	 wikichip looks awesome I didn't know about it https://en.wikichip.org/wiki/WikiChip
[13:36:12] <ema>	 so, AVX-512 instructions use a lot of power; if you use them, dynamic frequency scaling kicks in and affects other workloads
[13:39:30] <ema>	 conclusion: disable AVX-512? Really? Can't you disable dynamic frequency scaling instead?
[13:41:21] <bblack>	 the chip has thermal limits.  the reason the frequency scales down is because the chip would melt if it didn't.
[13:41:28] <ema>	 ah!
[13:43:43] <bblack>	 it's a fascinating blog post for a few different reasons though
[13:45:00] <ema>	 the fact that boringssl performs better by using xor/shift/add instead of AVX2 multiplication is pretty cool
[13:45:09] <bblack>	 the drop in transactions on switching to avx-512-chapoly is not a huge deal at a 10% drop in transactions when fully-loaded
[13:45:33] <bblack>	 but yeah there's implications for the rest of a mixed server.  the shift to chapoly slowed down varnish-fe, basically.
[13:46:29] <bblack>	 but we run considerable performance margins to avoid having all kinds of stalling/delaying/queueing, so we have plenty of headroom to absorb such a problem and not notice it much, obviously.
[13:46:48] <moritzm>	 I also found it interesting that cloudflare switched to boringssl, wasn't aware of that
[13:47:09] <bblack>	 yeah I think they've been on it for a while
[13:47:31] <bblack>	 boringssl definitely has its upsides.  but it also adds some maintenance burden on our end, tracking their release-less development.
[13:48:01] <bblack>	 might be an option down the road though, especially if we have a dedicated person on the job :)
[13:48:51] <bblack>	 but also interesting is their link to https://blog.cloudflare.com/arm-takes-wing/
[13:49:11] <ema>	 very, yes
[13:49:20] <bblack>	 those falkor chips look really interesting.  we should try to experiment with one when the opportunity arises.
[13:50:35] <bblack>	 tl;dr - "The production version of the Centriq SoC will feature up to 48 Falkor cores, running at a frequency of up to 2.6GHz, for a potential additional 8% better performance.
[13:50:39] <bblack>	 [...]
[13:50:48] <bblack>	 The largest win by far for Falkor is the low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (for the go benchmark). In comparison Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W."
[13:51:28] <bblack>	 and then toss in the benefits of not being tied to all things x86 (the management engine fiasco, legacy BIOS, blah blah)
[13:54:03] <paravoid>	 non-x86 has a cost though :)
[13:54:19] <bblack>	 sure, as does anything too-new to be standardized and relatively-unbuggy :)
[13:54:29] <paravoid>	 anything from how d-i will work, to having to build packages twice etc.
[13:54:39] <bblack>	 but if we get the chance to get a vendor to give us an early model for testing, it could help pave a future path
[13:54:57] <paravoid>	 we've had a vendor try to pitch us ARM before, it didn't go very well
[13:55:01] <bblack>	 apparently even Dell has done some one-off demo servers on various ARMs
[13:55:02] <paravoid>	 but at the time it was much slower
[13:55:06] <bblack>	 but nothing for sale
[13:56:52] <paravoid>	 HP had project Moonshot
[13:57:47] <bblack>	 in general, rewinding to all the rambling about large NUMA systems and our 3x layers (nginx->vfe->vbe) embedded per-cache-host, etc.  There are potentially a lot of simplifications and perf wins if we could split the cache layer up into a larger count of smaller/simpler servers, at ~ the same overall budgets on things like cost, power, rackUs
[13:58:17] <bblack>	 which is orthogonal to these new ARM chips, but related in how we evaluate what might become available on the market
[13:59:27] <bblack>	 at least splitting it out into just separate fe and be nodes, anyways.  and possibly having the text/upload split only exist in the fe layer, with a shared be layer.
[14:03:10] <bblack>	 one could imagine a future in which we had a set of fe nodes for TLS+memory caching on something like these Falkor ARM chips, and a set of x86 be nodes for running large/fast storage with arrays of nvme drives or whatever.
[14:03:40] <bblack>	 (but one could imagine 100 different such alternative futures, it's just a thought)
[14:05:59] <bblack>	 the kind of thinking that makes the case for our current architecture is that the 2 layers (or 3 daemons) do different things, so you can more-efficiently use up all the resources of a fatter server with a combined host role.
[14:06:27] <bblack>	 TLS using the CPUs+NICs more, fe cache using the big memory more, be using storage subsystem more.
[14:07:28] <bblack>	 but in practice, I think we lose a lot of efficiency, especially on these fat NUMA servers, to the OS trying to balance out the perf and resource utilization of these varying sorts of workloads all together in a single host's context.
[14:08:00] <bblack>	 and then we lose time trying to fine-tune to make it work better
[16:34:28] <wikibugs>	 Traffic, Operations, Prometheus-metrics-monitoring: authdns prometheus metrics are not available anymore - https://phabricator.wikimedia.org/T180256#3751470 (ema)
[16:34:34] <wikibugs>	 Traffic, Operations, Prometheus-metrics-monitoring: authdns prometheus metrics are not available anymore - https://phabricator.wikimedia.org/T180256#3751482 (ema) p:Triage>Normal
[16:36:47] <wikibugs>	 Traffic, netops, Operations, Cloud-VPS (Quota-requests): Request increased quota for traffic Cloud VPS project - https://phabricator.wikimedia.org/T180178#3751487 (ema) p:Triage>Normal
[16:39:31] <wikibugs>	 Traffic, Operations: Puppet / LVS: confusion in service vs IP name - https://phabricator.wikimedia.org/T180257#3751493 (Gehel)
[20:25:51] <wikibugs>	 HTTPS, Traffic, Operations, Wikimedia-General-or-Unknown: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries - https://phabricator.wikimedia.org/T180269#3751881 (Reedy)
[22:11:53] <wikibugs>	 HTTPS, Traffic, Operations, Wikimedia-General-or-Unknown: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries - https://phabricator.wikimedia.org/T180269#3751867 (Reedy) Do we have any examples of this actually affecting any apps?  Are these apps actively maintained?...