[00:13:11] <wikibugs>	 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2784442 (10Tgr) Related: {T74328}, {T67383}, {T75935}
[00:54:46] <wikibugs>	 10netops, 10EventBus, 06Operations, 10ops-codfw: kafka2003 switch  port configuration - https://phabricator.wikimedia.org/T150380#2784578 (10RobH) 05Open>03Resolved switch config updated
[01:46:21] <wikibugs>	 10netops, 06Labs, 10Labs-Infrastructure, 06Operations, and 2 others: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2784615 (10Andrew) Update -- documentation for keystone/oauth is pretty poor, and keystone has a habit of leaving partially-completed features to rot so I...
[03:00:41] <wikibugs>	 10netops, 06Labs, 10Labs-Infrastructure, 06Operations, and 2 others: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2784668 (10AlexMonk-WMF) >>! In T150092#2784615, @Andrew wrote: > It compiles properly for labtestcontrol but not for labcontrol, which I'm not yet clear...
[03:20:06] <wikibugs>	 10netops, 06Labs, 10Labs-Infrastructure, 06Operations, and 2 others: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2784739 (10Andrew) > Maybe it needs a require network::constants before the template() or something?  Yep, fixed by changing the erb lookup.
[03:22:52] <wikibugs>	 10netops, 06Operations, 10ops-codfw: Spare pool servers switch configuration - https://phabricator.wikimedia.org/T150400#2784748 (10Papaul)
[10:02:03] <wikibugs>	 10netops, 06Operations, 10hardware-requests, 10ops-eqiad, 13Patch-For-Review: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2785180 (10jcrespo) Mark changed the vlan already. I  need @Cmjohnson to assign it an ip and...
[11:14:21] <wikibugs>	 10netops, 06Operations: HTCP purges flood across CODFW - https://phabricator.wikimedia.org/T133387#2230372 (10faidon) This is almost certainly a Juniper bug. asw-a/b/c/d-codfw currently run JunOS 14.1X53-D27.3 and asw2-d-eqiad runs 14.1X53-D35.3, the currently JTAC-recommended.  Since this issue has persisted...
[11:16:05] <wikibugs>	 10netops, 06Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2785288 (10faidon) p:05Normal>03High
[11:18:52] <wikibugs>	 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2785297 (10faidon) asw2-d-eqiad has been confirmed to be affected by T133387 and enabling IGMP snooping on the QFXes breaks IPv6.  Since this aff...
[12:13:45] <bblack>	 ema: so route chages are esams from codfw to eqiad, and codfw from direct to eqiad
[12:14:33] <ema>	 right
[12:14:41] <bblack>	 I'd do the codfw first I think
[12:15:07] <ema>	 maybe waiting a bit longer for eqiad's cache to refill?
[12:15:07] <bblack>	 it will give a little more time for eqiad backends to recover before esams hits them again
[12:15:26] <ema>	 same concern, different angle :)
[12:15:36] <bblack>	 ulsfo and codfw are "stable" and have good cache contents, and less latency on the refill anyways
[12:16:08] <bblack>	 technically esams is stable by now too (from yesterday)
[12:16:20] <bblack>	 I just worry more about unecessary esams misses than codfw ones I guess
[12:16:30] <bblack>	 (way more traffic if nothing else)
[12:17:47] <bblack>	 in any case, eqiad's backend hitrate is already rocketing back up
[12:18:51] <ema>	 I've gotta go out for lunch, starving! See you in a bit
[12:19:22] <bblack>	 ok
[13:12:51] <wikibugs>	 10netops, 06Operations: Low IPv6 bandwth from Free.fr (AS12322) > Zayo > eqiad - https://phabricator.wikimedia.org/T150374#2785512 (10faidon) 05Open>03declined Since there is only one intermediate ASN between Proxad and us (Zayo) and the path within Zayo and within our network is basically exactly the same...
[13:37:04] <ema>	 bblack: shall we? https://gerrit.wikimedia.org/r/#/c/320774/
[13:37:37] <bblack>	 we shall!
[13:48:44] <bblack>	 heh, too many package upgrades, my brain fell behind
[13:48:55] <bblack>	 ema: I think I inadvertently upgraded the varnish packages on maps+misc, too :)
[13:49:01] <bblack>	 but not restarted yet
[13:49:12] <bblack>	 have been post-upgrade puppeted though
[13:49:51] <bblack>	 I guess I'll step through their varnishd restarts now
[13:49:55] <ema>	 ok
[14:17:35] <ema>	 bblack: esams back to eqiad? https://gerrit.wikimedia.org/r/#/c/320778/
[14:18:38] <bblack>	 yeah
[14:41:44] <bblack>	 now we're finally getting curve stats that sound right
[14:42:13] <bblack>	 ~41% X25519 vs ~51% chacha20-poly1305
[14:42:53] <bblack>	 chacha was released with chrome49, x25519 with chrome50
[14:43:06] <ema>	 nice
[14:43:14] <bblack>	 chacha is also in some new FF releases, x25519 is rare in anything popular other than chrome so far
[14:43:35] <bblack>	 before we had x25519 > chacha20, which couldn't possibly be right given the above heh
[14:44:32] <bblack>	 chacha is in FF from 47 onwards I think
[17:13:04] <ema>	 bblack: does this look like kinda-the-way-to-go for extrachance? https://phabricator.wikimedia.org/P4406
[17:13:20] <ema>	 (it compiles)
[17:47:49] <bblack>	 ema: yeah, although the max of 5 seems kinda arbitrary.  for those that desire the extrachance behavior, I imagine the value they want to use there scales as some function of the number of open connections to backends they're likely to have.
[17:48:20] <bblack>	 e.g. if you commonly have 10000 open connections, you may want to let extrachance find+fix the timing race 100 times to minimize the problem, I donno
[17:48:37] <bblack>	 I'm not even sure how that would scale, since it seems like a bad idea in general :)
[17:49:55] <bblack>	 I'd say if you're going to go that route, maybe let the fixed max be at least something like "100"
[17:50:16] <bblack>	 the alternative is make the parameter signed and allow -1 for unlimited (current behavior)
[17:50:51] <bblack>	 but then the patch gets more-complex to support the -1 case (you'd want to leave the existing extrachance=0/1 behavior, and then include the new counter separately if it's >0
[17:50:55] <bblack>	 )
[17:51:02] <bblack>	 err, >= 0
[17:51:17] <bblack>	 maybe that was a bad way of saying it
[17:52:15] <bblack>	 leave the existing code mostly-alone (including all references to the variable "extrachance").  have the new param be -1 => 100, where -1 is unlimited and default (current) behavior.  then at the top "int extrachance_limit = ...parameter...",
[17:53:26] <bblack>	 and at the bottom "} while (extrachance && (extrachance_limit-- != 0))"
[17:54:35] <bblack>	 but really either way works, I'd just bump the 5 significant higher for max in the simpler version
[17:54:51] <bblack>	 either way file a bug upstream, propose a solution, see what they sau
[17:54:54] <bblack>	 *say
[17:56:14] <bblack>	 they may say the bug lies elsewhere by fixing some other constraint on backends, it should be impossible for that to happen more than once even with the existing code in that function.
[18:11:58] <paravoid>	 so, I'm encountering a super weird issue
[18:12:13] <paravoid>	 pretty consistenly though, and it may or may not be indicative of larger issues
[18:12:20] <paravoid>	 I don't have much data yet, I'll report what I have
[18:12:43] <paravoid>	 using lftp from jessie (but not sid), fetching https://people.wikimedia.org/~jmm/node/nodejs-dbg_6.9.1~dfsg-1+wmf1_amd64.deb (a 122MB file)
[18:12:57] <paravoid>	 from Europe, so esams misc-web
[18:13:15] <paravoid>	 speed starts from 1MB/s and drops down to 300K/s
[18:13:49] <paravoid>	 fetching it with wget, curl or sid's lftp seems to get me speeds in the order of 10-20MB/s
[18:14:00] <paravoid>	 it's pretty crazy
[18:14:52] <paravoid>	 I can reproduce it from multiple different hosts *and* networks, that don't share paths
[18:15:17] <paravoid>	 could it be e.g. ciphers?
[18:18:02] <paravoid>	 just reproduced it from bast3001 fwiw
[18:18:17] <bblack>	 could just be lftp bug in network efficiency, too
[18:18:27] <paravoid>	 no because it doesn't happen in anything else
[18:18:35] <bblack>	 anything else being what?
[18:18:42] <paravoid>	 other downloads from other networks
[18:19:14] <bblack>	 I know, but there's clearly a pattern here
[18:19:31] <bblack>	 you have 3x UAs that work with us, and 1x that doesn't work with us but does with some others.
[18:19:51] <bblack>	 there's a difference in that UA for sure.  there may also be a difference in us vs the other example cases.
[18:20:33] <bblack>	 looking up lftp version info, etc...
[18:21:43] <bblack>	 (also, are you sure there aren't "obvious" differences in sid+jessie lftp behavior in this case, like ipv6 vs ipv4, or http/1 vs http/2?)
[18:22:56] <bblack>	 4.7.2 (sid version, jessie is 4.6.0) has this in changelog: ssl: improved ssl performance for small read sizes
[18:23:22] <bblack>	 perhaps the previous behavior had soem bad interaction with our dynamic record sizing?
[18:23:45] <bblack>	 4.7.0: ssl: optimized ssl for speed and lower syscall count.
[18:23:47] <paravoid>	 hmm
[18:23:50] <paravoid>	 could be
[18:24:23] <paravoid>	 jessie's negotiates with TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, sid's with TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256, fwiw
[18:24:34] <paravoid>	 I'll see what jessie's curl does next
[18:24:39] <bblack>	 is this stock jessie/sid, or our jessie/sid?
[18:24:46] <paravoid>	 stock
[18:25:02] <bblack>	 I just mean, our jessie has CHACHA20 in both openssl-1.0.2 and openssl-1.1.0
[18:25:10] <bblack>	 I guess sid has 1.1.0 already?
[18:25:35] <paravoid>	 lftp is linked against gnutls
[18:25:41] <bblack>	 ah
[18:27:22] <paravoid>	 curl (w/ gnutls) also does TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, so I guess unlikely that's some cipher limitation
[18:27:28] <paravoid>	 could be what you just said
[18:28:20] <wikibugs>	 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2786655 (10bearND) Another idea I brought up with GWicke on IRC yesterday is to embed the original dimensions in the URL (either as query parameters or in...
[18:28:44] <bblack>	 our tuning on that is pretty arbitrary FWIW (which we figure is better than nothing, but may not be)
[18:29:00] <bblack>	 the the cloudflare patch we're starting with a small record size, then there's an intermediate bump, then a final bump to 16K
[18:29:33] <bblack>	 most other implementations I've seen do it differently.  they start with the small records, then after 1MB of continuous transfer they switch to 16K (dropping back after output pause).
[18:29:43] <wikibugs>	 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2786657 (10bearND) > look into which users rely on the current thumb format  The apps and MCS rely on the current thumb format to resize thumbnails downwa...
[18:29:54] <bblack>	 then there's nginx upstream, which seems to recommend picking a middle value like 4K as static
[18:30:59] <paravoid>	 so
[18:31:05] <paravoid>	 fetching https://mirrors.wikimedia.org/debian/pool/main/n/nodejs/nodejs-dbg_6.9.1~dfsg-1_amd64.deb
[18:31:15] <paravoid>	 gets me 4MB/s
[18:31:18] <paravoid>	 but dropping
[18:31:44] <bblack>	 does it get faster with sid lftp on that as well?
[18:32:10] <bblack>	 mirrors would have some past version of our nginx build (IIRC?), but not the dynamic record size tuning or other bits
[18:32:39] <paravoid>	 no, about the same speed with sid's lftp
[18:32:40] <paravoid>	 hrm
[18:33:04] <bblack>	 root@sodium:~# nginx -V
[18:33:05] <bblack>	 nginx version: nginx/1.11.3
[18:33:05] <bblack>	 built with OpenSSL 1.0.2h  3 May 2016
[18:33:25] <bblack>	 might be an interesting test to update nginx there and see what happens (which will also bring in openssl-1.1)
[18:34:33] <paravoid>	 doing so
[18:34:34] <bblack>	 (but newer nginx there won't turn on dynamic record sizes without config changes)
[18:34:55] <paravoid>	 even better then
[18:35:55] <wikibugs>	 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2786705 (10Cmjohnson) Received the new cables and finished with the row redundancy less the d1 to d8 link which will need fiber.
[18:36:04] <bblack>	 cache_misc (people.wm.o) settings for dynamic record size tuning is:
[18:36:05] <bblack>	     ssl_dyn_rec_enable on;    # cf patch default: off
[18:36:05] <bblack>	     ssl_dyn_rec_size_lo 1300; # cf patch default: 1369
[18:36:05] <bblack>	     ssl_dyn_rec_size_hi 4096; # cf patch default: 4229
[18:36:05] <bblack>	     ssl_buffer_size 16k;      # nginx default: 16k (also max possible)
[18:36:39] <bblack>	 of course there's tons of other differences with cache_misc, especially at the OS level (fastopen and various tcp params, etc...)
[18:36:56] <bblack>	 most of which you wouldn't think would affect this, but still
[18:37:12] <paravoid>	 jeez
[18:37:14] <paravoid>	 `nodejs-dbg_6.9.1~dfsg-1_amd64.deb' at 491184 (0%) 71.4K/s eta:30m [Receiving data]  
[18:37:23] <paravoid>	 updated nginx, no new settings at all
[18:37:24] <paravoid>	 same config
[18:37:49] <paravoid>	 bumped to 2.4MB/s now?!
[18:37:51] <bblack>	 that's EU->eqiad though right?
[18:37:52] <paravoid>	 wtf
[18:37:56] <bblack>	 heh
[18:38:02] <bblack>	 see if it's better with sid lftp?
[18:38:05] <paravoid>	 yeah but 70K/s and 30K/s is crazy
[18:38:24] <bblack>	 I'm just saying, it may be more-affected than your fetch from people, which terminates in esams
[18:38:33] <paravoid>	 yeah
[18:38:50] <bblack>	 does sid lftp get mirrors at normal speeds?
[18:38:54] <paravoid>	 ok, now I can't reproduce, maybe nginx was just booting up or something
[18:39:14] <bblack>	 ok
[18:39:27] <bblack>	 the default is actually 4k, I documented that wrong in the comment above
[18:39:40] <bblack>	 oh wait, no, comment is right :)
[18:39:54] <bblack>	 default is 16K records, which can make for slow TLS handshaking stuff
[18:40:42] <bblack>	 oh I missed a setting on cache_misc tuning, there's also:
[18:40:50] <bblack>	     ssl_dyn_rec_threshold 40; # cf patch default: 40
[18:40:50] <bblack>	     ssl_dyn_rec_timeout 1s;   # cf patch default: 1s
[18:40:55] <bblack>	 but those are supposedly defaults
[18:41:07] <bblack>	 on upload and maps it's set more-aggresive at:
[18:41:08] <bblack>	     ssl_dyn_rec_threshold 20; # cf patch default: 40
[18:41:09] <bblack>	     ssl_dyn_rec_timeout 3s;   # cf patch default: 1s
[18:41:48] <paravoid>	 put all of those in, can't reproduce
[18:42:11] <bblack>	 puppetfails in -ops, looks like apt/mirrors short outage?
[18:42:43] <bblack>	 for future reference "service nginx upgrade"
[18:42:59] <paravoid>	 it failed
[18:43:06] <bblack>	 huh?
[18:43:29] <paravoid>	 postinst tried to do that and reported a failure
[18:43:52] <paravoid>	 Upgrading ...[failed] in red
[18:43:53] <bblack>	 oh, why? that always works
[18:43:58] <paravoid>	 I didn't bother investigating why
[18:44:49] <paravoid>	 so it's not the ssl_dyn stuff
[18:45:13] <bblack>	 ok
[18:45:37] <bblack>	 but jessie lftp is still slow against mirrors now, and wasn't before bumping nginx+openssl?
[18:45:51] <paravoid>	 jessie lftp isn't slow against mirrors
[18:45:55] <paravoid>	 it's same as sid atm
[18:46:12] <bblack>	 ah ok
[18:46:14] <paravoid>	 so everything works fine against mirrors
[18:46:30] <paravoid>	 with both old nginx, new nginx, new nginx+ssl dyn config
[18:46:36] <paravoid>	 s/both/all of/
[18:46:41] <bblack>	 hmmmm
[18:47:42] <paravoid>	 it's all reproducible from bast3001 if you want to have a look yourself
[18:47:49] <paravoid>	 (but happy to continue to be your console too)
[18:48:47] <bblack>	 what flags for lftp, etc?
[18:49:11] <paravoid>	 nothing
[18:49:23] <paravoid>	 lftp https://people.wikimedia.org/~jmm/node/
[18:49:31] <bblack>	 when I just put the url after lftp, it does some chdir thing and gives me a prompt
[18:49:31] <paravoid>	 then "get nodejs-dbg_6.9.1~dfsg-1+wmf1_amd64.deb"
[18:49:33] <bblack>	 I've never used it
[18:49:38] <bblack>	 ok
[18:50:43] <paravoid>	 I just used it because it's smart enough to recognize apache's/nginx's etc. directory listing and giving you "ls" "get" etc.
[18:50:54] <paravoid>	 treat an http endpoint as an ftp server basically
[18:51:18] <bblack>	 lftp -e 'get nodejs-dbg_6.9.1~dfsg-1+wmf1_amd64.deb' 'https://people.wikimedia.org/~jmm/node/'
[18:51:59] <bblack>	 I get ~4MB/s from here to eqiad for people with sid lftp
[18:52:27] <paravoid>	 yeah sid lftp wfm too
[18:52:44] <paravoid>	 sid lftp is also sid gnutls, though, so that's a very different stack
[18:53:10] <bblack>	 heh
[18:53:21] <bblack>	 it changed tls libraries from 4.6.0 to 4.7.2? :)
[18:53:37] <paravoid>	 no
[18:53:44] <paravoid>	 it's jessie vs. sid gnutls too, I mean
[18:55:16] <bblack>	 oh ok
[18:56:55] <bblack>	 so bast1001 -> cache_misc@eqiad for that people transfer: wget reports 60MB/s (but it's 2s long, so it's probably inaccurate)
[18:57:07] <bblack>	 jessie lftp starts at 1MB/s and declines
[18:57:10] <paravoid>	 ooh interesting
[18:57:16] <bblack>	 it's down to 314K now
[18:57:17] <paravoid>	 yeah
[18:57:20] <paravoid>	 and stops at 300K
[18:57:24] <paravoid>	 or thereabouts
[18:57:29] <bblack>	 that sounds like generic tcp buffering problems
[18:57:31] <paravoid>	 and remains stable there, like it's throttled at 300K
[18:57:52] <bblack>	 but then that wouldn't explain it working with other sites well
[18:59:48] <bblack>	 both are using IPv6
[19:00:06] <paravoid>	 yeah that was my next test
[19:01:05] <paravoid>	 same with ipv4 fwiw
[19:01:45] <paravoid>	 the only reason I'm still investigating this is because I fear this may be affecting other UAs as well
[19:02:31] <paravoid>	 I really hope this isn't just some odd silly thing and I'm not wasting my and your time
[19:05:13] <bblack>	 you'd be surprised how much time I spend chasing seeming silly things only to find out they matter :)
[19:07:22] <bblack>	 even with mirrors, for eqiad<->eqiad lftp is slower than wget
[19:08:22] <bblack>	 this time wget reported 121MB/s, but lftp always gets limited to a bit under 30MB/s
[19:08:27] <bblack>	 (eqiad<->eqiad, hitting mirrors)
[19:08:47] <bblack>	 I wonder what makes cache_misc so much worse, though
[19:09:37] <bblack>	 something tcp-related I'd think
[19:11:38] <bblack>	 I donno I'm still inclined to blame lftp-4.6.0 and/or its gnutls more than anything
[19:12:24] <bblack>	 comparison-testing that same nodejs file from bast1001 to mirrors.xmission.com using lftp and wget, wget wins there too
[19:12:28] <bblack>	 (but not by as much)
[19:12:46] <bblack>	 also kinda variable, though :/
[19:20:23] <wikibugs>	 10Traffic, 06Operations, 10RESTBase, 06Services (doing): Restbase redirects with cors not working on Android 4 native browser - https://phabricator.wikimedia.org/T149295#2786916 (10Pchelolo) 05Open>03Resolved Merged and deployed. Resolving. Hopefully we've got all of the CORS edge cases now.
[19:32:55] <paravoid>	 so weird
[19:36:16] <wikibugs>	 10Traffic, 06Operations, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2787072 (10fgiunchedi)
[19:39:30] <wikibugs>	 10Traffic, 06Operations, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2787097 (10fgiunchedi) Also the same happens on all esams misc frontend, needs further investigation
[19:48:31] <gwicke>	 bblack: some mobile benchmarks I just did show slow production taking a long time for TLS negotiation compared to the labs proxy: https://www.webpagetest.org/video/compare.php?tests=161110_K9_29JN-r:1-c:0 vs. https://www.webpagetest.org/video/compare.php?tests=161110_CD_2FB9-r:1-c:0
[19:49:03] <gwicke>	 1918ms vs. 976ms, on 2G
[19:49:38] <gwicke>	 s/slow production/production/
[19:49:55] <bblack>	 what's swproxy?
[19:50:18] <gwicke>	 a serviceworker proxy
[19:50:18] <gwicke>	 https://swproxy-mobile.wmflabs.org/wiki/Foobar
[19:50:48] <gwicke>	 it runs a serviceworker on behalf of clients that don't support serviceworkers natively
[19:51:31] <bblack>	 I mean what TLS software is it?
[19:51:41] <bblack>	 node.js linked against ... ?
[19:51:44] <gwicke>	 the nginx labs proxy
[19:51:57] <bblack>	 oh it uses the labs proxy, ok
[19:52:32] <gwicke>	 the thing is that I'm rather sure that I didn't see this difference in tls negotiation timings ~2 weeks ago
[19:53:06] <bblack>	 entirely possible
[19:53:28] <bblack>	 but there are also a lot of different things that could be causing it
[19:54:54] <gwicke>	 I have seen this in several repeats of this test
[19:55:04] <gwicke>	 and they are all on a fresh chrome instance
[19:55:04] <bblack>	 I think in our navtiming, the closest proxy for it would be responsestart, right?
[19:55:42] <gwicke>	 I'm not too familiar with those metrics
[19:56:05] <gwicke>	 but yeah, if that timer includes tls negotiation, then it should show a change
[19:58:12] <gwicke>	 but yeah, this is from a single IP only, so might be some geoip mapping issue
[19:58:50] <bblack>	 or a lot of other things really, it's just hard to say from one example, without deep network debugging :)
[19:59:02] <gwicke>	 although some of the other latency sensitive parts are rather identical
[20:00:27] <gwicke>	 kk, I'll do some more checking, just wanted to rule out that there were major changes in tls recently that could have caused a regression
[20:03:21] <bblack>	 there are almost always major changes recently heh
[20:03:44] <bblack>	 but also, labs proxy and prod nginx are differently-configured to begin with
[20:04:51] <bblack>	 (prod cert has more SANs on it and thus is larger for the same authalg, prod sends OCSP stapling data (another ~1600 bytes), different cipher and authalg options in general, etc...)
[20:13:48] <gwicke>	 the timing delta is equivalent to one RTT
[20:14:05] <gwicke>	 I'm repeating the test from another location now
[20:15:45] <bblack>	 yeah I tried from here on my laptop and I do I see a diff, but my times are considerably lower in general, so it's hard to factor out the absolutes :)
[20:15:57] <bblack>	 I could maybe point at esams and find some better diff
[20:16:16] <bblack>	 but, again, we expect a cost to some of our config (it's possible stapling alone does it)
[20:17:09] <bblack>	 oh right can't point at esams for labs
[20:17:40] <bblack>	 I guess I could I'm reaching eqiad for prod equivalence, and then add some artificial network delay to enhance it
[20:17:47] <bblack>	 s/could/could ensure/
[20:18:23] <gwicke>	 just ran a test in amsterdam, and tls setup time is about the same there
[20:18:47] <bblack>	 oh heh
[20:19:02] <bblack>	 no wonder I saw a diff, I still have my /etc/hosts hacked pointing enwiki at amsterdam for me, from a previous test :)
[20:19:05] <gwicke>	 https://www.webpagetest.org/video/compare.php?tests=161110_3T_2JBS-r:1-c:0
[20:19:23] <gwicke>	 2052 ms
[20:20:09] <gwicke>	 interestingly, hitting labs from amsterdam is still faster: https://www.webpagetest.org/video/compare.php?tests=161110_PT_2JCC-r:1-c:0
[20:20:52] <bblack>	 is all of this through webpagetest, the results you see?
[20:21:33] <gwicke>	 let me test locally as well
[20:22:00] <bblack>	 to test locally and accurately, you have to hit eqiad
[20:22:16] <bblack>	 (as labs is only at eqiad)
[20:23:14] <bblack>	 I do get 1xRTT extra on prod vs labs here though, it looks like
[20:24:00] <bblack>	 my ping RTT is 63ms, my swproxy test is 61ms for SSL time, and my en.m.wp.o test shows 134ms
[20:24:18] <bblack>	 given normal jitter in all of those things, it's roughly +1xRTT
[20:24:27] <bblack>	 now to add some delay...
[20:26:29] <bblack>	 yup
[20:26:54] <bblack>	 I added 1000ms adtificial delay on my laptop, and now the times are 1.07s and 2.13s for the SSL part of chrome's timings
[20:27:02] <bblack>	 so it's +1xRTT
[20:27:47] <gwicke>	 *nod*, that's consistent with the webpagetest results
[20:27:59] <gwicke>	 the 2g setting there has 800ms rtt
[20:28:48] <bblack>	 I'll have to dig a lot to find out why
[20:29:04] <bblack>	 there's a chance it's just the unavoidable cost of the OCSP Staple, but it could be something else, too
[20:29:28] <gwicke>	 about two weeks ago I'm pretty sure that I didn't get such a big difference
[20:29:31] <bblack>	 it's been a long time (maybe ~2y?) since I've actually looked at the whole handshake at the sniffer level to see how things are
[20:30:30] <bblack>	 do you have any firmer idea on the "about two weeks"?
[20:31:03] <bblack>	 (FWIW, I can't see an appreciable change in the past ~30d on responseStart)
[20:31:57] <gwicke>	 my last testing session was October 28th
[20:32:07] <bblack>	 ok
[20:32:37] <gwicke>	 https://www.webpagetest.org/video/compare.php?tests=161028_CA_YGX-r:1-c:0
[20:32:52] <gwicke>	 SSL Negotiation: 1043 ms
[20:32:59] <gwicke>	 same 800ms 2G setting
[20:33:20] <bblack>	 so, since mid-october, we've upgraded openssl, then re-downgraded over a bug, then upgraded again, and Oct 28th falls in the middle period of "re-downgraded" :)
[20:33:23] <bblack>	 fun
[20:33:52] <bblack>	 but it may not be related to the openssl updates, we'll see once I get deeper into this
[20:34:18] <gwicke>	 full October 28th run results at https://www.webpagetest.org/result/161028_CA_YGX/
[21:03:37] <wikibugs>	 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2787386 (10GWicke) >>! In T66214#2786655, @bearND wrote: > Another idea I brought up with GWicke on IRC yesterday is to embed the original dimensions in t...
[21:16:53] <gwicke>	 bblack, this graph seems to suggest a change in responseStart timings around Nov 1st: https://grafana.wikimedia.org/dashboard/db/navigation-timing?panelId=6&fullscreen&var-metric=responseStart
[21:28:13] <bblack>	 yeah but it suggestions many such changes :)
[21:28:25] <bblack>	 *suggests
[21:29:45] <bblack>	 in any case, I did confirm that OCSP makes the difference.  it's what pushes us over the bytes boundary to get the extra RTT.
[21:30:11] <bblack>	 but there's no fundamental reason that should've changed recently, so more digging is in order
[21:30:39] <bblack>	 (but I can test with artificial delay and the openssl CLI and confirm expected RTT on prod without "-status" to ask for OCSP, and +1xRTT if I add "-status")
[21:31:17] <bblack>	 we may have eaten the extra RTT cost in the name of OCSP stapling way way way back when we first enabled it, but it shouldn't have changed recently
[21:31:41] <bblack>	 anyways, lots more digging to do yet
[21:44:43] <wikibugs>	 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2787495 (10GWicke) >>! In T66214#2786657, @bearND wrote: >> look into which users rely on the current thumb format >  > The apps and MCS rely on the curre...
[21:51:21] <wikibugs>	 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2787513 (10CRoslof) So, as I understand it, there is no problem with the current state of affairs that the requested change would fix. Also, I ha...
[22:12:27] <wikibugs>	 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2787573 (10Tgr) >>! In T66214#2786657, @bearND wrote: > Requirements for a long term solution should include, in addition to scaling, the ability to crop...
[22:42:53] <bblack>	 hmmm, it may actually be that the extra RTT is since openssl-1.1.0 deploy, and has to do with nginx and/or openssl's buffering logic at the start of the connection
[22:43:42] <bblack>	 initially this was looking like the old iw3 limit, but everything client and server side for my tests should do iw10
[22:43:52] <bblack>	 it's just a similarish limit in buffering somewhere, that's stopping us at ~4k
[22:45:07] <bblack>	 probably here: https://github.com/nginx/nginx/blob/master/src/event/ngx_event_openssl.c#L812
[22:45:20] <bblack>	 is where they worked around a similar issue long ago, I bet something changed about that workaround with 1.1.0
[23:28:53] <wikibugs>	 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2787864 (10bearND) >>! In T66214#2787573, @Tgr wrote: > Is there any situation where that information could not be easily provided alongside in a more str...