[07:34:47] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2521610 (10ema) New test performed this morning: sudo varnishd -a :81 -f /var/tmp/frontend-v4.vcl -F -n frontend sudo rm /var/tmp/varnish.main1 ; sudo varnishd -a :31...
[08:40:18] 10Traffic, 10Varnish, 06Operations: Analyze Range requests on cache_upload frontend - https://phabricator.wikimedia.org/T142076#2521768 (10ema)
[09:51:49] 10Traffic, 10Varnish, 06Operations: Analyze Range requests on cache_upload frontend - https://phabricator.wikimedia.org/T142076#2521872 (10ema)
[11:04:39] 10Traffic, 10Varnish, 06Operations: Analyze Range requests on cache_upload frontend - https://phabricator.wikimedia.org/T142076#2521949 (10ema) I've collected 30 minutes of frontend GET requests on cp1048 as follows: varnishncsa -m 'RxRequest:GET' -F '%{Range}i %{Content-Length}o %r' -n frontend Out of 1...
[11:52:10] 10Traffic, 10Varnish, 06Operations: Analyze Range requests on cache_upload frontend - https://phabricator.wikimedia.org/T142076#2521992 (10faidon) Note that for some media file types, such as Ogg (the container format), it's impossible to know the file's duration in seconds from the header of the file. Brows...
[14:09:14] bblack: so the chacha draft % is relatively small?
[14:09:28] and declining presumably
[14:14:03] no cronspam from cp* hosts since June \o/
[14:15:00] on the other hand there's been a critical alert from librenms for cr2-eqiad.wikimedia.org - BGP Session down got better
[14:15:24] surely "got better" should be good news? :)
[14:16:54] oh, I see, all recoveries are critical if the original severity was critical
[14:18:05] paravoid: this morning we had two interfaces down on cr1-eqord for a few minutes, I wasn't sure whether it was time to be very worried or not
[14:19:07] then they came back so I carried on with my life
[14:21:51] paravoid: it's about 19.4% avg so far, over the 22h of data we have
[14:22:06] paravoid: (that is, 19.4% of chapoly-ciphered requests use the draft mode)
[14:22:24] ema: yeah, don't get too worried about that
[14:22:26] it's hard to say if it's really declining until we have some longer-term data
[14:23:21] chapoly overall is something like 23% of all requests
[14:23:28] so draft mode is ~4% of all requests
[14:25:06] there's a lot of unknowns about the future, though, with how the android ecosystem works. maybe a new vendor-deranged variant of Android 5 is the baseline of a brand-new phone next week that becomes super-popular in China for the next three years. And it brings in chapoly-draft from upstream and uses it a lot and doesn't fix it for rfc mode.
[14:25:17] etc
[14:26:01] even if it just stays stable at ~4%, though, if the patch for draft+openssl-1.1.0 is late and/or kinda questionable on security/stability, it's probably droppable.
[14:26:32] there's only so much we can optimize for, and it is just an optimization at this point, so long as AES-GCM remains fundamentally unbroken.
[14:30:44] semi-related, we should push a patch up to nginx to log the negotiated ECDH curve, too
[14:31:15] right now it's all NIST P-256 or whatever name you want to call that one. but when we switch to 1.1.0 we'll start getting x25519 ECDH too, and it would be nice to track stats on that as well.
[14:31:58] (right now there's no way to know, it's an invisible detail stats-wise)
[14:35:14] a vendor-deranged variant of Android 5 without Chrome or UCB or QQ etc.
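A minimal sketch of the ECDH-curve logging discussed at 14:30-14:31 above, assuming a log variable that exposes the negotiated curve. This is not an existing WMF config: 2016-era nginx had no such variable, which is exactly why a patch was being proposed; stock nginx only gained $ssl_curve much later (1.21.4+, built against OpenSSL 3.x), while $ssl_protocol and $ssl_cipher are long-standing.

    log_format tls_stats '$remote_addr $ssl_protocol $ssl_cipher $ssl_curve';

    server {
        listen 443 ssl;
        server_name example.wikimedia.org;                   # placeholder name
        access_log /var/log/nginx/tls_stats.log tls_stats;
        # ... certificates, cipher and keepalive settings as usual ...
    }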
[14:35:27] yeah
[14:35:28] that gets super popular
[14:35:31] kind of unlikely :)
[14:35:47] well
[14:36:09] I haven't dug into the existing stats, but I presume at least some of the draft-mode we get today is from Android 5/6 that do have Chrome
[14:36:24] just not updated chrome you mean?
[14:36:27] not all http(s) requests from those phones go through Chrome, many go through the base libraries/browsers
[14:36:32] which is what only does Draft
[14:36:50] there's a built-in android browser that isn't chrome, and there's libraries for apps to use that aren't chrome's
[14:37:34] WebView
[14:37:53] that's separately updated too though
[14:38:40] Since Android 4.4 (KitKat), the WebView component is based on the Chromium open source project. WebViews now include an updated version of the V8 JavaScript engine and support for modern web standards previously missing in old WebViews. New WebViews also share the same rendering engine as Chrome for Android, so rendering should be much more consistent between the WebView and Chrome.
[14:38:46] In Android 5.0 (Lollipop), the WebView has moved to an APK so it can be updated separately to the Android platform. To see what version of Chrome is currently used on a Lollipop device, simply go to Settings > Apps > Android System WebView and look at the version.
[14:38:51] The WebView will auto-update for mobile devices with Android L and above.
[14:39:05] maybe it's just the library, then?
[14:39:29] ssllabs still reports android 5/6 as draft-mode negotiators (but not the Chrome versions that can be installed on them, obviously)
[14:41:30] I'm taking a UA sample of draft-mode negotiators (just 1 minute on cp1065) to get some idea now
[14:42:21] top 5 entries just uniq-ing on the whole UA string:
[14:42:22] 94 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X; en-US) AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11D257 UCBrowser/10.7.3.808 Mobile
[14:42:26] 110 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
[14:42:29] 148 Mozilla/5.0 (Linux; Android 4.4.4; SGH-M919 Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36 NAVER(inapp; search; 470; 7.3.2)
[14:42:33] 248 Mozilla/5.0 (Linux; Android 5.1.1; LS-4503 Build/LMY47V) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36
[14:42:36] 255 Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) GSA/17.0.128207670 Mobile/13G34 Safari/600.1.4
[14:42:39] NT6.1 == Win7
[14:42:54] there's also a fair sprinkling of variants of this:
[14:42:57] 52 Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-G900P Build/MMB29M) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/4.0 Chrome/44.0.2403.133 Mobile Safari/537.36
[14:43:28] which is latest android release (new one due any day), but SamsungBrowser Chrome-derivative that's delayed on release cycle I guess
[14:44:58] I wasn't even aware mobile safari on iOS 9 did it
[14:46:19] 9.3.3 is the latest iOS build. presumably apple keeps Safari up to date with that
[14:46:33] so I guess Apple did draft-mode and didn't do RFC at all in a released version yet.
[14:51:48] anyways, categorizing just by OS: windows/mac/iphone are smaller. most of it's linux/android stuff doing draft, on e.g. Chrome/48
[14:52:33] Chrome/48 isn't that old.
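A rough sketch of the kind of UA sampling described at 14:41-14:42 above. The 60-second window and output path are assumptions, and the step that actually restricts the sample to draft-chapoly negotiators is left out, since how that TLS detail reaches the logging layer isn't shown in this log.

    # capture one minute of User-Agent strings from the frontend varnish instance
    timeout 60 varnishncsa -n frontend -F '%{User-agent}i' > /tmp/draft-uas.txt

    # top 5 entries, uniq-ing on the whole UA string
    sort /tmp/draft-uas.txt | uniq -c | sort -rn | head -5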
Maybe 49 was the first version with proper support for the RFC version
[14:54:50] https://www.chromestatus.com/feature/5355238106071040
[14:54:55] ^ yeah, Chrome/49 was the first one
[14:56:17] so there's a good chance that (a) the Chrome/4[78] traffic that isn't horribly out of date will update to 49+ and (b) Apple will eventually do an iOS9 point-release with RFC mode, both of which will help.
[15:00:51] hmm we get hits from iOS 10 Beta users too, and they still use draft-mode
[15:03:47] turning this around: the only UAs that negotiate RFC chapoly are all actual Chrome variants
[15:04:06] Chrome/49+ itself, or CriOS (which is Chrome for iOS)
[15:04:26] except for this crazy thing: Mozilla/5.0 Linux: Android/5.1 /Aqua Craze Build/KOT49H Browser/AppleWebKit537.36 Chrome/39.0.0.0 Mobile Safari/537.36 System/Android 5.1;
[15:05:00] some Indian handset
[15:19:31] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2522544 (10ema) Some more observations on the stalling issue: - v3-plus "stalls" for a while (~3s in my tests) on miss - v4 stalls on first hit and doesn't stall on subse...
[15:20:09] bblack: https://phabricator.wikimedia.org/T142076#2521949 0.29% range requests is much less than what I expected. Thoughts?
[15:21:45] ema: yeah I looked earlier, but then I got distracted :)
[15:21:49] I think the regexes look suspect?
[15:22:23] but I'm guessing that's not the whole script
[15:22:24] so I donno
[15:22:59] e.g. what does this mean? /^bytes=0-[0-9]/ && $1 ~ "bytes=0-"$2-1 { silly++ }
[15:23:20] yeah I should have added a comment there
[15:23:23] /^bytes=[3-9][2-9][0-9][0-9][0-9][0-9][0-9][0-9]/ { high++ }
[15:23:26] /^bytes=[3-9][2-9][0-9][0-9][0-9][0-9][0-9][0-9]/ { high++ }
[15:23:34] ^ what about 40.... ?
[15:24:03] so $1 ~ "bytes=0-"$2-1 means:
[15:24:08] $2 is the CL
[15:24:15] probably easier is the whole script :)
[15:24:31] match all Range: bytes=0-CL-1
[15:25:19] that would match Range requests for the whole file essentially
[15:25:23] ok
[15:25:38] and the /^bytes=0-[0-9]/ && is for performance (lol)
[15:25:45] so "silly", in VCL we'd look for those and just strip Range
[15:26:02] yes
[15:26:31] and the norange is definitely all requests that lack a Range header?
[15:26:40] yes
[15:26:58] varnishncsa emits a - if the Range header is not specified
[15:26:59] well it certainly puts the problem in perspective heh
[15:28:23] in that dataset, high-range requests are so rare as to be ignorable
[15:29:01] oh and now I see the problem with the high range regex :)
[15:29:06] you're right
[15:29:15] regular range requests aren't completely ignorable (and they probably represent more bytes than their request-rate would indicate), but still, very small
[15:29:34] even with the wrong regex, you're only going to be off by ~20% of the sample or whatever
[15:29:44] regex fixed, it's 13 high range requests
[15:30:02] ok so it was 30% :)
[15:30:42] I'd say let's take something broader just to confirm it's not an anomaly with just 1 server over a short time
[15:31:15] maybe a 24h sample in one node each in esams, ulsfo, eqiad?
[15:31:42] but while that runs, let's conversationally assume the results will be approximately what we've seen already
[15:32:09] oh also, let's rewind and discuss the persistence thing...
[15:32:18] so you're saying you think the stalling behavior differs with file vs persistence, right?
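A rough reconstruction of the awk classifier discussed at 15:22-15:29 above, fed by the varnishncsa command from T142076#2521949. It is not ema's actual script: the counter names come from the quoted fragments, the "high" threshold (a first-byte offset of 32,000,000 or more) is inferred from the quoted regex, and the high-range pattern is written the way the 15:29 fix implies, so offsets like 40xxxxxx are counted too.

    varnishncsa -m 'RxRequest:GET' -F '%{Range}i %{Content-Length}o %r' -n frontend | awk '
    # $1 = Range request header ("-" if absent), $2 = response Content-Length
    $1 == "-" { norange++; next }
              { range++ }
    # "silly": a Range asking for the whole object, bytes=0-(CL-1)
    /^bytes=0-[0-9]/ && $1 ~ "^bytes=0-"($2-1)"$" { silly++ }
    # "high": first requested byte is 32000000 or beyond (8-digit offsets from
    # 32xxxxxx upward, or any offset of 9 or more digits)
    $1 ~ /^bytes=(3[2-9]|[4-9][0-9])[0-9][0-9][0-9][0-9][0-9][0-9]/ ||
    $1 ~ /^bytes=[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/ { high++ }
    END {
        printf "total=%d norange=%d range=%d silly=%d high=%d\n",
               norange + range, norange, range, silly, high
    }'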
[15:32:37] s/deprecated_persistent/file/ -> no stall
[15:32:49] hmmmm
[15:33:06] and there's high disk I/O during the stall itself
[15:33:13] but we're only basing that on timing really... I wonder if that result can be confused by space-allocation issues or whatever
[15:33:23] right
[15:33:38] your test setup is a VM in labs right?
[15:34:00] today I tried on my workstation
[15:34:04] (real hardware)
[15:34:07] because I think the fallocate() we use in persistent is sensitive to first-allocation (delete the file before starting varnishd)
[15:34:22] and also only works right on certain FS (ext4 is ok)
[15:34:57] so probably to be reliable, we need to be starting on real disk with ext4, make sure to delete the persist file before starting, and make sure any initial io spike is done from the initial fallocate()
[15:35:25] to be sure we're seeing real varnish-logic-stalling as opposed to iowait stalling down in the OS because of space allocation issues or whatever
[15:36:00] interesting test, yes
[15:36:26] it would be better if we had some way to know other than guessing based on the timing
[15:36:33] but I don't know that varnish offers such insight heh
[15:36:55] in any case, solving that mystery maybe isn't that important.
[15:37:10] the important thing is it doesn't stall on "file" for concurrent low-range on a large file, right?
[15:37:34] if it happens to also not-stall on persistent under ideal conditions, that's awesome too, but either way I think I'm still pushing for "file" for the varnish4's
[15:37:46] (I never did finish writing up the rationale though. it's half-done)
[15:38:03] it didn't stall with file this morning I'm pretty sure, let me try again
[15:38:26] the only point to testing persistent under definitely-ideal conditions is to add more leverage for choosing "file" really, but I think we're already there.
[15:38:46] yeah, no stall with -sfile
[15:39:38] honestly, if it doesn't stall on concurrent already-fetched ranges....
[15:39:54] I don't think there's a compelling need to start doing vcl_hash on Range
[15:39:59] especially given the overall Range stats
[15:40:04] certainly not
[15:40:30] we can do what we're doing today and just pass on missed high-range (temporarily, while some low-range/whole-fetch ends up getting the whole object into storage)
[15:41:30] the no stall with -sfile, that's also with stock VCL right? no hacks from our VCL with hash_ignore_busy and/or pass?
[15:42:01] oh yeah, this is all with stock VCL
[15:42:06] ok
[15:42:28] or "almost stock"
[15:42:31] I'm assuming (some intersection of your existing results may confirm?) that a concurrent high-range request still stalls on stock VCL though, right?
[15:42:34] (Range hack)
[15:42:38] https://phabricator.wikimedia.org/T131502#2521610
[15:43:15] that's not stock at all, that does a (pass) on range requests
[15:43:38] on the frontend, yes
[15:43:53] oh, ok
[15:44:04] I should read more, I assumed you were just testing 1 layer
[15:44:23] ok....
[15:44:35] anyways back to earlier question...
[15:44:37] I'm assuming (some intersection of your existing results may confirm?) that a concurrent high-range request still stalls on stock VCL though, right?
[15:45:23] that is: client a requests a low range, client b asks for a high one?
[15:45:26] e.g. if you start req1 with bytes 0-10/1G, and immediately after ask for a few bytes at the end of the file, that's a hit that stalls in stock VCL with -sfile still?
[15:45:38] let's try!
[15:46:24] no stall
[15:46:33] (with -sfile)
[15:46:34] how is that possible?
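A sketch of the 15:45 "let's try!" test, run against a varnishd started along the lines of T131502#2521610. The port, object URL, object size and byte offsets are placeholders; the shape of the test is: start a low-range request on a cold ~1GB object (which makes varnish begin fetching the whole thing), then time a high-range request for the same object and see whether it stalls for the duration of the backend transfer.

    url=http://localhost:81/test/1G.bin   # placeholder large object

    # req1: first few bytes of the cold object; varnish starts pulling the
    # whole object from the backend to satisfy this
    curl -s -o /dev/null -r 0-10 "$url" &

    sleep 0.5                             # optional head start for req1

    # req2: a few bytes near the end of the same object; a long elapsed time
    # here is the "stall" under discussion
    time curl -s -o /dev/null -r 1073741000-1073741099 "$url"

    wait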
[15:46:49] I mean, there's a transfer time for the 1G object to arrive before those bytes can possibly be there
[15:47:03] I really have no idea
[15:48:00] well, requests to the applayer under that scenario may answer some questions
[15:48:09] https://phabricator.wikimedia.org/P3642
[15:49:20] there was a half-second stall there, after a half-second sleep
[15:49:28] does the 1G file come in from the backend in ~1s?
[15:49:45] try turning the middle sleep down to .1?
[15:49:59] I've removed the sleep altogether
[15:50:00] real 0m0.851s
[15:50:20] that has to be the transfer time to load the whole file from the backend
[15:50:28] maybe it's just that that part is much faster with -sfile
[15:50:51] right, it's still stalling then but it's not really as noticeable
[15:51:14] so the question remains, are we still really stalling on low-range too?
[15:51:32] 10Traffic, 06Discovery, 06Operations, 03Discovery-Search-Sprint: Setup LVS for elasticsearch service on relforge servers - https://phabricator.wikimedia.org/T142098#2522624 (10Gehel)
[15:52:31] because we can't assume super-fast transfers in the real world, where the fetch in question might be e.g. esams->eqiad
[15:57:57] bblack: looks like we are still stalling yes. It's not as clear as with -sdeprecated_persistent, but yes
[15:58:18] also, note that with 3.0.6plus we "pay" the price on cache miss instead
[15:59:02] uh xkey meeting soon :)
[16:01:03] yeah
[16:19:28] bblack: Heh. "Error creating new cert :: Too many certificates already issued for: wmflabs.org" -- from using LE on (probably anything?) new in labs :)
[16:21:02] "This is limited to 20 certificates per domain per week."
[16:22:23] This could (in theory) hit production too, if we tried provisioning too many things with LE at once.
[16:30:42] ostriches: talk to Krenair :)
[16:35:14] hi
[16:35:21] I'm here
[16:35:53] bblack: do we have a way to set up LVS on labs support services?
[16:36:35] bblack: I'm setting up the relforge cluster (T142098) and a load balancer in front of it would make sense.
[16:36:36] T142098: Setup LVS for elasticsearch service on relforge servers - https://phabricator.wikimedia.org/T142098
[16:37:00] Krenair: hehe, we requested too many certs for wmflabs.
[16:37:05] ostriches, I know
[16:37:09] Rate limit is 20/domain/week :p
[16:37:09] ostriches, probably me and paladox
[16:37:26] I'm getting lost in the networking part, but it seems that the production LVS servers should not have access to the labs support subnet...
[16:37:35] thing is it considers wmflabs.org to be the domain, rather than beta.wmflabs.org
[16:37:42] wonder if we should ask them to change that
[16:38:18] I was also reading https://letsencrypt.org/docs/integration-guide/, which talks about "larger hosting providers" and the like.
[16:38:27] Basically: people who need certs in bulk, potentially.
[16:40:32] "easier to provide rate limits adjustments if needed"
[16:40:38] so... they might be able to adjust our rate limit?
[16:41:14] I don't know if labs classifies as a "larger hosting provider"
[16:48:23] though we did get into the closed beta
[16:53:14] https://crt.sh/?q=%25.wmflabs.org
[16:53:43] most of those are in SAN
[16:54:10] GoDaddy issued certs for phab-02...?
[16:57:56] 10Traffic, 06Discovery, 06Operations, 03Discovery-Search-Sprint: Setup LVS for elasticsearch service on relforge servers - https://phabricator.wikimedia.org/T142098#2522897 (10Gehel) Reading [[ https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service | documentation ]], it seems that we only...
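Looping back to the Range strategy settled on around 15:40 ("pass on missed high-range"), a minimal VCL 4.0 sketch of that idea; it is not WMF's actual upload VCL. The backend definition is a placeholder and the high-range cutoff mirrors the ~32MB threshold assumed for the awk classifier above.

    vcl 4.0;

    backend default {
        .host = "127.0.0.1";   # placeholder backend/applayer address
        .port = "80";
    }

    sub vcl_recv {
        # Requests starting deep into a large object are passed straight
        # through, so a cold client is not stalled behind a full object
        # fetch; low-range and whole-object requests take the normal lookup
        # path and end up populating the cache.
        if (req.http.Range ~ "^bytes=(3[2-9]|[4-9][0-9])[0-9]{6}" ||
            req.http.Range ~ "^bytes=[0-9]{9,}") {
            return (pass);
        }
    }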
[16:59:49] News to me?
[17:14:39] 10netops, 06Operations, 10ops-eqiad: Replace cr1/2-eqiad PSUs/fantrays with high-capacity ones - https://phabricator.wikimedia.org/T140765#2523010 (10Cmjohnson) 05Open>03Resolved
[17:14:42] 10netops, 06Operations, 10ops-eqiad: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2523011 (10Cmjohnson)
[18:43:48] bblack: https://phabricator.wikimedia.org/T141506#2523206 appears to be another "Chrome 41 weirdness" bug
[18:48:17] oh nice
[20:05:44] 07HTTPS, 10Traffic, 06Operations, 06Release-Engineering-Team: Retire gerrit.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T142131#2523742 (10demon)
[20:27:38] 07HTTPS, 10Traffic, 06Operations, 07Security-General: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298#1105632 (10GWicke) Related: https://tom.vg/papers/heist_blackhat2016.pdf
[20:46:45] 07HTTPS, 10Traffic, 06Operations, 06Release-Engineering-Team: Retire gerrit.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T142131#2524001 (10RobH) I used to habitually revoke old certificates, but I was advised against doing so indiscriminately by @bblack. My understanding is unless the priva...
[22:37:36] 10Traffic, 10Varnish, 06Operations: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2524425 (10GWicke) Related discussion on XKey & purging from today's meeting with @bblack, @Smalyshev, @ema, @mobrovac & myself: https://docs.google.com/document/d/1dIYQTSoJE2DC5aU7_pr4oE59f0QDJ5dlDYIYmwBOyDI/edit
[22:40:03] 10Traffic, 06Operations: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#2524429 (10GWicke) Related discussion notes on XKey & purging from today's meeting with @bblack, @Smalyshev, @ema, @mobrovac & myself: https://docs.google.com/document/d/1dIYQTSoJE2DC5aU7_pr4oE59f0QDJ5dlDYIYmwBOy...