[07:25:51] Traffic, Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (Gilles) upload.beta.wmflabs.org refuses SSL connections right now, I see that it's not on that list
[07:49:57] Traffic, Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705637 (hashar) I guess we only fixed the text cache. Puppet fails on deployment-cache-upload04.deployment-prep.eqiad.wmflabs :( ``` Error: /Stage[main]/Nginx/Package[nginx-full]/ensure...
[07:59:22] Traffic, Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705644 (hashar) I have applied a similar configuration in hiera for deployment-cache-upload04 While installing nginx-extra, the service failed to restart which blocks puppet: ``` nginx...
[08:15:08] Traffic, Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705665 (hashar) In `profile::cache::ssl::unified` I have commented out the `tlsproxy::localssl { 'unified': ... }` to get the Varnish conf updated eg: ``` - new cache_local = vslp.vslp...
[08:24:50] Traffic, Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705702 (hashar) Next error: ``` Notice: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[upload-backend]/Exec[retry-load-new-vcl-file]/returns Command failed with error code 106...
[08:29:44] Traffic, Beta-Cluster-Infrastructure, Operations: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705726 (hashar) ``` # dpkg -S /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_std.so varnish: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_std.so # apt-cache policy varnish varnish:...
[08:39:52] Traffic, Beta-Cluster-Infrastructure, Operations, Patch-For-Review: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705759 (hashar) p:Triage>Normal **Status** https://gerrit.wikimedia.org/r/#/c/386077/4 cherry picked on the beta cluster puppetmaster Puppet and Varnish...
[09:40:01] Traffic, Operations: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#3705842 (ema) Open>Resolved a:ema >>! In T141373#3703459, @BBlack wrote: > Anything left to look at here? I've checked on a text-esams frontend and there's now plenty of...
[09:57:46] [Mon Oct 23 07:26:54 2017] SERVICE ALERT: cp4021;Check Varnish expiry mailbox lag;CRITICAL;HARD;10;CRITICAL: expiry mailbox lag is 2076994
[09:57:51] [Mon Oct 23 18:58:44 2017] SERVICE ALERT: cp4024;Check Varnish expiry mailbox lag;CRITICAL;HARD;10;CRITICAL: expiry mailbox lag is 2049591
[09:59:51] I've looked for mbox lag alerts, during the last 7d these are the only two ^
[10:01:06] both upload, and I can't see what recent changes could cause them
[10:14:30] this is a massive improvement from the past weeks so nice work people :)
[10:16:58] elukey: thanks! still sad though, we thought this was behind us by now
[10:18:04] the specific case of cp4024 might have been traffic-induced, there's been an increase of frontend hfp rate since yesterday https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=59&fullscreen&orgId=1&from=1508694690484&to=1508832512314&var-server=cp4024&var-datasource=ulsfo%20prometheus%2Fops
[10:18:06] varnish 5 will solve all the problems
[10:18:07] :P
[10:18:13] :)
[10:21:19] no, ATS will
[10:27:36] <_joe_> tsk
[10:27:39] <_joe_> that's kubernetes
[10:28:06] <_joe_> or debian packages, or nodejs; I'm not sure which
[10:36:59] I was sure it will be quantum computing...
[10:37:32] but probably that will create all new and unpredictable problems :D
[10:41:45] hfp rate increased on all upload-ulsfo nodes
[10:52:03] openstack
[11:21:50] the hfp are presumably large objects
[11:21:56] (well, the rate of fetching them)
[11:23:40] the hfp graph does line up well with the mailbox ramp
[11:24:40] but not the cp4021 case
[11:26:03] I take that back, there's a lesser tower of hfp that lines up with cp4021
[11:27:51] if it's a case of a popular and not-too-huge file, this could be the sort of thing where exp(-size/c) would have fared better than the hard hfp size limit
[11:34:34] bblack: one such case of upload hfp is https://upload.wikimedia.org/wikipedia/commons/0/08/Porities.jpg
[11:35:10] also somehow at a certain point the backend hashing changes, it's not clear to me why:
[11:35:13] while true; do curl -v https://upload.wikimedia.org/wikipedia/commons/0/08/Porities.jpg 2>&1|grep "x-cache:" ;done
[11:35:25] < x-cache: cp1074 hit/82, cp3038 hit/10, cp3048 pass
[11:35:25] < x-cache: cp1074 hit/82, cp3038 hit/11, cp3048 pass
[11:35:25] < x-cache: cp1074 hit/82, cp3038 hit/12, cp3048 pass
[11:35:25] < x-cache: cp1074 hit/82, cp3038 hit/13, cp3048 pass
[11:35:25] < x-cache: cp1074 hit/82, cp3038 hit/14, cp3048 pass
[11:35:27] < x-cache: cp1074 hit/82, cp3038 hit/15, cp3048 pass
[11:35:30] < x-cache: cp1074 hit/81, cp3049 hit/2, cp3048 pass
[11:35:33] < x-cache: cp1074 hit/82, cp3038 hit/16, cp3048 pass
[11:37:48] you can see the hitpasses with varnishlog -g request -n frontend -q 'Debug ~ "HIT-FOR-PASS"'
[11:37:53] lunch, bbl &
[11:42:44] weekly restart depools?
[11:43:37] and yeah that object is only 302K, so it's not far over the current hard hfp boundary
[11:43:57] exp(-size/c) wouldn't have such a horrendous bump at that mark
[11:44:34] (weekly restart depools, or temporary v->v backend healthcheck failures?)
[11:49:12] ema: move the work you did in https://gerrit.wikimedia.org/r/#/c/379512 over to upload-frontend-only (for now), and have it replace the block that does 4-hit-wonder and 256K=hfp? (maybe for now, if admission param is zero turn on the old block of code there, otherwise replace it?)
[13:13:49] - Debug "VSLP picked preferred backend 4 for key 238487a9"
[13:13:54] - Debug "VSLP picked alternative backend 11 for key 238487a9 in healthy"
[13:14:13] here's what happens when the backend changes ^
[13:24:00] that must be a temporary healthfail
[13:25:34] yeah I thought so too but `backend.list -p` on the frontend instance seems ok
[13:27:27] oh
[13:27:36] https://github.com/wikimedia/operations-software-varnish-libvmod-vslp/blob/master/src/vslp_dir.c#L392
[13:28:01] an alternative backend is chosen with a certain probability apparently
[13:29:33] https://github.com/wikimedia/operations-software-varnish-libvmod-vslp#void-set_rampup_ratioreal
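A rough way to put numbers on that "certain probability" is to rerun the 11:35 curl loop a fixed number of times and count which second-layer node shows up in x-cache; with vslp's rampup behaviour one would expect something like a 90/10 split between cp3038 and cp3049 for this object. This is only a sketch: the sample size of 100 is arbitrary, and the awk field position assumes the three-entry x-cache format shown in the paste above.

```
for i in $(seq 1 100); do
  # same test object as above; -s keeps curl quiet, -v still emits the response headers on stderr
  curl -sv -o /dev/null https://upload.wikimedia.org/wikipedia/commons/0/08/Porities.jpg 2>&1 \
    | grep -i '^< x-cache:'
done | awk '{print $5}' | sort | uniq -c   # $5 is the second-layer (esams) cache node
```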
[13:30:01] yeah but this is cache_upload?
[13:30:10] it is, yes
[13:30:11] oh wait, sorry, I was remembering that from shard
[13:32:10] we could maybe tune rampup_ratio and/or rampup_time, but given we're moving away from vslp anyways maybe it's not worth the effort. the values aren't insane.
[13:32:26] and indeed cp3049 is chosen around 10% of the times in my test
[13:32:35] bblack: true!
[13:32:35] (but I probably would've picked a smaller percentage and a longer time, e.g. replace 10%/60s with 5%/180s
[13:33:01] or really, different parameters for the different warming types
[13:33:34] maybe something like 1% for pre-warming the alternative under normal conditions, and 10%/60s for pre-warming a just-unfailed backend before returning to normal.
[13:34:00] anyways, maybe we look towards tuning those better for "shard", it has some similar tunables
[13:36:09] yeah, shard has it split more-appropriately
[13:38:43] starting from the defaults (which are ramp/warm disabled), one could set e.g.: "x.set_warmup(0.01); x.set_rampup(180s);", and the behavior would be that under normal conditions 1% of reqs pre-warm the alternative backend, and just after repooling from unhealth, the percentage of reqs to the newly-healthy instance ramps linearly from 0-100 over 180s.
[13:40:31] (for bonus points, it would be nice if the warmup was in parallel. instead of diverting 1% of requests, it should mirror 1% of requests as a background fetch whose results are ignored, while still fetching all client queries from the right place)
[14:46:48] ema: so, complex review stuff on the exp(-size/c) patch:
[14:48:23] 1) Since we do have some substantial differentials in FE memory cache sizing, probably better than having explicit tunables would be to auto-calculate "c" based on the formula given by dsb (search ticket for "A simple linear fit on this")
[14:48:38] in which case the hiera param should basically be an on/off switch?
[14:49:05] but note that also, using "bc" I was unable to reproduce that formula matching his optimal-size graph either, maybe I misunderstood the units
[14:50:52] 2) The new code should handle the "no content length" case, I think. Unless we plan to first deploy a universal patch to disable streaming on the backendmost if no CL header is received, and validate that this results in useful CL headers for all requests coming back out through all frontends.
[14:51:34] but even then, I think it's probably safer to have this block handle it correctly, in case of future change. Maybe if CL header is missing, set the size param to 0, which should make the probability 1.0 at the end?
[14:53:31] 3) We still need to think about the coalescing behaviors of all of this, I think? (unless you already have?). The potential problem with our naive size-cutoff hfp was that large objects would always coalesce their miss requests without an hfp, resulting in a bunch of delays for clients.
[14:54:16] for the exp(-size/c) solution, I guess we're expecting that if enough clients are stalling for an object through a single frontend, popularity will eventually get it over the exp(-size/c) filter before the stalls are too awful?
[14:55:51] (but that might not be true in all cases that matter. You could image a very large file that's not very popular (say, a 1GB file that gets fetched usually ~1/minute or something). but the transfer times take a while, the FE basically-never caches it, but anytime say 2 or 3 clients request concurrently, they're going to get stuck on bad coalescing behavior?)
[14:56:25] s/image/imagine/
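To put rough numbers on that 1GB example: under exp(-size/c) the per-miss admission probability for a very large object is effectively zero, so it would indeed stay uncached and keep coalescing, while the ~302K object from this morning still gets a small but non-zero chance on every miss, letting popularity eventually win. The c value below is purely an assumption for illustration, in the ~64K ballpark that comes up later in the discussion.

```
# per-miss admission probability under exp(-size/c), assuming c = 65536 for illustration only
awk 'BEGIN {
  c = 65536;
  printf "302KB object: %.4f\n", exp(-302000 / c);      # small, but a popular object can still get in
  printf "1GB object:   %.4g\n", exp(-1000000000 / c);  # effectively never admitted
}'
```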
[15:02:13] 4) The whole new block may need some protection against cases that don't matter? e.g. it shouldn't run on requests that are already a pass, etc? Like the ones on the 4-hit-wonder code? (requires X-CDIS=="miss" + status=200)?
[15:09:15] "wcalc" seems to get closer, but still doesn't look right
[15:09:39] 1) I think it's more convenient to keep 'c' in a hiera setting, that would allow us to easily experiment with different values on different hosts
[15:10:01] e.g. based on https://phabricator.wikimedia.org/F4703367 we'd expect for a cache size ~200GB to get a "c" somewhere in the ballpark of 64K
[15:10:04] but:
[15:10:10] bblack@alaxel:~$ wcalc -c '((200)^0.9064392)/(2^-18.16135)'
[15:10:10] ~= 35715349.1576561881012763615123441101
[15:10:24] (35MB)
[15:11:07] making (200) bigger (for other small units than GB) just makes the answer bigger, of course
[15:12:27] 2) yeah, if there's no CL I agree that we should set size to 0
[15:12:31] re: 1: but there's an optimal value for any given fe_mem_gb. either we manually calculate that based on fe_mem_gb and plug it in per-cache-hardware-type and keep it in sync with fe_mem_gb changes, or we auto-calculate it.
[15:13:39] (well, to backtrack on myself: there's an optimal value given dsb's test dataset. obviously, patterns can evolve over the long term. but IMHO that's just a reason to re-run the simulations, not to try to hand-tune the value and observe the probably unobservably-lost-in-noise changes to the cacheability graphs)
[15:15:16] if the desire is to also have a knob that can affect overall pass-rate, it should be a separate knob IMHO
[15:15:47] (as in "I want to turn this knob to turn down cache admission on this backend because I think it will save us from mailbox meltdown")
[15:21:19] apparently, the units in the optimal c calculation are TB
[15:21:37] ah!
[15:21:44] bblack@alaxel:~/repos/puppet$ wcalc '(0.1^0.9064392)/(2^-18.16135)' = 36364
[15:21:45] bblack@alaxel:~/repos/puppet$ wcalc '(0.2^0.9064392)/(2^-18.16135)' = 68161.2
[15:21:52] ^ that gives answers that look close to the graph
[15:22:38] (it's a linear fit to the slightly-non-linear graph, but close enough)
[15:25:58] current fe_mem_gb sizes in prod fall into 3 sets based on the remaining host hw configs and our current calculations:
[15:26:05] err, 4 sets:
[15:26:55] (cumin output - noise):
[15:26:57] (46) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1071-1074].eqiad.wmnet,cp[3030-3049].esams.wmnet
[15:27:01] 121G
[15:27:05] (19) cp[1045,1048-1055,1058,1061-1068,1099].eqiad.wmnet
[15:27:10] 77G
[15:27:14] (3) cp[3007-3008,3010].esams.wmnet
[15:27:18] 11G
[15:27:20] (12) cp[4021-4032].ulsfo.wmnet
[15:27:24] 209G
[15:30:52] plugging those through the formula gives "c" values:
[15:30:55] 209 = 69427.1; 121 = 42303.4; 77 = 28083.2; 11 = 4813.01
[15:31:30] (but that's c values for cache_upload's traffic flow at a point in the past, probably not very relevant to text/misc)
[15:32:03] but upload has 209G, 121G, and 77G nodes presently
[15:39:30] bblack@alaxel:~$ ruby -e 'fe_mem_gb = 121; c = ((fe_mem_gb/1024.0)**0.9064392)/(2.0**-18.16135); print "#{c}\n"'
[15:39:33] 42303.425163580454
[15:39:37] ^ to stick it in an erb template :)
[15:40:10] :)
[15:42:19] or arguably, if you want the hand-tunable to reduce cache entry + optimal value all in one
[15:42:42] you could calculate "c" as above, and then multiply by the hiera admission param as it stands now
[15:43:17] or something like that
[15:45:16] admissionprob = exp(-clen/c) * adm_param
[15:46:04] then adm_param = 0.5 would halve the normal admission rate. values over 1.0 would increase it (eventually leading to >100% probability, but the comparison still works)
[15:48:15] we still should have per-cluster tunables for the optimal c calculation too, but there's maybe not much point parameterizing them until we can re-run new simulations per-cluster
[15:48:59] (cache_upload best-known params being rate=0.9064392 and base=-18.16135)
[15:50:45] rewinding a bunch to point (3) earlier about coalescing... you could make the argument that all frontend reqs should have req.hash_ignore_busy = true
[15:51:02] leave coalescing to the backend instances, so it still gets done before it crosses WAN links or reaches an applayer.
[15:51:28] but having some extra parallel reqs, briefly, from local_fe->local_be, isn't much of a cost for ensuring no stalls while sorting out optimal cacheability
[16:14:12] yeah, no fe->be coalescing seems reasonable!
[16:35:19] Traffic, Operations: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173#3707139 (BBlack)
[17:02:40] HTTPS, Traffic, Operations, Parsoid, VisualEditor: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3702250 (Deskana) >>! In T178778#3702293, @PlanetKrypton wrote: > This appears to be the response / request and it's accompanying error > > https:...
[17:07:41] bblack: patch updated addressing (4) and fixing/adding unit tests. Tomorrow I'll carry on! o/
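Putting the afternoon's pieces together as a sketch: compute c per host from fe_mem_gb using the cache_upload fit (rate=0.9064392, base=-18.16135, with the cache size expressed in TB), then apply exp(-clen/c) multiplied by the proposed adm_param knob. The adm_param value and the 302K content-length below are only example inputs, and in puppet the c part would presumably live in the erb template along the lines of the ruby one-liner above.

```
adm_param=1.0     # hypothetical hand-tunable: 0.5 halves admission, >1.0 increases it
clen=302000       # example Content-Length in bytes
for fe_mem_gb in 209 121 77; do
  awk -v gb="$fe_mem_gb" -v clen="$clen" -v adm="$adm_param" 'BEGIN {
    c = ((gb / 1024.0) ^ 0.9064392) / (2.0 ^ (-18.16135));    # optimal c for this host size
    printf "fe_mem_gb=%s  c=%.1f  admission prob=%.4f\n", gb, c, exp(-clen / c) * adm;
  }'
done
```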
[17:14:50] ema: ok thanks for working on this brutally-ugly problem and listening to my rambling :)
[17:52:34] HTTPS, Traffic, Operations, Parsoid, VisualEditor: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3707527 (PlanetKrypton) @Deskana I had plugin temporarily disabled so people didn't try to use it. Try now.
[19:51:53] bblack: have you seen that the recommended browser for XP doesn't seem to work? https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Firefox_on_XP
[20:13:18] Traffic, MediaWiki-Authentication-and-authorization, Operations, Security-Core: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604#3708032 (BBlack) Bump, I'd like to see this happen, it seems like a pretty healthy and cheap layer of prote...
[20:14:33] MaxSem: it should work, probably there's other factors or misunderstandings going on there...
[20:15:16] FF-on-XP ships its own crypto libraries wholesale, it doesn't use the OS for any of it (actual crypto, cert chain verification, etc - unlike the last releases of Chrome for XP, which still used the OS for the cert chain bits)
[20:19:12] maybe, something is broken with cipher suite negotiation so it picks 3DES?
[20:21:14] I replied there with some browser-test URLs to try
[20:22:01] I don't think anything's broken with ciphersuite negotiation on our end, we track that stuff pretty aggressively. It's entirely possible there's ancient/broken "web security" https proxy software installed on the machine, and/or it's compromised by malware which does the same already.
[20:24:56] checking randomly for sanity, I can see a healthy rate of requests flowing live right now with the expected UA+Cipher indicators on our caches, e.g.:
[20:24:59] - ReqHeader X-Connection-Properties: H2=1; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-CHACHA20-POLY1305; EC=X25519;
[20:25:02] - ReqHeader user-agent: Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0
[20:25:12] ^ That's FF52-on-XP, negotiating the most-modern TLS possible
[20:42:56] :)
[22:54:43] Traffic, Beta-Cluster-Infrastructure, Operations, Patch-For-Review: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3708461 (Krenair) Are we okay to close this now? Do we want to look into what caused the initial varnish upgrade?
[23:45:32] Traffic, Beta-Cluster-Infrastructure, Operations, Patch-For-Review: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3708577 (greg) That's a good question (re what caused the varnish upgrade) so I guess we should figure that out. The timing seems oddly non-deterministic (from my u...
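Circling back to the FF-on-XP question from 20:24: a hypothetical way to repeat that spot-check on a cache host is to sample live frontend requests whose User-Agent looks like Firefox 52 on XP and print the TLS properties they negotiated. The instance name and the two headers are the ones quoted above; the exact VSL query is a sketch, not a recipe.

```
# sample FF52-on-XP client requests on a frontend and show which TLS version/cipher they negotiated
varnishlog -n frontend -c -i ReqHeader \
  -q 'ReqHeader:User-Agent ~ "Windows NT 5.1" and ReqHeader:User-Agent ~ "Firefox/52"' \
  | grep -iE 'x-connection-properties|user-agent'
```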