[00:04:40] HTTPS, Traffic, Operations, Wikimedia-Blog: Change automatic shortlink in blog theme - https://phabricator.wikimedia.org/T165511#3365971 (Tbayer) >>! In T165511#3362318, @Volker_E wrote: > That's what I expected. The shortlink didn't seem to be reason for the error. > As I've said, I didn't have...
[04:42:11] Traffic, Analytics, Operations: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3366282 (Nuria) >These metrics are 429s emitted from RESTBase, and not Varnish. Right, that is why we should continue to see throttling on the RESTBase end. Do take a second...
[06:33:38] Traffic, netops, Operations: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3366437 (Marostegui) We need to make sure we downtime the following DBs in EQIAD as they have cross replication with some of the dbs affected here, so we can avoid pages like we had yesterday for cross...
[07:13:53] Traffic, netops, Operations: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3366501 (Joe)
[07:14:11] Traffic, netops, Operations, User-Joe: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3365105 (Joe)
[08:28:29] <_joe_> ema: pybal has lost connection to etcd at least on lvs2003/2006
[08:28:37] <_joe_> I guess after the network maintenance yesterday
[08:28:46] <_joe_> we *need* to fix that damn thing
[08:29:01] <_joe_> can you look in the logs maybe?
[08:29:13] <_joe_> I'm fixing something else related to this atm
[08:29:18] <_joe_> and then doing a rolling restart
[08:29:34] <_joe_> but if you can find in the logs the stacktrace from yesterday, that would save me some time
[08:29:36] _joe_: oh this might be the reason for elukey's wtf yesterday (kafka2001 not getting depooled IIRC)
[08:29:40] <_joe_> yes
[08:30:44] _joe_: I'll take a look at the logs
[08:30:53] <_joe_> thanks
[14:15:03] Traffic, Operations, Performance-Team: Upgrade to Varnish 5 - https://phabricator.wikimedia.org/T168529#3367415 (Gilles)
[14:16:53] gilles: varnish's h2 is still very experimental afaik
[14:17:16] I think it's experimental in the same sense that all h2 implementations are
[14:17:22] they explain the semantics of that in a blog post
[14:17:36] https://varnish-cache.org/docs/5.0/whats-new/changes-5.0.html says "Very Experimental HTTP/2 support" and disabled by default
[14:17:45] https://varnish-cache.org/docs/trunk/whats-new/changes-5.1.html says "HTTP/2 support is better than in 5.0, and is now enabled and survives pretty well on our own varnish-cache.org website, but there are still things missing, most notably windows and priority, which may be fatal to more complex websites."
[14:18:16] and "We expect HTTP/2 support to be production ready in the autumn 2017 release of Varnish-Cache, but that requires a testing and feedback from real-world applications."
[14:18:43] ok, might be worth upgrading to varnish 5 ahead of that, though
[14:19:01] sure ok
[14:19:19] maybe, maybe not, Brandon would know more for sure :)
[14:19:22] without necessarily leveraging http/2 support yet
[14:20:08] I need to do more research but I suspect that our current stack might not handle well what's supposed to make http/2 fast, with nginx proxying stuff
[14:20:27] it certainly won't hurt to try an alternative http/2 implementation like this one and see what performs better
[14:20:40] like what?
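(Back on the pybal/etcd thread from earlier this morning: a minimal sketch of the log check _joe_ asks for. The systemd unit name "pybal" and the file-log path are assumptions, not confirmed in this conversation.)

    # journal entries from around yesterday's network maintenance,
    # looking for the etcd connection stacktrace
    sudo journalctl -u pybal --since yesterday | grep -i -B2 -A20 traceback

    # the flat-file log (found to be empty later in this conversation)
    sudo tail -n 100 /var/log/pybal.log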
[14:20:48] priorities, content interleaving
[14:21:07] all of that requires the server to be smart to get the best performance
[14:21:18] most http/2 implementations are pretty naive about it
[14:21:37] and I don't know how smart nginx can be when it's proxying stuff from varnish (over http/1.1 I presume)
[14:22:29] yeah, but somehow I doubt varnish will catch up with nginx anytime soon
[14:22:35] I suspect that in a worst-case scenario we're just serving stuff in the same order as HTTP/1.1, wrapped in a thin layer of HTTP/2
[14:22:50] which kind of defeats the purpose for performance
[14:23:17] not really, it's still an improvement on multiple fronts
[14:23:23] e.g. tcp slow start
[14:24:34] we don't measure it with RUM, but our synthetic testing shows HTTP/1.1 being faster on our real content
[14:24:36] IIRC we measured the deployment of SPDY/H2 at the time and it was a significant perf improvement
[14:24:39] I'm writing up a detailed reply to the ticket in any case. V5 touches a lot of thorny intersections for us :)
[14:25:30] _joe_: pybal on lvs3001 seems to also be affected by the OpenSSL error we've seen earlier this morning
[14:26:04] what did we measure on real users when we switched?
[14:26:18] page load time, I think?
[14:26:30] I don't remember much, I may have to do some archaeology on my inbox
[14:26:42] ori was involved
[14:27:26] <_joe_> ema: then restart it?
[14:27:29] also, didn't you guys disable it in prod for some time to take measurements not too long ago?
[14:27:48] _joe_: I was just about to reboot it, but it seems to be in an interesting state
[14:28:08] _joe_: pybal.log is empty, last logs in journal Jun 19 20:42:16
[14:28:41] <_joe_> gilles: < gilles> I suspect that in a worst-case scenario we're just serving stuff in the same order as HTTP/1.1, wrapped in a thin layer of HTTP/2
[14:28:49] <_joe_> that's how I understand it
[14:29:03] <_joe_> ema: uhm
[14:29:12] pybal.log being empty might be the reason why we haven't noticed the error earlier today
[14:29:22] <_joe_> yup
[14:29:30] paravoid: I don't remember that
[14:29:38] https://phabricator.wikimedia.org/T125979
[14:29:41] just found it :)
[14:31:06] also https://phabricator.wikimedia.org/T125208 sounds similar to what you're saying now, but closed/resolved?
[14:31:10] re: HTTP/2 and ordering and such, even if we're serving the same stuff in the same order, HTTP/2 brings at least the ability to mix multiple download streams in parallel
[14:31:39] IIRC Chrome had some issues with stream prioritization, while FF was doing better in that regard
[14:32:01] also, while on the server side nobody really has the generic smarts to do pushes efficiently, etc... I believe clients already have some advantageous uses of HTTP/2 features to prioritize streams by content type (e.g. prioritize js/css)
[14:32:23] bblack: have we verified that this interleaving actually happens with our setup?
[14:32:31] Chrome ends up using all that concurrency to load images ASAP, which can delay the loading of render-blocking styles & as a result delay first paint
[14:32:55] FF correctly prioritizes styles
[14:33:31] I suspect when the app/server-driven push smarts get broader adoption, it will basically be via appserver-sent headers anyways. e.g. MW when sending the main page body also sends X-H2-Push: /foo/bar.png, /baz/asdf.js, and the H2 implementation (e.g. nginx) will then pre-fetch the URLs to start pushing them at the client.
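(A rough sketch of that appserver-driven shape, assuming an nginx TLS terminator in front of Varnish. The X-H2-Push header above is hypothetical; the mechanism nginx eventually shipped, http2_push_preload, appeared in releases newer than what was deployed at the time and keys off a standard Link: rel=preload response header from the backend instead. The listen port and backend address below are made up.)

    server {
        listen 443 ssl http2;

        # turn "Link: </w/load.php?...>; rel=preload" headers emitted by the
        # application into HTTP/2 server pushes toward the client
        http2_push_preload on;

        location / {
            # nginx still talks plain HTTP/1.1 to the local varnish frontend
            proxy_pass http://127.0.0.1:80;
            proxy_http_version 1.1;
        }
    }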
[14:33:34] paravoid: those were both about SPDY, it sounds like we never verified the same hypothesis with HTTP/2
[14:33:43] but that's just a guess - either way it means it doesn't much matter which *level* of our stack that happens at
[14:33:45] I'm not saying that we should, though
[14:34:02] I'm more interested in pitting HTTP/2 implementations against each other, if we start having alternatives
[14:34:07] and yes, the H/2 parallelism is obvious in the diff on browser waterfall graphs
[14:34:25] e.g. you don't see it staircasing out one image load at a time, you see it load many in parallel with overlapped times
[14:34:51] gilles: that experiment was SPDY vs. HTTP/1, not SPDY vs. HTTP/2
[14:35:10] push is only workable when the new standard for browser cache hash summaries is there, and even then the perf benefits are a bit unclear compared to the link preload header for example
[14:35:31] gilles: so I think it probably holds, considering H2 is a superset of SPDY
[14:35:38] if anything H2 should perform even better
[14:36:07] paravoid: it's actually very different, SPDY is stone age compared to H2 and H2 can be implemented in very different ways in those areas that matter regarding priorities, etc.
[14:36:14] right, IIRC H2 was basically the same as SPDY, with the major diffs being the use of ALPN, and the HPACK header compression
[14:36:17] SPDY was much simpler, so less divergence between implementations
[14:36:48] afaik it's what bblack said
[14:36:49] they were so similar that ~90% of the diff for nginx "implementing" http2 was just s/spdy_/h2_/ on their SPDY implementation :P
[14:36:52] HPACK and ALPN
[14:37:21] and I think there are like 3-4 implementations in total?
[14:37:25] but that's probably because they implemented H2 in the dumbest way possible
[14:37:38] where all these possibilities to be smarter aren't leveraged
[14:37:39] nghttp2 is the biggest one, I think even apache is using that one
[14:37:49] nginx has its own, and varnish is writing its own (and lagging behind)
[14:38:05] h2o is probably its own implementation
[14:39:21] http://blog.kazuhooku.com/2015/12/optimizing-performance-of-multi-tiered.html
[14:39:46] yeah, we don't do push yet
[14:40:44] fwiw, https://phabricator.wikimedia.org/T124966 has measurements on the effect of inlining css
[14:41:42] we've discussed push quite a bit with our team and we're skeptical it would be that useful for what we do. certainly not without cache manifests
[14:41:45] push should be able to achieve similar timings
[14:44:44] so I think we're discussing a few different things here, right
[14:44:57] one (minor one) is varnish 5, independently of h2
[14:45:44] second is whether h2 as currently implemented here is slower than http/1.1
[14:46:14] third is how h2 implementations compare with each other performance-wise, with the same feature set that we support now
[14:46:35] yes, that last one is what I'm most interested in
[14:46:48] and fourth is whether additional h2 features, such as push, perhaps with the help of backends, could yield further improvements
[14:47:58] push as it is ignores browser cache. browsers can abort a request when they realize that it's in cache, but with the round trip that's still a significant amount of data sent for nothing before it gets aborted. so it's almost guaranteed to be worse than link preloads
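(For contrast, the preload alternative gilles refers to is just a response header, which the browser checks against its own cache before deciding to fetch; the asset path below is made up.)

    Link: </example/styles.css>; rel=preload; as=style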
[14:48:25] cache digests, which are a solution to that problem, are still in a very early draft stage and not supported by any browser
[14:48:35] it's only an option for service workers at the moment
[14:48:58] I think we should revisit push when we can do cache digests one way or another
[14:49:29] h2o hacks around this with cookies
[14:49:42] do you really want more cookies?
[14:49:46] push should be fairly similar to inlining in terms of performance
[14:50:03] I don't really love this kind of hack in general
[14:50:05] unconditional push, that is
[14:50:08] right, so we're circling back to content composition
[14:50:44] in any case, don't focus on push, I think (4) is just whether we can tune things across layers to yield further perf improvements
[14:50:51] and maybe the answer is "no" or "not yet", ok :)
[14:51:13] https://h2o.examp1e.net/benchmarks.html
[14:51:19] IIRC the Chrome folks were also working on fixing prioritization on their end
[14:51:20] always to be taken with a pound of salt, but still
[14:51:24] another thing may be serving js/css with higher priority
[14:52:23] yes, we can think about things like that where we ignore some of what the client is requesting in terms of prio and do what we think is best
[14:52:32] ideally h2 implementations should be scriptable in that respect
[14:53:31] I think that's still a bit pie-in-the-sky at the moment
[14:53:54] servers and browsers alike are still trying to figure out a decent default implementation, let alone making it fine-tunable
[14:54:00] before we start embarking on all of that, I think it'd be useful to establish the baseline though
[14:54:13] h2 appearing as slower than http/1.1 in your tests sounds odd
[14:54:41] I think it's 99% likely to be an oddity of our synthetic testing
[14:54:45] I doubt that's the case but that's just a hunch at this point, can't back it up with numbers
[14:54:51] but may be worth tracking down further
[14:54:56] which is why I don't think it's worth investing time in another live experiment like the SPDY one
[14:55:09] there are two ways to look at speed: a) getting bytes across the wire (h2 wins), and b) time to usable page (h2 currently loses in some cases)
[14:55:39] right, but before we start testing out other implementations or features, perhaps we should fix our synthetic testing?
[14:55:51] i.e. have a benchmark we can trust
[14:56:12] we're working on that right now, we have an alternative to webpagetest in the works
[14:56:23] and this summer we're going to improve/fix navigationtiming in various ways
[14:58:53] gilles: on the rest of the non-H/2-ish topics around V5: https://phabricator.wikimedia.org/T168529#3367560 (I donno what happened to our phab bot updates)
[14:59:02] we can also think about benchmarking the implementations with our actual content, in the same way h2o ran its own
[15:04:47] bblack: s/Pound/stud/
[15:06:01] makes sense to me otherwise :)
[15:07:55] oops, I fixed a few stupid typos/thinkos since too, I didn't see that one heh :)
[15:13:04] also there's something to be said about Varnish Software's and Varnish's divergence
[15:13:16] and phk's stance on H2 and TLS
[15:13:45] they still don't support, nor have plans to support, TLS on the backend, and even in the 5.0 changelog you can see the passive-aggressiveness with which he's approaching this
[15:13:58] sorry, 5.1
[15:14:03] "To enable HTTP/2 you need to param.set feature +http2 but due to internet-politics, you will only see HTTP/2 traffic if you have an SSL proxy in front of Varnish which advertises HTTP2 with ALPN."
[15:14:56] I'd even go as far as to say that backend TLS is even more important than TLS termination
[15:15:12] yeah, that too, I should've added it as well
[15:15:36] yeah you mentioned multiple long-term issues, I think that covers it :)
[15:15:36] it's another reason for starting at the back with ATS, too
[15:16:12] I'm just mentioning it because I'm not sure if it's ultimately sourced in the open core model
[15:16:17] or just phk's stubbornness
[15:16:41] I think they've said before that they implemented outbound TLS in the commercial product, but not open source
[15:16:46] yeah
[15:16:53] but that's different people right :)
[15:16:58] yeah
[15:20:07] hey, so lvs3001 is not rebooting properly because of issues with one drive T166965
[15:20:08] T166965: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965
[15:20:17] right
[15:20:18] the error message from the BIOS is:
[15:20:19] There are offline or missing virtual drives with preserved cache. Please check the cables and ensure that all drives are present. Press any key to enter the configuration utility.
[15:20:49] godog and I tried to clear the cache to no avail
[15:21:10] any suggestions?
[15:21:13] well, I don't think the hw raid is actually in real use here, I think it's configured as 2x 1-drive raid arrays
[15:21:44] I think your only recourse to get booting again is probably to delete the failed logical drive on the raid controller setup page?
[15:22:07] (and assuming the actual disk is dead, that's pretty much it and we're booting on one drive for now)
[15:23:16] ema: cleared which cache?
[15:23:26] ema: there's a submenu somewhere to discard the preserved cache
[15:23:41] the controller preserved cache, though at reboot the controller still thinks it has the preserved cache there
[15:23:46] right
[15:23:53] I've left the console in the UI btw if someone wants to take a look
[15:24:10] why on earth did we do software raid on top of single-disk hardware raids? :)
[15:24:10] it seems the failed disk (sdb) isn't always even detected by the controller
[15:24:34] paravoid: I asked myself the same when I saw it
[15:24:39] paravoid: who knows! I don't know why we ever buy or configure hw raid for simple dual root disks TBH
[15:25:09] probably I did that in the last re-setup of lvs3xxx, but I donno (set up partman + raid controller to use sw rather than hw)
[15:26:48] but then again, when I look at the partman lines, it seems like we've been moving in the opposite direction, perhaps inconsistently:
[15:26:53] files/autoinstall/netboot.cfg: lvs100[7-9]|lvs101[012]|lvs2*) echo partman/flat.cfg ;; \
[15:26:56] files/autoinstall/netboot.cfg: lvs100[1-6]|lvs[34]*) echo partman/raid1-lvm.cfg ;; \
[15:27:18] (the new lvses in codfw + eqiad use "flat", which is for hardware raid, and the older lvses in eqiad + esams + ulsfo all use sw raid)
[15:29:39] or LVM for that matter heh (I don't know why we use it in a case like this vs a simpler md-only setup)
[15:30:05] LVM is great for flexibility because you want to make some big space able to be re-provisioned in other/new ways at runtime as things evolve
[15:30:17] but again if we just have a simple host with no real storage needs and dual root disks, what's the point?
[15:31:40] so what's the plan with that disk?
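(A sketch of the controller-side cleanup discussed above, using MegaCli; the binary name/casing, adapter number and logical-drive number are assumptions for lvs3001.)

    # list any preserved (pinned) cache left behind by the offline logical drive
    megacli -GetPreservedCacheList -aALL

    # discard the preserved cache for that logical drive (assumed to be L1 here)
    megacli -DiscardPreservedCache -L1 -a0

    # if the disk is really gone, delete the failed single-disk logical drive
    # so the controller stops blocking the boot, as bblack suggests
    megacli -CfgLdDel -L1 -a0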