[01:52:16] 10Traffic, 10MobileFrontend, 6Operations, 7Regression, 7user-notice: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2097986 (10Krinkle)
[02:10:18] 10Traffic, 10MobileFrontend, 6Operations, 7Regression, 7user-notice: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2098016 (10Jdlrobson) I can replicate this locally n...
[08:58:42] bblack: I finally found the time to work on a systemtap script to detect NPN/ALPN provided by clients
[08:58:45] https://phabricator.wikimedia.org/P2719
[09:00:10] bblack: to run it we need systemtap-runtime and libssl1.0.0-dbg installed. I've tested it on cp1008 and it works. Can we try it on a real system? :)
[10:38:38] ema, bblack (whenever you have time): https://dpaste.de/jezo/raw - atm I added support to log HIT/MISS/PASS vcl tags and TTFB, but we might want to have more. I also checked what we are using atm, and I can't find anything except the two mentioned.
[10:47:20] also, in other news: https://news.ycombinator.com/item?id=11217477 (Nginx reverse proxies retry PUT/POST/DELETE on response timeout by default)
[11:31:05] 10Traffic, 10MediaWiki-Uploading, 6Multimedia, 6Operations, 10Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2098525 (10zhuyifei1999) A possible workaround is to use async chunked uploading, but pywikibot does not yet support so.
[13:00:36] silly question, as I haven't been paying close attention
[13:00:45] what's the status of varnish 4 + persistent backend?
[13:01:04] have we tried it/are we trying it?
[13:01:35] 10netops, 10Monitoring, 6Operations, 10Scap3 (scap3-adoption): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#2098628 (10faidon)
[13:02:46] 10netops, 10Monitoring, 6Operations, 10Scap3 (scap3-adoption): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#2096359 (10faidon) Trebuchet was super broken for a while and the version we currently run is straight out of git — so I would be more than happy to switch to scap3 ASAP. Wh...
[13:16:22] paravoid: haven't really tried it, kinda assuming it still works for now
[13:17:00] heh
[13:17:13] iirc they've renamed the backend to deprecated_persistent or something
[13:17:27] I also wonder about all of our woes with it, like the mmap() hacks
[13:17:40] something to think about/verify before we propose a v4 goal for text/upload :)
[13:19:14] paravoid: I'm still more on the track that we can probably get away without it.
I'd rather not be forced into that decision, but still :P
[13:19:36] heh
[13:20:06] I had a long rambling conversation with ema about it the other day trying to outline all the pros and cons in more detail, and walk through the possible bad scenarios where persist could've saved us but we're more-screwed without it
[13:20:16] that whole line of thinking still needs more work and research though
[13:21:26] right
[13:21:52] the TL;DR of that conversation, though, is that I strongly suspect there are only very very improbable scenarios where it helps, and we can probably have mitigation plans for those rare scenarios, and the upsides are rather large
[13:22:52] (getting off deprecation, avoiding all the excess patches and puppetization that worry about persist, no more stale-leak from brief restarts, XKey as it exists today isn't really persist-compat)
[13:23:38] working around xkey + persist without doing some heavy patching to xkey would mean some operational tradeoffs and changes too
[13:23:45] (xkey's index is just a hashtable in memory)
[13:25:37] so the basic paths forward on xkey are: (1) use it as-is with persist, standard procedure on single-node restart is wipe the cache because the xkey index is gone too, and if we suffer mass cache restart somehow, we'll have to accept that we have stale content problems for a few days while we slowly wipe/ban up to when xkey restarted
[13:26:09] (2) patch it to persist its index, either safely and separately and in reasonable sync with persist storage (ewww) or have it actually store the index inside cache storage somehow (ewwww)
[13:26:23] (3) dump persistence
[13:27:13] The "ewww"s don't necessarily mean ugly, just lots of unnecessary work
[13:29:11] and to run back through some of the scenarios where we think persist makes a big operational difference: we're mostly down to two basic categories: someone can remote-crash varnish daemons or kernels quickly, or we need to get everything rebooted very quickly for a very critical sec patch
[13:30:39] the varnish daemons one is unlikely (unlikely enough that it's not worth the trade on its own, IMHO): if they only crash one layer the infra will already protect itself with the other layer. It would have to be a bug where they can push the bug-inducing request down through the whole stack to MW/services, and then on response it crashes each layer after they successfully transmit to the next
[13:30:45] layer up with the bug still intact and crashy.
[13:31:50] if they're remote-crashing kernels with some simple packet... well (a) those are rare and we can't be perfect, and (b) even with persist they might recrash us over and over, we're kinda screwed either way if we can't filter or prevent that once we see it happening.
[13:32:35] (and keep in mind, they'd have to do it from a globally-distributed botnet source to even reach all the caches too)
[13:33:14] if it's intentional restarts for sec bugs: I think we can put a window on that and operationally test down to it for verification.
[13:33:48] e.g. today I'm pretty confident based on past experience that we could restart->wipe everything over a period of, say, 3 days and not have some massive fallout in terms of excess MW load refilling caches.
[13:34:17] the real acceptable limit is probably lower than that. We could test through that scenario at progressively lower values. IMHO if we can do it in 12 hours, that's a reasonable response time for a patch.
[13:35:18] there are going to be rare scenarios where we'll wish we had persist, but IMHO they're rare on the order of once in X years, vs the cost of continuing to support persist just for those rare scenarios...
[13:37:58] the other angle on this is that our perspective on how hard we have to protect MW from the true traffic load is outdated too
[13:38:05] it's from before HHVM and other perf work, etc...
[13:38:45] the amount of burst it can take has probably gone up in recent times. We could also put together a plan to progressively test that at some point (e.g. VCL code to force cache-miss on X% of requests and ramp that in until we see the first signs of pain)
[13:45:36] my last thought on that this morning is also: if MW/services survive the initial very short burst successfully, the load will quickly drop off as the hottest items fill into the caches
[13:46:07] if they really can't take full load even briefly (which is entirely possible), we can mitigate that with req ratecaps or parallel-conns limits from cache->services.
[13:46:41] then the MW/services don't fall over, the caches show some intermittent 503s, but they actually do successfully fill their caches quickly and the misses + 503s again drop off pretty quickly.
[13:48:28] the only truly-ugly scenario is one in which MW/services fall over so hard under miss-load that the caches can't get filled in and the situation keeps persisting itself
[14:07:13] bblack: morning! I don't think I understand the ulsfo-to-codfw
[14:07:18] *patches
[14:07:56] why add stuff to the yaml files in one patch and then remove it in another one?
[14:08:22] ema: it's not working right anyways in my compiler testing, which means I'm failing to understand some hiera lookup subtlety
[14:09:07] ema: but the idea was that the cache::route_table in hieradata/common/cache.yaml is the default for all caches, and the ones specified in hieradata/role/common/cache/text.yaml and such override per-cluster
[14:09:28] so that set of 5 commits lets us progressively try different clusters, and once they're all on and ok we can collapse it back to the default
[14:09:49] oh so the idea is *not* to merge them all, I see :)
[14:09:56] but clearly I made some simple mistake, because the misc one doesn't affect the hosts it should in puppet-compiler
[14:10:04] well the idea is to eventually merge them all, yes
[14:10:12] right, not all at once I meant
[14:10:16] yeah
[14:14:31] I think I see the hieradata issue. it needs to be 'cache::route_table' in the per-cluster files, not just 'route_table'
[14:15:17] because it's looked up as 'cache::route_table'
[14:15:23] $cache_route_table = hiera('cache::route_table')
[14:16:33] yep that must be it
[14:17:55] anyways, even if it all were to be merged at once, I wouldn't be shocked at me adding then removing code in sequential commits heh, there's been a lot of that in this broken-down refactor series lately :)
[14:18:06] it's more to document the thinking/steps in logical order
[14:34:39] hrm
[14:34:49] varnishkafka's logrotate has
[14:34:49] service rsyslog reload >/dev/null
[14:35:05] but that init script has no reload
[14:35:15] that's in the package and the package hasn't been updated in a while
[14:35:26] so I guess this is the wrong logrotate perms issue making this visible now?
[14:47:32] probably
[14:47:53] vk logrotate conf is directly in puppet too, I guess overriding the package if it provides one
[16:01:35] paravoid: https://gist.github.com/ema/180f5e6ff680ee143071 - ema has a working POC to replace varnishkafka with 164 lines of python :)
[16:01:43] haha
[16:01:57] that's kinda awesome :)
[16:02:00] how does it perform, though?
[16:02:29] last I heard was ~5% CPU utilization, when it wasn't sending to kafka yet (just parsing to JSON, using python cjson)
[16:02:37] ori's initial code for some of those stats-gathering python daemons was using quite a bit of CPU
[16:02:45] > 10-15% IIRC
[16:04:04] paravoid: yes last time I tried on cp1008 with all the PURGEs (so just a few requests) it was around 5%, but without throwing stuff at kafka
[16:04:06] so -again, IIRC, which may be entirely wrong- he had to write code to filter in the C library code
[16:04:47] oh so 5% is just for PURGEs on an otherwise empty cluster
[16:05:17] try running it on an upload cache :)
[16:05:26] will do :P
[16:05:47] I don't want to dishearten you, sorry -- it may not work for us, but it's still worth a try I think
[16:05:58] paravoid: sure! :)
[16:08:28] 3.3% before exploding (cjson.DecodeError) on cp1048
[16:09:32] paravoid: yeah but PURGEs outnumber GETs on all caches individually right now :)
[16:09:57] ohrly
[16:10:13] because GETs are evenly distributed, but PURGEs are mirrored everywhere
[16:10:15] even the split up PURGE streams?
[16:10:33] we still don't have text/upload split to separate multicast
[16:10:36] yeah it looks like (again, without kafka output) it's not too bad
[16:10:36] oh
[16:10:55] I've rolled that out and reverted it like twice back in December-ish, kept running into user complaints of purge failure on new uploads
[16:11:01] oh!
[16:11:10] the addresses are allocated and exist, though :)
[16:11:14] and I thought I remembered this being deployed, I was super confused just now :)
[16:11:16] the python script is stable at 3.3%, varnishncsa a little < 5%
[16:12:22] paravoid: my theory on why upload caches seem to "miss" purges from the user POV when split is this: text purge rate is much higher than upload purge rate, and in both clusters we have known race-conditions between purging the layers, and somehow the excess traffic flow of pointless text purges reduces the incidence of the races in upload
[16:12:42] (in other words, at the naturally-slower rate of true upload purges, they hit all the caches very fast, fast enough to race frontend-vs-backend and such)
[16:13:22] functionally the multicast split works right, though
[16:26:40] paravoid: I am a strong supporter of ema's prototype, I am close to sending a code review for vk but the code is really heavy to maintain (also no tests)
[16:28:25] ema: are you using python 3 or 2.7?
[16:28:40] 2.7
[16:29:09] would it be possible to check straight away if it is compatible with 3.5+?
[16:30:18] 10Traffic, 6Operations: confctl: improve/upgrade --tags/--find - https://phabricator.wikimedia.org/T128199#2099271 (10Joe) p:5Triage>3Low
[16:30:56] elukey: there is no python3-cjson AFAICT
[16:31:13] buuuuuu
[16:31:43] elukey: do you think I can start testing kafka output on a test topic?
[16:32:05] yep I don't think there are any issues.. ottomata?
[16:32:17] elukey: he's not here :)
[16:32:43] ah snap
[16:32:51] well let's move to analytics then :P
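For reference, a minimal sketch of the general shape being discussed here (varnishncsa output parsed into JSON and produced to a Kafka test topic). This is not ema's actual vkncsa.py from the gist; the format string, field names, broker and topic are illustrative, and it assumes the python-cjson and kafka-python libraries are available.

```python
#!/usr/bin/env python2.7
# Hypothetical sketch, not the real vkncsa.py: pipe varnishncsa output
# through cjson and into Kafka. Format string, field names, broker and
# topic are illustrative only.
import subprocess

import cjson
from kafka import KafkaProducer

# Tab-separated format keeps splitting trivial; %{X-Cache}o is the kind
# of response-header placeholder varnishkafka's format string uses.
FORMAT = '%h\t%t\t%r\t%s\t%b\t%{X-Cache}o'
FIELDS = ('ip', 'time', 'request', 'status', 'bytes', 'x_cache')

producer = KafkaProducer(
    bootstrap_servers='kafka1012.eqiad.wmnet:9092',
    linger_ms=1000,  # flush buffered messages roughly once a second
)

proc = subprocess.Popen(['varnishncsa', '-F', FORMAT],
                        stdout=subprocess.PIPE)

for line in iter(proc.stdout.readline, ''):
    values = line.rstrip('\n').split('\t')
    if len(values) != len(FIELDS):
        continue  # skip lines that don't match the expected format
    record = dict(zip(FIELDS, values))
    producer.send('test', cjson.encode(record))

producer.flush()
```

kafka-python's linger_ms is the time-based buffering knob, roughly what librdkafka (and therefore varnishkafka's kafka.* config) exposes as queue.buffering.max.ms.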
[16:49:24] elukey: it seems to be working
[16:49:40] kafkacat -C -b kafka1012.eqiad.wmnet:9092 -t test
[16:52:50] ~6% CPU usage with message buffering set to 1 to avoid dying of boredom
[16:54:33] ema: looks good then!
[16:56:05] one thing that I wanted to ask: are we covering all of vk's current format placeholders (%{VAR}o/i, etc..) or is there something missing in ncsa?
[16:56:24] I believe that we can survive even with something a bit different :P
[16:57:02] I would also check for python 3.X+ compatibility for longer term maintenance, because eventually we'll need to move it and the sooner the better (but even starting with python2.7 would be good)
[17:07:40] lol it turns out that kafka.queue.buffering.max.ms and kafka.queue.buffering.max.messages are not exactly the same thing
[17:08:29] :)
[17:09:34] OK yes this is promising
[17:09:53] take a look at cp1048, vkncsa.py and varnishncsa
[17:11:23] so about double?
[17:11:38] that's not a bad trade for "no more varnishkafka C code" IMHO
[17:13:23] I'm sure we could profile it and find ways to optimize better, too, but IMHO if it's not in problematic performance territory, it's probably not worth the effort for now.
[17:13:42] (even if we expend the effort, probably want to stick to optimizations that don't make the code terribly uglier or harder to maintain)
[17:14:23] 10netops, 10Monitoring, 6Operations, 10Scap3 (scap3-adoption): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#2099469 (10thcipriani) >>! In T129136#2098628, @faidon wrote: > What would be the next steps for this? Scap makes a few different assumptions than Trebuchet that you'd need...
[17:14:26] right, it's below 10% for sure
[17:14:51] so more than double but not outrageous
[17:16:10] and less than varnishreqstats for example
[17:19:53] yeah, looks good!
[17:22:47] bblack: also, I'm not sure if you had the time to look at this but I think I've managed to find my way through libssl for NPN/ALPN https://phabricator.wikimedia.org/P2719
[17:23:04] ema: no I haven't looked yet, but will at some point today!
[17:23:27] cool, the nice part is that we can identify precisely clients doing NPN and choosing http/1.1
[17:23:44] (tried with spdycat on my machine)
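The systemtap paste watches this from the server side; as a quick client-side cross-check (same spirit as the spdycat test above), Python's ssl module can report what a single client negotiates. A rough sketch, assuming Python 2.7.10+ or 3.5+ for ALPN support; the target host and protocol lists are illustrative.

```python
# Hypothetical client-side probe: connect to a TLS terminator and report
# what was negotiated via ALPN and/or NPN. The host and protocol lists
# are illustrative; set_alpn_protocols/set_npn_protocols raise
# NotImplementedError if the local OpenSSL lacks support.
import socket
import ssl

HOST = 'en.wikipedia.org'  # any TLS endpoint to test against

ctx = ssl.create_default_context()
ctx.set_alpn_protocols(['h2', 'spdy/3.1', 'http/1.1'])
ctx.set_npn_protocols(['spdy/3.1', 'http/1.1'])

sock = socket.create_connection((HOST, 443))
conn = ctx.wrap_socket(sock, server_hostname=HOST)

print('ALPN: %s' % conn.selected_alpn_protocol())
print('NPN:  %s' % conn.selected_npn_protocol())

conn.close()
```

Pointed at a TLS terminator, it shows whether that particular client stack ends up on ALPN, NPN, or neither.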
[18:07:57] 10Traffic, 10MobileFrontend, 6Operations, 3Reading-Web-Sprint-67-If, Then, Else...?, and 2 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2099704 (10Jdlrobson)
[18:18:27] 10Traffic, 10MobileFrontend, 6Operations, 13Patch-For-Review, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2099795 (10Jdlrobson)
[18:21:54] 10Traffic, 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switch ulsfo to backend to codfw rather than eqiad - https://phabricator.wikimedia.org/T127492#2099821 (10BBlack) 5Open>3Resolved a:3BBlack
[18:21:56] 10Traffic, 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2099823 (10BBlack)
[18:26:21] 10Traffic, 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2099839 (10BBlack) Status update: The only remaining work here ahead of the big switches of the applayer services...
[18:30:16] 7HTTPS, 10Traffic, 6Labs, 6Operations, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2099871 (10Dzahn) Yea, it meant the request came via http and was 200.
[19:26:04] ema: re npn/alpn stap: 1. can't sort out kbuild pkg / .config issues to recompile? 2. is it known that SSL_get0_next_proto_negotiated is only called if there was no ALPN? otherwise stats are distorted and 3. the ALPN thing needs to log ALPN spdy-only too (which is possible)
[19:56:06] 10Traffic, 6Operations: Fix puppet on deployment-cache* hosts in beta labs - https://phabricator.wikimedia.org/T129270#2100309 (10Ottomata)
[21:56:03] 10Traffic, 10MobileFrontend, 6Operations, 13Patch-For-Review, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2100862 (10Jdlrobson) If I can get https://gerr...
[22:12:34] 10Traffic, 10MobileFrontend, 6Operations, 13Patch-For-Review, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2100943 (10Jdlrobson) SWAT scheduled for today...