[00:41:28] 10netops, 6Operations, 10ops-eqiad: Update the port description on the network switch for analytics 1017 and 1021 - https://phabricator.wikimedia.org/T131218#2159707 (10Dzahn) [04:52:37] re: varnishkafka - I uploaded a series of 7 patches on top of elukey's "remove loglines", which start at https://gerrit.wikimedia.org/r/#/c/280294/ and end at https://gerrit.wikimedia.org/r/#/c/280381/ [04:53:37] in some ideal world if we're lucky, I've made the code cleaner/better/faster/etc, and the memory leak may have vanished as a side effect [04:54:23] in some realistic world, maybe every single patch breaks the code in new and exciting ways and it's totally not worth digging into why, and we should just toss those changesets into the Abandoned pile and forget about it :) [04:55:11] but if you have time in the morning, try the last of the 7, see if it works and doesn't leak and whatnot. it's kind of a moon-shot sort of thing. maybe it helps, maybe it doesn't. don't waste too much time staring at them if they don't work. [07:11:46] o/ Will take a look to them today and I'll try to test them! [08:36:16] https://puppet-compiler.wmflabs.org/2220/cp1043.eqiad.wmnet/ [08:36:35] puppet compiler is pretty cool [08:50:48] 10Traffic, 6Operations: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#2160180 (10fgiunchedi) [09:40:19] elukey: on cp1043 varnishkafka dies with Mar 30 09:39:40 cp1043 varnishkafka[2681]: VSLQ_Dispatch: Varnish Log abandoned or overrun. [09:41:17] varnishncsa/varnishlog are also doing weird stuff though [09:41:30] 10.64.0.171 - - [30/Mar/2016:09:41:01 +0000] "GET http://maps.wikimedia.org/_info HTTP/1.0" - - "-" "Twisted PageGetter" [09:41:33] Log abandoned [09:41:36] Log reacquired [09:41:38] Log overrun [09:41:41] Log reacquired [09:51:22] aha! [09:51:32] Wrong turn at vslp_dir.c:412: Incomplete VSLP hashcircle [09:51:57] we forgot to call init_hashcircle after creating the vslp director [09:53:40] elukey: so yeah, nothing to see here, varnishkafka seems to be running fine [09:53:45] \o/ [09:53:48] was reading now [09:54:07] I am happy that the VSLQ_Dispatch works :) [10:46:58] elukey: re removing format.key: it just made the code simpler as I was going, but concerns about "upstream" are valid, etc [10:47:10] still, it's an expensive feature in terms of complexity that we're not using anyways [10:47:36] I could refactor the remaining stuff to work with it too, the other changes aren't necessarily dependant on it [10:48:31] bblack: morning! [10:48:50] (but if so, maybe test if the remained of the changes combined actually work ok and/or plug the leak first, before I bother re-making them on top of +format.key) [10:49:03] morning :) [10:49:03] yep yep definitely [10:50:02] I am not sure that we have a leak though because I am almost convinced that this is something triggered by the VM/hypervisor somehow, there is not trace of leaks for the moment.. Anyhow, I am testing on the same VM each commit with 5k requests (all MISS) to varnish [10:50:35] ema: how does cp1043 look so far. also I was thinking earlier: don't we need to wipe persistent storage on up/down-grade as well? [10:52:06] bblack: it's depooled and I'm almost done fixing some minor stupidity with /etc/ganglia/conf.d/varnish.pyconf [10:52:25] it's looking good though [10:52:46] (by "wipe" I mean rm -f /srv/sd*/varnish* between stop->start. even if they're basically binary-compat on storage, surely the meaning of stored obj fields will have changed subtly, etc) [10:53:14] bblack: good point [10:53:23] vk: down to 3 code reviews, should be finished in ~1hr [10:55:45] elukey: on the leak: it would be unusual for a VM to trigger leak behavior. but probably if you can run valgrind over vk with a full 5k run through varnish like you're talking about, with the new patches, and it still leaks, they may be much easier to track down at least. [10:56:41] if the basic cleanup didn't fix it and/or seems risky, then we could always backtrack with the new understanding and fix just the leak in the earlier code and leave scary refactors out. [10:57:45] bblack: https://gerrit.wikimedia.org/r/#/c/280412/ [10:57:50] then again the "leak" may be completely false. it could just be initial buffer growth inside librdkafka or whatever, if you're not sure the test scenario is really "stable" and just measuring by process mem stats growth. [10:59:19] bblack: the weird thing is that it happens also with vk3 on the same VM but different packages of course (Debian Jessie). I tried also with mw-vagrant (Ubuntu) and the leak is less evident. And with all my runs yesterday valgrind was only able to show some Kbs of memory as not "possible" leak, not clear one. [11:00:27] elukey: the valgrind output pasted yesterday, was that from a large request count through varnish itself as well? and was it actually talking to kafka output, or using the stdout/null output funcs, or? [11:01:04] all of the leak hints in that one seemed like they came from setup-time code anyways, not from code that would be invoked in the per-request runtime loop, so they didn't seem legit. [11:01:12] yes yes I was executing a curl in a for loop changing query string each time. [11:01:32] but the valgrind test was only with stdout [11:01:52] it's common for code to "leak" once per execution like that, unless one is very pedantic about freeing up every allocation at exit() time pointlessly, basically. [11:01:58] yep yep exactly [11:01:59] alright, we're now puppet-clean on cp1043 [11:02:39] bblack: I believe that all your changes are good anyways, this code cleanup was needed and I would be inclined to merge everything [11:05:17] bblack: persistent storage "wiped", basic testing with curl/varnishtest OK [11:05:26] ema: awesome [11:05:34] I'll have some lunch and then I think we can try repooling it [11:05:47] have you tested real maps fetches, or just test fetches like / ? [11:05:59] I've tried a png file [11:06:03] I was thinking, try some tiles/js from legit maps.wm.o requests, compare hash of outputs, etc [11:06:06] ok [11:06:57] oh yes good point. So far I've just checked that Content-Length matches [11:07:09] of course, with how our chashing works.... [11:07:35] some of your test requests are going cp1043->cp1043, and some are cp1043->cp1044. so might have to try a couple of random URLs until you've tested both [11:07:49] for half-upgraded cluster compat, and also definitely testing the new backend too [11:07:52] oh I was just curling localhost directly [11:07:58] :80 and :3128 [11:08:29] but yes sure those against :80 can end up on cp1044 [11:08:35] right so curl on :3128 is, in the *general* case, never really legit, as it skips over a bunch of frontend transforms. although in practice on maps there aren't any anyways :) [11:08:59] but yeah I'd check that some URLs over :80 that X-Cache through both 1043 and 1044 backends work out ok [11:09:21] sounds good! food first though. bbiab [11:09:25] ok [11:11:03] another thing that would be interesting is to header-diff the :80 outputs from 3x scenarios: 1044->1044 (current live v3) 1043->1043, and 1043->1044. It will be interesting if there's behavioral differences to note in how headers look (headers added or removed, or output is substantially different) [11:12:11] heh of couse since 1043 is depooled, 1043's own frontend won't currently query 1043's backend :) [11:12:31] could stop confd on the host and hack the dynamic directors output, though, to get it to use itself again. [11:13:07] exciting stuff! [11:20:11] this one is a big heavy: https://gerrit.wikimedia.org/r/#/c/280380/1/varnishkafka.c :D [11:20:53] :) [11:21:30] I don't think it makes any bug-difference. it just makes human/machine analysis easier, and has some micro perf tradeoffs built in. [11:23:52] the main thing is that no scratch_foo() calls match_assign/match_assign0 [11:24:17] match_assign still calls scratch_foo() though, when warranted by force_copy (because the input is on the stack) or warranted by the need for escaping output into scratch. [11:25:19] or another way to think of that: scratch_foo() calls only deal with allocating/writing to scratch space, as their name indicates [11:26:11] before there were runtime call chains with stuff like: parser invokes scratch_foo invokes match_assign invokes scratch_foo and such, with temporary scratch allocs that rewind and reuse partway through that callchain [11:26:13] yep, makes sense and it is more clear. This part of the code was the "hic sunt leones" part of my mental map [11:27:03] it's fun being on the other side of the fence for once [11:27:16] would you mind to tell me (just a summary) how did you approach the debugging/coding/testing/etc..? [11:27:30] usually I'm the one writing unreadable C code for pointless 0.01% perf gains. now I'm unwinding someone else doing it on the grounds of ETOOCOMPLEX :) [11:27:43] ahhaha [11:28:50] elukey: I didn't really debug/test at all when writing those patches. I got a good working environment setup (labs vm with git checkout + prereqs installed)... [11:29:43] and then for the general code cleanups, I just editing the makefile and added lots of warnings flags, and also ran "scan-build make" (with s/gcc/$(C))/ in the makefile) to get clang-analyzer runs, and installed/ran cppcheck too [11:30:10] I wasn't willing to get so invasive as I'd have to, to make all the tools/warnings come back clean, this just fixed up some of the egregious ones, and I left the rest alone [11:30:47] the rest with the deeper changes, I only went through the problem->solution mentally and made sure it still compiled, I didn't really debug/test that either [11:32:02] the mental flow there was basically: look for the allocation-related things, since there might be a leak. rule out the simple ones and the ones that aren't in the runtime loop. [11:32:27] the rest, if they're too complicated to easily mentally reason about, see if there's a way to refactor them simpler without making horrible perf tradeoffs, etc [11:32:37] thanks! I was curious to learn ways to approach codebases like this one in the future, it was pretty hard for me and ema to get a decent understanding of the vk internals initially.. [11:33:24] with complex C code like vk.c, especially in a case like this where IMHO not enough focus was put on cleanliness to begin with, it's probably not reasonable to expect anyone to have a decent understanding at it that didn't write it. [11:34:12] I just happen to have spent years staring at and debugging and writing similar code, so it's slightly easier for me to jump through the mental hoops of it. [11:34:16] that's not necessarily a good thing :P [11:35:00] :D [11:36:39] it is a bit frustrating for Analytics since vk could be expanded a lot more, for example with Andrew's patch to have dynamic kafka topics. [11:37:03] IMHO the long-term sane path forward is to work on the python replacement [11:37:27] then upgrading for new features isn't so scary and error-prone. it should be possible to get the python version working fast enough in general. [11:37:28] with tons of unit tests :D [11:37:45] and yes, tests would've helped a ton. refactoring without tests is flying very blind :) [11:42:49] code reviews completed, no complains and all +1s. Shall we merge or are we going to build a package and test it first in maps? [11:49:23] elukey: I'd test it first and see how it flies [11:50:01] we may not want to take the risk of merging so much refactor into such a scary codebase in the end, but testing on maps helps [11:50:09] +1 [11:52:51] brb lunch! [11:59:58] and really, even if it tests ok and we like the new patches, should still consider putting format.key back in place and then remaking the other patches on top. [12:01:07] in diff line-noise terms, undoing the format.key change will cause lots of merge fail for the big match_assign/scratch patch, but mentally it's not difficult to fix up the mess. [12:04:40] ideally the format.key removal patch should've been at the end of the chain, but the loops and indirection it kills made it easier to think about the remainder at the time [12:15:43] I believe that format.key was added only because librdkafka exposes a way to add keys, but I am not really sure who uses it. I mean, we don't and there might be somebody upstream relying on it but with some warnings in README.md we'll definitely catch a github ticket with "please re-add it!" [12:16:59] it is not a good way to upgrade sw but since we are doing a big change we should also to it properly :) [12:19:34] stopping confd on cp1043 to make it go through itself [12:20:18] BTW what is "Twisted PageGetter"? [12:20:35] I see it's requesting stuff to cp1043's frontend [12:20:45] 10.64.0.171 - - [30/Mar/2016:12:20:06 +0000] "GET http://maps.wikimedia.org/_info HTTP/1.0" 200 129 "-" "Twisted PageGetter" [12:21:40] that's pybal [12:21:44] it checks depooled servers too [12:21:46] aha! [12:27:23] bblack: https://phabricator.wikimedia.org/P2833 [12:27:42] that's a HEAD request for https://maps.wikimedia.org/osm-intl/4/4/5.png [12:28:06] cp1043's results come from curl -I http://localhost:80/osm-intl/4/4/5.png on cp1043 itself [12:28:19] cp1044 is https://maps.wikimedia.org/osm-intl/4/4/5.png from my machine [12:29:33] interesting that ETag varies there, but probably not directly related to v4 [12:29:53] or just because varnish makes up its own ETag when it stores/mods the headers, etc [12:30:23] I wonder if the ETag is constantly inconsistent in the maps-test backends heh [12:33:13] yeah it does. totally separate from v4, but might be interesting to pursue with the maps team [12:33:16] bblack@cp1008:~$ curl -sv http://maps-test2001.codfw.wmnet:6533/osm-intl/4/4/5.png 2>&1 >/dev/null|egrep -i '^< (etag|content-length|date)' [12:33:19] < ETag: "edeec35efc4f3984ddb92c78659ca48d" [12:33:22] < Content-Length: 33105 [12:33:24] < Date: Wed, 30 Mar 2016 12:32:47 GMT [12:33:27] bblack@cp1008:~$ curl -sv http://maps-test2001.codfw.wmnet:6533/osm-intl/4/4/5.png 2>&1 >/dev/null|egrep -i '^< (etag|content-length|date)' [12:33:30] < ETag: "d9f88a159c00346aa52d1bf257f659f9" [12:33:32] < Content-Length: 33105 [12:33:35] < Date: Wed, 30 Mar 2016 12:32:49 GMT [12:33:41] I see some values repeat though. it's not completely random. [12:33:51] interesting [12:34:38] bblack: do we care about Via: varnish-v4 vs. Via: varnish? [12:35:20] I don't think we care that I'm aware of. it's useful for debugging half-deployed cluster issues anyways [12:36:18] so with 100x requests to maps-test2001, I get this distribution of unique ETags for that PNG: [12:36:21] 6 < ETag: "163e8ef3e5f70e44f5a2aef2099d846c" [12:36:23] 6 < ETag: "b340082b9f1e93db49f69032b448e6b0" [12:36:26] 8 < ETag: "155a739979a33bbb5f7d018aada15675" [12:36:28] 8 < ETag: "d9f88a159c00346aa52d1bf257f659f9" [12:36:31] 11 < ETag: "a96dca353d70a4b2c931886ce76908ec" [12:36:33] 11 < ETag: "d8067e1d71fff39789aafe4c38394082" [12:36:36] 25 < ETag: "dc5de3a8e4572ed478f5f6d0ef3b05e0" [12:36:38] 25 < ETag: "edeec35efc4f3984ddb92c78659ca48d" [12:36:47] beats me, but again something for the maps time to puzzle about :) [12:37:23] right :) [12:37:25] s/time/team/ [12:37:51] the png md5sum match [12:38:09] *md5sums [12:38:30] it works, ship it! [12:39:00] alright! [12:41:21] bblack: repooled, staring at the logs [12:41:49] looks fine! [12:43:23] wooooo [12:46:00] mmmh ganglia went bananas again https://ganglia.wikimedia.org/latest/?c=Maps%20caches%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [12:46:12] bblack: what did you do last time to fix this? Restart some aggregator somewhere? [12:47:28] ema: trying "service ganglia-monitor restart" on the hosts first I think [12:47:35] it's often just that [12:48:09] I'm not sure if ulsfo even goes through an aggregator, looking [12:48:23] eqiad you mean? [12:48:34] oh right eqiad [12:48:49] eqiad is ganglia_aggregators: carbon.wikimedia.org:9649 [12:48:49] cp1043 shows up again after ganglia-monitor restart [12:48:53] but all of eqiad would be missing [12:49:08] ok [12:49:16] probably 1044 needs the ganglia-monitor restart too [12:49:32] I've tried already, no luck [12:49:39] also now cp1043 disappeared again [12:49:43] heh [12:49:48] is there more than one gmond process? [12:49:56] nope [12:50:24] well, ganglia also picks up varnish stats, could be related to v4 things changing? [12:50:39] eqiad text is fine, so it's not the aggregator on carbon I don't think [12:50:43] I thought about it, but then why both? [12:50:56] quick update - @cp1043:~$ top -p `pgrep varnishkafka` - RES keeps growing, very slowly but it does [12:51:01] well they have different last-heartbeat times too [12:51:29] 1043 "Last heartbeat -7258s ago" seems very odd [12:52:12] I was going to see how long I could sit here and do/say productive things without plugging in my yubikey today, I think I reached my limit heh [12:52:21] :) [12:52:52] I've now tried to stop ganglia on cp1043 to see if for some reason cp1044 comes back [12:54:06] so about 10:48 (a little over 2h ago) is when 1044 dropped off of ganglia stats [12:54:40] it was more like 09:18 for cp1043, but then it has intermittently sometimes sent little stats bursts since [12:56:03] oh right, so there are multiple aggregator daemons per-site [12:56:35] hmmmm [12:57:51] I've restarted ganglia-monitor on cp1043 in the meantime [12:58:04] I restarted the specific maps-eqiad aggregator instance on carbon [12:58:16] oh look, stats again [12:58:27] so it was the aggreagator [12:58:37] at least, it's not showing them as heartbeat-dead [12:58:40] I didn't see real stats yet [12:58:57] may have to restart gmond's on the hosts as well for all I know [12:59:10] done [12:59:53] aaand cp1044 is gone again [12:59:58] heh [13:01:15] FTR: [13:01:15] root@carbon:~# grep -i maps /etc/ganglia/aggregators/* [13:01:15] /etc/ganglia/aggregators/1051.conf: name = "Maps caches eqiad" [13:01:16] root@carbon:~# stop ganglia-monitor-aggregator-instance ID=1051 [13:01:28] then repeat with 'start', is what I did [13:01:44] and then you can see the effect in the ps list [13:03:18] elukey: it looks like we're sending wrong stuff to kafka [13:03:27] {"hostname":"cp1043.eqiad.wmnet","sequence":13851,"dt":"-","time_firstbyte":0.0,"ip":"-","cache_status":"-","http_status":"-","response_size":0,"http_method":"-","uri_host":"-","uri_path":"-","uri_query":"","content_type":"-","referer":"-","x_forwarded_for":"-","user_agent":"-","accept_language":"-","x_analytics":"-","range":"-","x_cache":"-"} [13:03:52] that's from oxygen? [13:04:06] stat1002 [13:04:08] kafkacat -C -b kafka1012.eqiad.wmnet:9092 -t webrequest_maps -c 10 [13:04:12] ah [13:04:18] mmmmm [13:04:30] do we have to change the format string? [13:04:58] what package are we running anyways, one built from the logline cache delete commit? [13:05:31] ema: shouldn't be the case, but we really tested only webrequest logs [13:05:36] oh right I didn't package the latest changes [13:05:54] well yeah but the only problem before was the leak [13:05:55] let me check the format [13:05:56] I think? [13:06:22] bblack: yep on mediawiki-vagrant it was working fine [13:06:54] commit 149c8612c557326d1132ddaeb454b2509a2a7aa9 [13:07:06] Date: Wed Mar 23 18:45:19 2016 +0000 [13:07:53] I wouldn't even try to package the later changes yet if this one isn't working right, debug that first [13:08:03] right [13:08:06] maybe elukey's logline cache removal patch, if we think that may have solved bugs [13:09:03] ema: what is the change-id of the commit? [13:09:20] elukey: Id2bcd3944bee565071db6d96e103906288780098 [13:09:44] ema: thanks [13:10:21] elukey: does the same package that's on cp1043 test ok on vagrant for kafka output? [13:11:07] ema: [13:11:08] elukey@stat1002:~$ kafkacat -C -b kafka1012.eqiad.wmnet:9092 -t webrequest_maps -c 10 [13:11:11] {"hostname":"cp1044.eqiad.wmnet","sequence":42573592,"dt":"2016-03-30T13:10:48","time_firstbyte":0.062302589,"ip":"182.74.62.26","cache_status":"miss","http_status":"200","response_size":1328,"http_method":"GET","uri_host":"maps.wikimedia.org","uri_path":"/osm-intl/10/14/412.png","uri_query":"","content_type":"image/png","referer":"-","x_forwarded_for":"182.74.62.26","user_agent":"JTileDownloader [13:11:17] /0.6.1","accept_language":"-","x_analytics":"https=1;nocookies=1","range":"-","x_cache":"cp1044 miss(0), cp1044 frontend miss(0)"} [13:11:17] elukey: that's cp1044 [13:11:34] yeah sorry seen only after the paste [13:11:50] should have not pasted the IP though, my bad [13:12:31] FWIW https://grafana-admin.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes looks ok, if varnishlog4.py wasn't working I'd think we'd lose half our stats on stuff like that [13:12:43] awesome [13:14:00] bblack: I can try to test it with mw-vagrant, but that commit was working with the config that Andrew gave me.. I am checking now if there is something weird in the conf file [13:15:00] elukey: if you want to take a look at json output on cp1043, varnishkafka -S /home/ema/varnishkafka.conf [13:15:42] that's the webrequest config but with the outputter switched? [13:16:17] no, they're very different [13:16:49] yours uses format.key to just show everything in the raw I guess [13:18:06] yeah sorry [13:18:13] -S /home/ema/webrequest.conf [13:18:36] that's webrequest with the outputter switched to stdout [13:19:56] seems that it doesn't read any tag for some reason [13:21:17] is it as simple/dumb as a change to the _.vsm path or something? [13:21:45] mmh but the requests are displayed [13:23:29] uh same output on my labs instance [13:23:44] wow I guess I've never looked at varnishlog output for v4 before. it's newer and in some ways better, but different [13:24:02] I don't know how I feel about e.g.: [13:24:02] - RespUnset X-Analytics: [13:24:02] - RespHeader X-Analytics: ;https=1 [13:24:02] - RespUnset X-Analytics: ;https=1 [13:24:02] - RespHeader X-Analytics: ;https=1;nocookies=1 [13:24:18] it's nice to see a trace of what VCL is doing, but it would also be nice for varnishlog just to show the final output header set, too. [13:24:59] ema: what about with -N ? [13:25:11] I meant, -N path/to/_vsm [13:25:29] elukey: same [13:26:16] have the varnish packages been updated since the vk ones have, in ways that might matter for success? [13:26:33] (I mean as in, rebuilding vk against newer libvarnishapi1 or whatever) [13:27:50] well we're using 4.1.2-1wm1 on cp1043, not sure what elukey was using in his tests [13:29:10] not just the test, but the package build too [13:29:23] if there's any chance headers changed that it pulls in [13:29:55] oh I see [13:30:28] ema: I was using my version built via make locally [13:30:56] oh yes for the build we used 4.0.2-1! [13:31:32] also, the installed varnishkafka on cp1043 still has the 100ms sleep [13:31:42] it may be quite old in terms of revisions to the patches [13:31:58] alright let me build a new version then [13:32:07] which commit should I use? [13:32:38] I'd try https://gerrit.wikimedia.org/r/#/c/276439/ [13:32:40] ema: I'd say Change-Id: I40c04ca91d9d31b4d0d735298f874ed6b883120d (remove loglines) [13:32:49] yeah same one [13:33:59] alright [13:34:07] oh and I was wrong about 100ms sleep, I mis-counted zeros [13:34:10] but still [13:35:22] merge conflicts \o/ [13:35:31] buuuuu [13:36:15] merge conflicts with what? [13:36:48] I mean, it's not with itself, it has a clean history back to origin/master [13:37:03] between that commit and the debian-v4 branch [13:37:25] what's the debian-v4 branch? [13:37:27] ah, yeah good point bblack :) [13:37:34] in our repo? [13:37:35] it's me being stupid [13:37:42] ok [13:38:40] debian-v4 is Id2bcd3944bee565071db6d96e103906288780098 + debian packaging [13:38:46] ok [13:46:11] should we de-pool cp1043 in the meantime to avoid pollution in kafka? [13:46:32] it is not a huge deal but we might need to do more tests [13:51:27] it works \o/ [13:52:08] I40c04ca91d9d31b4d0d735298f874ed6b883120d rebuilt against libvarnishapi 4.1.2-1wm1 [13:52:14] nice [13:52:26] however we need to remove the logline options [13:52:35] which options? [13:52:41] hash_size and such? [13:52:45] yup [13:53:00] if we try packaging my later commits, there's more stuff like that [13:53:07] logline.hash.max and logline.hash.size [13:53:39] data.copy gets removed as a feature, format.key (maybe), and the default scratch.size jumps by a factor of 1024 (but I donno if our config sets it back to the old value) [13:54:33] (it does, we hardcode the 4096 default - could just remove that so the config works fine on both, ahead of time) [13:54:58] * elukey hugs ema [13:56:47] ema: whenever you have time, let's double check together timings differences between 1044 and 1043 [13:57:04] * elukey goes on stat1002 [13:58:16] they seem sound (ttfb, date) [14:01:17] elukey: nice [14:01:42] OK so, removing logline.hash.{size,max} and friends means touch the varnishkafka puppet submodule [14:02:07] joy [14:02:14] I'll do it in a minute [14:04:57] cool in the meantime I've disabled puppet on cp1043 and removed those lines by hand [14:04:57] elukey: ja, i doubt the missing x-analytics are a varnishkafka issue, but they could be i suppose [14:04:57] invoking our kafka master ottomata :) [14:04:57] hmm, lets see if vanrishncsa shows them... [14:04:57] or varnishlog [14:04:57] varnishlog -n frontend [14:04:57] does show me [14:04:57] - RespUnset X-Analytics: [14:04:57] - RespHeader X-Analytics: ;https=1 [14:04:57] - RespUnset X-Analytics: ;https=1 [14:04:57] - RespHeader X-Analytics: ;https=1;nocookies=1 [14:04:57] - RespUnset X-Analytics: ;https=1;nocookies=1 [14:04:57] - RespHeader X-Analytics: https=1;nocookies=1 [14:04:57] that header is set by vcl logic, afaic remember [14:05:48] varnishncsa -n frontend -F '%{X-Analytics}i' [14:05:49] though [14:05:50] so this is what I was complaining about with varnishlog in general (not just for vk purposes) [14:05:51] just shows '-' [14:06:08] varnishlog's output has changed: it now traces all the steps in set/unset/reset of a single header [14:06:14] instead of just showing the final value [14:06:16] hm maybe ma doing that wrong [14:06:42] this has probably affected how VK looks at shm too: it may need to search several copies of the same tag-data until it gets to the final RespHeader or RespUnset [14:06:50] aw it is o [14:06:53] varnishncsa -n frontend -F '%{X-Analytics}o' [14:06:53] you can also use the grouping, like -g request [14:06:55] ok that works [14:07:00] (not sure if it is the default) [14:07:06] I've tried with e.g. " [14:07:24] all the -g groupings, they always still split up the header (re-)sets [14:08:24] hm elukey i made a /tmp/vk.conf that is the same as the main one, except it spits to stdout [14:08:31] x_analytics is not set [14:08:45] because it's probably stopping on that first tag data when it looks for it [14:08:48] - RespUnset X-Analytics: [14:08:50] OH [14:08:52] right [14:08:53] hm [14:09:05] hmm, but varnishncsa works. [14:09:14] so varnishncsa is doing something else to figure it out that varnishkafka is not doing [14:09:15] well varnishncsa-v4 may be smarter [14:09:19] aye [14:10:25] if you squint your eyes while looking at the varnishncsa source, you can see how vk.c was derived from it heh [14:10:30] hmm, yall should merge your changes onto a branch in gerrit... :) [14:10:39] now that there are more than one of them [14:10:45] ha, yeah [14:11:05] so varnishncsa-v4 has: [14:11:05] if (tag == SLT_ReqHeader) [14:11:05] process_hdr(&CTX.watch_reqhdr, b, e); [14:11:06] if (tag == SLT_RespHeader) [14:11:06] process_hdr(&CTX.watch_resphdr, b, e); [14:12:05] which uses those context objects to track through them [14:13:48] maybe processing RespUnset? [14:14:17] anyways, this is going to require some further fundamental-ish work on varnishkafka-v4 [14:14:23] hm, well the vcl logic amends x-analytics multiple times [14:14:28] which I think assumes a tag only appears once usefully [14:14:31] yaeh [14:14:48] whereas varnishncsa-v4 knows it has to keep watching the header evolve into its final state [14:15:09] but, clearly this is looking good as is. what do you all think about doing further work on a branch? so we can merge what exists and make future reviews/rebases easier? [14:15:17] i mean, good work as is [14:15:20] still more todos [14:15:53] i can help wrangle gerrit and patches if you like [14:17:34] it's probably not necessary to branch out for this, we can merge what exists regardless [14:17:42] well, up through the logline cache killer [14:17:55] assuming that's not part of the problem heh [14:20:12] bblack: I am watching vk on cp1043 with top and it looks good [14:20:45] RES is not climbing anymore, but of course we'd need to wait a bit [14:23:31] bblack: i'd just branch while deving, i guess if you guys are ok with merging into master now its fine [14:23:38] or, at least, i'd like freeze a v3 branch [14:23:50] not sure how long it will take us to upgrade everywhere [14:23:59] but if we have to patch something it'd be nice to make that easier [14:24:02] we can always branch from the last v3 point regardless [14:24:07] indeed [14:24:09] the power of git compels you [14:24:16] ooook fiIIiIine you are right [14:25:04] right now my concern is that, until we have a firm grip on the fundamentals of the resp header problem, I'm not 100% convinced we should really merge the logline cache removal patch [14:25:34] elukey and/or ema might have a better idea about that: whether it's fundamentally possible that we do still need to look at more than one "logline" at a time [14:26:35] I *think* it's all one logline and just several distinct matches for the tag, which one has to comb through and take the last of. not sure how Unset interacts there either [14:28:33] bblack: I don't think that we'd need more loglines, need to check the tag_* functions to see how tags are stored [14:28:59] yeah I'm looking at that now [14:29:15] just as a crazy experiment (I suspect we need to be more subtle for all corner cases, but you never know!).... [14:29:34] in tag_match, I think if we comment out the block that says: [14:29:35] /* Value already assigned */ [14:29:35] if (lp->match[tag->fmt->idx].ptr) [14:29:35] continue; [14:29:51] it might Just Work (it will keep overwriting the tag data as new ones appear in the stream) [14:30:16] which will waste some scratch space, but if we also merge up all the scratch/tmpbuf stuff, that's not a practical issue. [14:31:30] might try a build with that commented out anyways, and see if it makes x-analytics show up in some test [14:31:42] (and doesn't muck up anything else) [14:32:11] bblack: so basically we wouldn't care about the unsets but just the latest value for a tag right? [14:32:19] yeah [14:33:38] trying to think about corner cases [14:33:38] I have no idea how the RespUnset thing plays into this, whether it shows up the same as the Sets but with an empty/null value, or is completely different, or? [14:33:38] but probably, if we're logging a header, we're not setting it and then intentionally unsetting it at the end just before we log it, in practice [14:33:39] we could always test that scenario once explicitly with the new code (do a vcl_deliver that sets then unsets a header that's being logged in the format line) [14:34:07] yep yep [14:35:01] in any case, the block mentioned above is what's stopping it from looking at all at a second instance of the same response header tag [14:35:20] not skip things because they're 'already assigned' seems like step 1 of unraveling the mystery from a black box POV [14:37:12] if the simple approach (don't skip) seems to work ok (which is questionable!), then the pragmatic issue is it sucks we're invoking the full parse-to-scratch code for all the throwaway copies before the final one. in which case the trivial answer would be: put the skip back in, and process the transactions in reverse order :) [14:37:33] err s/transactions/tags/ [14:37:54] :D [14:38:51] of course the tag iterator is VSL_Next() magic, so I don't know how easy that would be [14:58:13] back on other tracks: we still have no ganglia data for maps and no clear reason why [14:58:21] I doubt it's a v4 problem, though [14:58:31] just sucks that it blocks seeing those stats while testing this [14:58:45] yep [15:05:45] mem leak is not showing up atm (FYI) [15:05:57] nice [15:06:17] ema: https://gerrit.wikimedia.org/r/280452 - let me know if I am missing something [15:11:32] elukey: so you removed logline.hash.size and logline.hash.max, right? [15:11:51] bblack was mentioning other directives that need to be removed [15:12:19] well, if we applied my cleanup patches on top [15:12:28] IMHO if there's no leak, there's no pressing need to take that risk [15:12:59] alright then! [15:14:17] the million-dollar question right now is: even on a test VM or whatever, does killing the value-already-assigned continue check make a scenario like x-analytics (resp header that's set/unset multiple times, with a final set) work? [15:14:27] (and not break other things?) [15:14:57] if it *doesn't*, then no variant of vk.c we currently have available is going to work for the long term, and we're looking at hairy new work on top of what we have to make it happen. [15:15:17] vs just dropping in vkncsa.py and seeing how that plays out [15:16:42] re: ganglia. I've stopped gmond on cp1043, restarted it on cp1044 and restarted the aggregator on carbon. cp1044 started reporting data again [15:16:47] so cp1044 is messing things up [15:17:01] you mean cp1043 is messing things up? [15:17:18] yes s/cp1044/cp1043/ [15:17:23] I suspect the new `varnishstat -1 -j` output is the culprit [15:17:38] I bet we've got some changes to account for in the ganglia data it pulls from varnishstat, and somehow that's throwing bad data at the aggregator and making it buggy, then [15:18:40] I'm out for an hour or two, will be back! [15:18:54] cya! [15:27:42] ema: vk-module updated, you can use the new template now :) [15:30:57] nice, thanks elukey [15:46:26] the ganglia problem is certainly due to the new varnishstat output and /usr/lib/ganglia/python_modules/varnish.py [15:47:28] I've disabled the python ganglia module on cp1043 (/etc/ganglia/conf.d/varnish.pyconf.disabled) and cp1043/cp1044 look good again [16:47:35] bblack: going offline, I'll try to remove the "duplicate" check on vk tomorrow and test it. If you manage to make it work before me let me know :) [16:48:18] byyeeeee [17:05:41] bye! [17:12:57] 10netops, 6Operations, 10ops-eqiad: Update the port description on the network switch for analytics 1017 and 1021 - https://phabricator.wikimedia.org/T131218#2161277 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson updated switches [17:39:43] bblack: is there a way I can fake a zero request? so I can test if my code that reads X-Carrier works [17:45:40] legoktm: yes, but not easily, as we don't let users fake it normally :) [17:45:58] the carriers database has test entries that we've used for that before [17:46:57] you can have someone that manages the zero portal add the IPs you're testing from to an existing or new test-carrier, basically [17:47:44] gotcha [17:47:46] usually they have a carrier name like "TEST2" or something instead of a real MCC-MNC code [17:47:53] will ask :) [17:48:04] but if you get your IP added to one, your requests will come through with e.g. X-Carrier: TEST2 [17:55:13] legoktm: in beta cluster, you could also hack through that kinda [17:55:56] legoktm: I was just remembering that. so on a beta varnish you could disable the cronjob that periodically updates /var/netmapper/carriers.json (which has different/limited test data compared to prod anyways), and manually edit that JSON file and put your test source IPs in it [17:56:08] for a quick check anyways [17:57:25] ahh, ok, I'll try that once someone merges my patch [20:22:48] Hi all, this channel was recommended to me on the mediawiki-l list in response to a Varnish hit ratio problem I'm having after upgrading from MW 1.24 to 1.26. I'm looking for help in diagnosing why my hit ratio dropped from ~85% to ~37% after the upgrade. [20:41:35] 7HTTPS, 10Traffic, 6Operations, 13Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2162597 (10Dzahn) @Krenair what do you think about removing the Apache from that entirely, also HTTP? [20:42:45] 7HTTPS, 10Traffic, 6Operations, 13Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2162599 (10Krenair) Fine with me. @Krinkle? [20:49:50] 7HTTPS, 10Traffic, 6Operations, 13Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2162646 (10Krinkle) Sounds fine. I think it'd be nice if we can find a way to serve the redirect from misc-web-lb, but that's probably not feasible. [21:09:58] 7HTTPS, 10Traffic, 6Operations, 13Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2162699 (10Krenair) The only way I can think of to get out of having to put a certificate on the box would be to forward traffic on ports 80 and 443 to misc-web-lb, but... [22:27:39] 7HTTPS, 10Traffic, 6Operations, 13Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2163530 (10Dzahn) a:3Dzahn