[08:01:48] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=13&fullscreen shows no varnishkafka seqno restart after ema's patch of yesterday
[08:01:52] \o/
[08:58:44] elukey: nice!
[09:28:04] 20:07 < bblack> "make it efficient" covers a lot of ground I know :P
[09:28:15] less than "make it right" though!
[09:29:41] TIL: commonswiki's jobrunner is responsible for about 75% of our PURGE traffic
[09:29:47] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=6&fullscreen&from=1480490749032&to=1480497150431&var-site=All&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5
[09:33:03] 66%, not 75% heh
[09:38:42] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2824625 (10ema) @doctaxon still happening?
[09:44:11] 10Traffic, 06Operations, 13Patch-For-Review: Huge increase in cache_upload 404s due to buggy client-side code from graphiq.com - https://phabricator.wikimedia.org/T151444#2834370 (10ema) p:05Normal>03Low The amount of upload 404s decreased significantly, we're almost back to normal: https://grafana.wikim...
[10:22:02] 10netops, 06Operations: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2834399 (10elukey)
[10:22:16] 10netops, 06Operations: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2834411 (10elukey) p:05Triage>03Normal
[10:33:45] 10netops, 06Operations: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2834422 (10elukey) Had a chat with Alex on IRC about what stat1001 does and what level of access it should have. For the Apache VHosts point of view it would be better to have...
[10:37:23] 10netops, 10Analytics, 06Operations: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2834430 (10elukey)
[11:08:46] 10Traffic, 06Operations, 13Patch-For-Review: Huge increase in cache_upload 404s due to buggy client-side code from graphiq.com - https://phabricator.wikimedia.org/T151444#2834521 (10ema) Leaving this ticket open though given that they still haven't fixed the javascript bug. The decrease in 404s is probably d...
[12:40:21] 10Traffic, 06Operations, 13Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#2834702 (10ema) I've confirmed with stap that the overruns come from vslc_vsm_next, which in turn calls vslc_vsm_check: https://github.com/varnishcache/varnish-ca...
[13:06:11] it's still not very clear to me what the difference between -l and vsl_space is
[13:10:14] anyways, running varnish-fe on cp3032 with -l512M,2M
[13:11:12] ema: do we need both set to increase the VSL space?
[13:11:25] elukey: that's what I'm confused about
[13:11:25] I don't recall from your tests what was the right combination
[13:11:29] now I've just changed -l
[13:11:52] we still get the overruns, but less
[13:12:15] we were getting one overrun every 4 seconds before
[13:12:24] maybe -l overrides vsl_space and allows also the vsm stats counter space setting?
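As a side note on the -l vs vsl_space question: the Varnish 4 varnishd(1) docs describe -l as sizing the shared memory file, with the first value going to VSL records and the second to the stats counters, which matches the vsl_space and vsm_space parameters; assuming that reading is correct, the two invocations below should be roughly equivalent, so bumping just -l ought to be enough:

    varnishd -l 512M,2M ...
    varnishd -p vsl_space=512M -p vsm_space=2M ...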
[13:12:39] now it's one every 33 seconds
[13:13:56] (give or take)
[13:14:52] another side of the investigation could be to check if the python script could be improved to speed it up
[13:18:02] yep
[13:18:05] now running with -l 768M,2M
[13:18:37] one overrun every > 50 seconds
[13:18:59] for example (and I am not super expert so I might say something not true): checks on lists like 'bla in [XYZ]' might be unfolded to see if the generated code speeds up
[13:19:41] lunch, feel free to keep an eye on cp3032 if you like :)
[13:19:57] will try :)
[13:20:29] volans: hola! Do you have 5 minutes for a python question?
[13:20:55] elukey: a bit at lunch but sure
[13:21:11] shoot
[13:22:38] wanted to profile varnishreqstats on a cp host and wondered if your have any experience with stuff like cProfile etc.. (like no this one is horrible, use XYZ)
[13:22:53] s/your/you
[13:23:00] but nothing urgent :)
[13:27:06] is it a daemon or a script that you run and ends?
[13:27:21] daemon
[13:27:41] but it can be run as script
[13:28:28] ok, good, easier to debug
[13:29:37] I am trying something like elukey@cp1008:~$ python -m cProfile prova.py
[13:29:46] that works really well
[13:29:52] (prova just emits "hello!")
[13:31:40] * volans brb, sorry
[13:32:24] sure!
[13:42:00] in the meantime, I am testing "sudo python -m cProfile /usr/local/bin/varnishreqstats --varnish-name=frontend --metric-prefix=varnish.eqiad.text.frontend.request --interval=60 --verbose" on pink unicorn
[13:42:10] (removed the statsd option)
[13:43:17] elukey: also if you can tell me what issue you are looking for I can have a look at both the results and the code
[13:44:33] volans: so basically varnishreqstats reads from the varnish shm log, which is limited and circular (so whenever it gets full, varnish starts overwriting from the beginning)
[13:45:28] ok
[13:45:37] each consumer of the shmlog (like varnishreqstats) might be slow to consume and get to the point in which the Varnish API lib says "sorry the log has been overrun, you are risking to get inconsistent data so bye!"
[13:45:57] so one approach is to increase the shmlog
[13:46:02] and ema is testing it
[13:46:19] but I was wondering if varnishreqstats had some bottlenecks that, if fixed, could improve the situation
[13:46:59] is this happening because varnishreqstats is "doubled" by the varnish shmlog, meaning that it restarted from 0 and already reached varnishreqstats' position?
[13:47:49] from the docs
[13:47:51] "Cursor initialization failure: This error happens when varnishd is writing to the Shared Memory Log faster then the program that is trying to read from it. Since the VSL is a circular buffer Varnish can overrun it and if it happens, the program reading the VSL will end up with inconsistent data."
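As a reference for the cProfile runs above: the profile can be written to a file with -o and inspected offline with pstats; the path and sort key below are just examples, and if interrupting the daemon with Ctrl+C turns out to discard the dump, limiting the run length (e.g. a fixed number of intervals) is a workaround:

    # after something like:
    #   sudo python -m cProfile -o /tmp/varnishreqstats.prof /usr/local/bin/varnishreqstats ...
    import pstats

    stats = pstats.Stats('/tmp/varnishreqstats.prof')
    # sort by cumulative time to see which call trees dominate
    stats.sort_stats('cumulative').print_stats(25)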
[13:48:16] ok so probably it alarms a bit before with some threshold, I guess
[13:48:24] nope
[13:48:29] it just kicks you out
[13:48:44] yes, s/alarms/raise/
[13:49:04] like before actually overrunning
[13:49:22] * volans taking a quick look at varnishreqstats code
[13:49:47] I think that the Varnish VSL library (that all the consumers use to read from shm) detects the overrun and quits
[13:49:54] thanks :)
[13:50:13] but quits only the consumer that is slow
[13:50:53] yep, the consumer uses the library
[13:51:21] basically each consumer copies data from the varnish shm log to its memory address space to do whatever it wants
[13:51:29] using the library
[13:52:16] * volans covers his eyes over the globals
[13:53:03] btw if it's a problem of speed a larger shmlog will just delay it, how could it fix it?
[13:53:14] yes exactly
[13:53:40] it can improve
[14:00:15] elukey: could this be moved to python3?
[14:00:57] so https://github.com/xcir/python-varnishapi supports it (that is the lib)
[14:01:10] I was just checking that
[14:01:16] not sure about the statsd part
[14:01:18] checking
[14:01:42] yeah statsd is the only ?
[14:02:28] because in python3 they rewrote the collections.Counter from python to C and it seems that it's quicker than doing it with dictionaries, and you'll save the deepcopy() and the initialization
[14:04:23] it is worth a try
[14:04:28] Python is not exactly the quickest for those kinds of things, but if we see that the slowness is in the count_vsl_entry_callback() function aside from the statsd publishing, it might be rewritten in CPython or using Cython
[14:04:51] or in Rust!
[14:04:54] :)
[14:05:29] yeah, if there is a lower level library for consuming the log then you can do it directly in that language ;)
[14:06:09] do you have any results from the debugger I can look at?
[14:06:48] statsd dep - Depends: python:any (<< 2.8), python:any (>= 2.7.5-5~)
[14:07:24] I tried to just execute and Control+C
[14:07:30] to get some data
[14:08:16] elukey: https://pypi.python.org/pypi/statsd is py3 also
[14:08:17] anyhow, if cProfile is good enough in your opinion we can get data and ping you in the task
[14:11:10] cProfile is a good start, you could also refine it by adding direct calls to it in the code, but for starting out a single run makes sense, you can also use a smaller log to be quicker
[14:11:23] super
[14:11:50] I had more trouble in the past profiling a tornado daemon, but that's another story :-P
[14:12:27] :D
[14:14:06] although unless something is obviously wrong, count_vsl_entry_callback() should win the profile contest anyway and you might need a line profiler
[14:17:34] elukey: let's see the first run and decide, a line profiler is much slower but might be useful, for some examples see also https://zapier.com/engineering/profiling-python-boss/
[14:18:31] ahhaah nice
[14:20:41] in the meanwhile I'll look if I can spot anything that could be sped up
[14:22:46] do you have a task already?
[14:22:56] https://phabricator.wikimedia.org/T151643
[14:23:12] thx
[14:36:09] another thing we could look into is reducing the number of VSL messages being logged
[14:36:28] the default vsl_mask is -VCL_trace,-WorkThread,-Hash,-VfpAcct
[14:36:44] of course :D I'm seeing some potential improvements, doing some timing locally
[14:37:05] for instance, I don't remember ever having used VCL_acl
[14:40:47] not yet! :P
[14:43:55] when do you think you'll have the data, elukey?
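For context, a rough sketch of the collections.Counter idea floated above; this is a hypothetical shape, not the actual varnishreqstats code, and publish() stands in for whatever statsd client is used:

    from collections import Counter

    counts = Counter()

    def count_vsl_entry_callback(key):
        # one increment per matching VSL record; Counter defaults missing keys
        # to 0, so no pre-initialization of the per-interval dict is needed
        counts[key] += 1

    def flush_interval(publish):
        # swap in a fresh Counter instead of deepcopy()ing the old one,
        # then push out the per-interval snapshot
        global counts
        snapshot, counts = counts, Counter()
        for key, value in snapshot.items():
            publish(key, value)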
[14:45:39] * volans bets that deepcopy() is somehow involved, its performance for large objects is horrible, but with a small dict I'm unsure
[14:45:46] I am not currently working on it since I have other analytics tasks to do atm, but might work on it tomorrow. Do you need cProfile data + possibly others?
[14:47:38] note also that it's not only varnishreqstats suffering from perf issues, it's pretty much everything depending on varnishlog/varnishapi
[14:47:40] elukey: if we have some indication from the cProfile that will be great, falling back to gut otherwise :-P
[14:48:05] ema: all of them python?
[14:48:10] y
[14:48:35] so maybe the library is not super-optimized or it just cannot cope with our traffic :D
[14:49:05] yeah I wouldn't spend too much time blaming the language to be fair
[14:49:30] yeah it could well be the varnishlog.py/python-varnishapi
[14:50:49] volans: ./modules/varnish/files/varnishapi.py ./modules/varnish/files/varnishlog.py
[14:52:01] ok thanks
[14:52:07] well thank you :)
[15:00:55] 10netops, 10Analytics, 06Operations: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2835067 (10Ottomata) Uh OH! This is SUPPOSED to be in the analytics VLAN! https://phabricator.wikimedia.org/T149911 Re-opening that ticket. Sorry, I shoulda...
[15:02:03] 10netops, 10Analytics, 06Operations: Thorium (new stat1001) needs to communicate with the Analytics VLAN - https://phabricator.wikimedia.org/T151990#2835074 (10Ottomata)
[15:25:25] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2835098 (10doctaxon) @ema * In a window of 2016-11-29 19:00 until 23:59 CET I couldn't receive any 502 Bad Gateway doing a lot of API queries using a permanent query loop. * I could receive the last 502 Bad Gatewa...
[15:26:20] 10Traffic, 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2787072 (10ema) The errors mentioned in the ticket description seem to be gone from cp4001.
[15:29:55] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2835114 (10ema) 05Open>03Resolved a:03ema @doctaxon great, thanks for confirming that the problem is solved. The default value for an internal varnish setting was not large enough and that was causing crashes wit...
[15:45:24] so, I've checked the code a bit, actually the deepcopy stuff should not hurt, it's done only at each interval, so 60 seconds by default. Moving to python3 should also have other performance improvements like decode() and others
[15:45:38] ema: what was the reason behing the time.sleep(0.01) in varnishlog.py?
[15:47:32] s/behing/behind/
[15:49:27] from my quick tests 24% of the time is spent there
[15:51:04] IIRC that had something to do with avoiding burning up CPU spinning when there was no record immediately available. I'm not sure though.
[15:51:58] we do it only if ret=0, I need to check the API doc/code for what that means
[15:52:31] what brandon said
[15:53:00] looks like we're reinventing the varnishkafka wheel again but in python btw
[15:53:08] lol
[15:53:16] I mean with all the python stats daemons
[15:53:40] the (solved) problem we're trying to solve is reading from the vsm fast and doing something with what we've read
[15:54:07] are we pretty sure the problem lies in python not being able to keep up with the rate?
[15:54:34] as opposed to, some deeper problem where VSL transactions have long periods between their initial record and completion or something, due to the 2-minute timeout requests we've seen, etc?
[15:54:45] (causing us to need to buffer 2+ minutes of TXNs to make this work, which is insane)
[15:54:54] varnishncsa/log with the same queries doesn't seem to cause overruns IIRC
[15:54:58] ok
[15:56:00] we could go back to the varnishncsa wrapper stuff
[15:56:06] we talked about eventually ending up there anyways
[15:56:06] or https://code.uplex.de/uplex-varnish/varnishevent
[15:59:56] also vk doesn't seem to suffer from the overrun problem
[16:00:30] to be clear, it's not that I think python is necessarily too slow for this, what I'm saying is that the problem of reading vsl fast is solved already
[16:01:12] yep yep
[16:01:52] same opinion, I didn't want to blame python but only suggest checking that option (so perf related, rather than vsl-size related)
[16:03:31] in particular varnishevent seems better than varnishncsa for our stats programs because of:
[16:03:36] - varnishncsa only logs client transactions, while varnishevent may also log backend or raw transactions.
[16:03:42] - varnishevent is designed to keep pace reading the Varnish shared memory log while varnishd is writing to it rapidly under heavy load, and output channels are slow.
[16:03:50] (from https://code.uplex.de/uplex-varnish/varnishevent/blob/master/doc/varnishevent.rst)
[16:04:47] even though varnishncsa was doing pretty well with our load while we were evaluating it as a possible replacement for vk
[16:11:24] also we need to verify that varnishevent can support our load
[16:29:37] it might also be nice to write varnishreqstats in Go and test https://github.com/phenomenes/vago
[16:29:41] or similar
[16:30:15] the script is short enough that a prototype could be quick to write
[16:44:31] ok the time.sleep() was a red herring, it's relevant only on pinkunicorn, so it's working as expected. On a real host 63% of the time is inside varnishapi.py:879(_callBack), which has 2 nested while 1: loops, I'll try to get the line profiler
[16:54:49] https://www.imperialviolet.org/2016/11/28/cecpq1.html
[16:55:25] ^ apparently the cecpq1 experiment was pretty successful, and it's nice to see Google taking the approach of withdrawing it completely post-experiment instead of ramming it through as a pseudo-standard
[16:56:45] 10Traffic, 10Analytics, 10Analytics-Cluster, 06Operations: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#2835376 (10Ottomata)
[18:07:56] <_joe_> I <3 agl
[18:11:51] 10Traffic, 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2835710 (10fgiunchedi) @ema indeed the new upstream version should fix the errors. What's left to figure out is what to...
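For reference, the usual line_profiler workflow (the one in the zapier article linked earlier) is to decorate the suspect function and run the script under kernprof; the decorated function below is only an example target, not a claim about where the hot spot is:

    try:
        profile                      # injected into builtins by kernprof
    except NameError:
        profile = lambda f: f        # no-op fallback so the module still runs without kernprof

    @profile
    def count_vsl_entry_callback(key):
        pass                         # the real body goes here

    # then:
    #   kernprof -l -v /usr/local/bin/varnishreqstats --varnish-name=frontend ...
    #   python -m line_profiler varnishreqstats.lprof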
[18:14:13] ema, elukey, so I've done some line profiling on varnishapi.py and most of the time is spent calling functions in the .so (as expected)
[18:16:28] for the 2 nested while 1: loops, for 100k calls of varnishlog.py:execute() the first line of the inner while is called 3M times
[18:18:59] let me know your intentions, if you want to try the py3 road + other optimizations to shave off some time and see if it's enough to cope with our traffic
[18:19:31] or go directly to something compiled in a different language
[18:22:20] thanks a lot volans :)
[18:22:30] I'll add a summary to the task too
[18:22:31] let's re-sync tomorrow
[18:22:33] super thanks
[18:22:33] sure
[19:29:09] 10netops, 06Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2835986 (10faidon) JTAC thinks this may be [[ https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR957108 | PR957108 ]]: ``` Title: IPv6 neighbor...
[19:39:57] 10Traffic, 06Operations, 13Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#2823504 (10Volans) Quick summary from my tests of today: - most of the time is spent calling `_callBack()` in varnishapi.py:879` - regarding the 2 `while 1:`...
[20:50:19] 07HTTPS, 10Traffic, 06Operations, 06WMF-Communications, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2836315 (10Florian)
[21:24:43] 10Domains, 10Traffic, 06Operations, 06WMF-Legal: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2836472 (10CRoslof) 05Invalid>03Resolved Update: The Wikimedia Foundation has acquired nlwikipedia.org
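Circling back to the dispatch loop discussed above, the overall shape being profiled is roughly the sketch below; it's simplified, and the names are illustrative rather than the exact python-varnishapi API:

    import time

    def run_loop(vsl_dispatch, callback):
        # vsl_dispatch stands in for the library call that reads whatever VSL
        # records are currently available and invokes callback for each one
        while True:
            ret = vsl_dispatch(callback)
            if ret == 0:
                # nothing new in the shared memory log yet: back off briefly
                # instead of spinning, which is what the sleep(0.01) is there for
                time.sleep(0.01)
            elif ret < 0:
                # the reader fell behind the circular buffer (e.g. "Log overrun")
                # or the log went away; bail out and let the caller reattach
                break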