[01:30:08] <wikibugs>	 HTTPS, Traffic, Wikimedia-Shop: Canonical URL in Store points to HTTP address, should be HTTPS - https://phabricator.wikimedia.org/T131131#2156832 (Volker_E)
[02:50:56] <wikibugs>	 Traffic, Operations, Performance-Team, Patch-For-Review: Update CP cookie VCL once HTTP/2 support lands - https://phabricator.wikimedia.org/T118892#2156942 (BBlack) Open>Resolved a:BBlack
[02:51:00] <wikibugs>	 Traffic, Operations, Performance-Team, Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2156944 (BBlack)
[03:29:13] <wikibugs>	 Traffic, Operations, ops-eqiad: investigate radon crash - https://phabricator.wikimedia.org/T131053#2157006 (BBlack) Still up, turning traffic back on for now...
[10:55:23] <elukey>	 update for varnish-kafka: while testing the last change for the cache removal in a Vagrant image with Jessie I found that RES and %MEM were increasing in top (for the vk process)
[10:55:50] <elukey>	 so I thought I had a leak, and started a valgrind run to figure out where it was
[10:57:51] <elukey>	 valgrind didn't come back with straight leaks, but only 9K of memory that is 'probably' a leak. These turned out to be one-time allocations that were also present in vk3. After a chat with ema I created another VM, with wikimedia's repos and pinning for varnish 3. The leak is still there.
[10:58:36] <elukey>	 this is probably something related to VirtualBox or possibly I am missing something big and stupid.
[10:58:52] <elukey>	 but cp1052 doesn't show a big memory consumption for vk
[11:06:34] <elukey>	 valgrind extended out: https://dpaste.de/p2f1
[11:07:51] <elukey>	 there is only one single "blocks are possibly lost" entry but AFAIK it is code that I haven't touched
[11:30:11] <elukey>	 and finally, testing mediawiki-vagrant leads to a much smaller memory pressure
[11:30:38] <elukey>	 (but I can see a steady increase)
[11:32:44] <elukey>	 (that ends up in a stable position, ~0.8/1% MEM used..)
[11:33:04] <elukey>	 all right, my testing environments are probably busted, waiting for suggestions :)
[12:53:48] <bblack>	 good morning :)
[12:54:05] <bblack>	 taking a peek at your valgrind output, sometimes I can track these things down looking at the source, on a good day :)
[13:13:01] <elukey>	 bblack: gooood morning :)
[13:13:24] <elukey>	 https://gerrit.wikimedia.org/r/#/c/276439 has been updated with the new commit but I am not sure if I did the right thing
[13:13:32] <elukey>	 the topic contains both changes though
[13:14:10] <elukey>	 I think the VirtualBox's hypervisor is somehow tricking me
[13:17:31] <bblack>	 I donno
[13:17:43] <bblack>	 I think the vk code is pretty bad at cleanly managing memory too :)
[13:20:13] <bblack>	 I'm starting from 276439, I think I may make some commits on top that clean up the code and make it easier to understand
[13:20:21] <bblack>	 we'll see how that plays out
[13:22:52] <elukey>	 sure, let me know if I can help
[13:26:28] <bblack>	 what are you guys using for a basic build environment for vk?
[13:26:38] <bblack>	 I used to have a labs vm for it with all the pre-reqs installed
[13:28:05] <ema>	 bblack: as a build environment I'm using git-pbuilder on my workstation or copper
[13:28:11] <bblack>	 ok
[13:28:44] <ema>	 APT_USE_BUILT=yes GIT_PBUILDER_OUTPUT_DIR=/var/cache/pbuilder/result/jessie-amd64 ARCH=amd64 DIST=jessie WIKIMEDIA=yes git-buildpackage -j8 -us -uc -sa --git-builder=git-pbuilder
[13:28:52] <bblack>	 yeah I guess I meant more playground.  I use git-pbuilder on copper to build packages, but my workstation isn't right for this at all
[13:29:05] <ema>	 oh I see
[13:29:13] <bblack>	 like, what's the easy path to get a shell somewhere, where I can manually build and patch and try compiler flags and valgrind and blah blah
[13:29:33] <ema>	 I have a self-hosted puppetmaster for that
[13:29:37] <bblack>	 I can figure out something, but it's interesting to see if you guys have found easier ways since you've just gotten set up here :)
[13:29:46] <bblack>	 in labs?
[13:29:53] <bblack>	 or as a VM on your own box?
[13:29:57] <ema>	 in labs
[13:30:03] <bblack>	 ok
[13:30:07] <ema>	 I can give you access to it if you like
[13:30:29] <bblack>	 nah that's ok I'll just set another one up, so we don't mess up setups
[13:30:33] <elukey>	 I set up VMs on my laptop.. mediawiki-vagrant was handy to check kafka
[13:30:47] <bblack>	 I've had a few over time, but I tend to delete them when I'm not using them much
[13:32:27] <bblack>	 oh wait, I still have my last one :)
[13:32:34] <bblack>	 I guess it never got deleted
[13:32:44] <ema>	 good!
[13:38:36] * elukey is reading about git-pbuilder 
[13:39:20] <elukey>	 and copper too.. site.pp always has an unexplored corner
[13:40:10] <ema>	 elukey: https://phab.wmfusercontent.org/file/data/lc3hsm76j6fvudrytno7/PHID-FILE-zq2be526yd3rxbnp74vs/vrgxsvad4tcctg2f/README.md
[13:43:25] <bblack>	 yeah this code needs some cleanup, probably errors and leaks will become apparent during that...
[13:43:56] <bblack>	 I built it on my labs VM with my usual pedantic gcc warnings flags, that I usually aim to get mostly-clean, and there's 260 warnings :P
[13:45:33] <elukey>	 bblack: what flags do you use? I used -Werror -Wall 
[13:46:36] <bblack>	 well I copied them from gdnsd's configure.ac, but basically:
[13:46:40] <bblack>	 -Wall -Wextra -Wbad-function-cast -Wcast-align -Wcast-qual -Wendif-labels -Wfloat-equal -Wfloat-conversion -Wformat=2 -Winit-self -Wlogical-op -Wmissing-declarations -Wmissing-include-dirs -Wmissing-prototypes -Wold-style-definition -Wpointer-arith -Wredundant-decls -Wshadow -Wsign-conversion -Wstrict-overflow=5 -Wstrict-prototypes -Wswitch-default -Wundef -Wunused -Wwrite-strings
[13:46:56] * elukey hides
[13:47:16] <bblack>	 and I haven't updated those in a while, that's just from the last time I ran through the compiler documentation and tested out which ones are useful vs annoying
[13:47:56] <bblack>	 technically you can violate them all and your code can run fine, but I tend to think code that doesn't trip them up is cleaner and easier to reason about :)
[13:48:22] <bblack>	 if I can get those close to clean, next step is static analysis...
[13:50:10] <moritzm>	 and once you've squashed these all, gcc 6 comes along :-) https://gnu.wildebeest.org/blog/mjw/2016/02/15/looking-forward-to-gcc6-many-new-warnings/
[13:50:43] <bblack>	 :)
[14:02:05] <ema>	 in other news it is now possible to upgrade nodes to v4 (and downgrade back to v3) without anything major exploding 
[14:02:31] <ema>	 1) toggle varnish_version4 2) service varnish-frontend stop ; service varnish stop ; apt-get -y remove libvarnishapi1 ; puppet agent -tv
[14:10:07] <bblack>	 nice!
[14:10:09] <elukey>	 \0/
[14:11:54] <godog>	 \o/ nice job!
[14:39:42] <bblack>	 meh the cleanliness problems run deep.  I'm not sure I want to get into making such dramatic changes
[14:40:06] <bblack>	 I think I'll keep going with local cleanup commits, but only to eventually reach the point where I find a real problem, then rewind and backport that to the existing code :P
[14:40:45] <bblack>	 there's a lot of slop with integer data types that makes it confusing. mixing up appropriate uses of int vs ssize_t vs size_t
[14:40:57] <bblack>	 but fixing them all could break some other existing fragile code, too
[14:41:48] <bblack>	 yet another reason to eventually replace this with python :)
[14:44:37] <elukey>	 yep we all agree :)
[14:44:56] <elukey>	 we could also explore something in Go
[14:45:39] <elukey>	 but from what we saw python should be good enough with CTypes
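A minimal sketch of the int vs ssize_t vs size_t confusion bblack describes above, assuming a POSIX system. The helper names `len_unchecked` and `len_checked` are hypothetical and are not taken from varnishkafka; the point is only that a signed error return (such as -1 from a read-style call) must be checked before it is converted to an unsigned length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>   /* ssize_t (POSIX) */

/* Hypothetical helper names, for illustration only. */

/* Risky: a -1 error return silently becomes SIZE_MAX after conversion. */
size_t len_unchecked(ssize_t rc)
{
    return (size_t)rc;
}

/* Safer: reject negative values before converting to size_t.
 * Returns 0 and stores the length on success, -1 on error. */
int len_checked(ssize_t rc, size_t *out)
{
    assert(out != NULL);
    if (rc < 0)
        return -1;
    *out = (size_t)rc;
    return 0;
}
```

Flags such as -Wsign-conversion from the list quoted earlier are meant to surface the implicit form of the first conversion, which is presumably why the warning count is so high.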
[16:02:41] <ema>	 bblack: are we stopping varnish simply with 'service varnish stop' or do we also use the stop cli command?
[16:02:48] <ema>	 https://www.varnish-cache.org/trac/ticket/819
[16:03:25] <ema>	 while testing persistent storage I've confirmed that objects are not persisted unless you do varnishadm -S /etc/varnish/secret -T 127.0.0.1:6083 stop
[16:03:55] <ema>	 in my limited testing segments don't get full of course :)
[16:08:05] <ema>	 but yeah I guess we don't care about losing objects that are not persisted because their segment is not full yet
[16:14:55] <bblack>	 right
[16:15:01] <bblack>	 we don't use cli stop
[16:15:15] <bblack>	 but persistent storage has silos.  we expect to lose the latest silo (the one still open)
[16:15:33] <bblack>	 does cli stop actually save all of the latest silo too?
[16:15:43] <ema>	 it looks like, yeah
[16:17:01] <bblack>	 I wonder if that makes stop much slower in practice, though (enough to care)
[16:17:40] <bblack>	 we could test it at some point, make a task about improving stop for persistent
[16:18:13] <bblack>	 (also, usually varnishadm doesn't need the other args, just "varnishadm X" or "varnishadm -n frontend X"
[16:18:55] <bblack>	 )
[16:20:07] <ema>	 cli stop takes about 1s longer than service stop, but that's with an empty silo basically
[16:20:27] <ema>	 without much of anything really, perhaps 2 objects? :P
[16:22:49] <bblack>	 yeah
[16:23:02] <bblack>	 I just tested the effect via systemd on exit codes and all that on cp1008, it seems sane
[16:23:13] <bblack>	 just need to see how bad the timing is and maybe add a timeout parameter to systemd too
[16:23:40] <ema>	 alright
[16:23:43] <bblack>	 I'll depool a ulsfo upload node and see what happens there, they have pretty full/busy storage
[16:37:08] <wikibugs>	 Traffic, Varnish, Operations: Improve varnish stop for backend instances - https://phabricator.wikimedia.org/T131163#2157924 (ema)
[16:41:46] <bblack>	 ema: I think that bug is outdated, or our reading of what he writes there is incorrect, at least in v3-land
[16:42:06] <bblack>	 I think sending SIGINT to the master process or doing cli stop first both end up doing the same thing
[16:42:24] <bblack>	 sending SIGINT/TERM directly to the *child* process might make things unclean-er
[16:42:45] <ema>	 bblack: I've reproduced the following: 1) backend miss 2) backend hit 3) service varnish restart 4) backend miss
[16:42:56] <bblack>	 on v3 or v4?
[16:42:58] <ema>	 v4
[16:43:15] <elukey>	 https://news.ycombinator.com/item?id=11379985
[16:43:18] <ema>	 replacing 3) with varnishadm stop ; service varnish restart I get a hit in 4)
[16:43:39] <bblack>	 so in v3:
[16:43:42] <bblack>	 bin/varnishd/mgt_cli.c: { CLI_SERVER_STOP,      "", mcf_server_startstop, cli_proto },
[16:43:52] <bblack>	 ^ cli stop -> invoke mcf_server_startstop
[16:44:21] <bblack>	 and in that case (invoked from that cli callback), mcf_server_startstop does:
[16:44:24] <bblack>	                 mgt_stop_child();
[16:44:53] <bblack>	 elsewhere in the same file as mcf_server_startstop ( bin/varnishd/mgt_child.c ):
[16:45:04] <bblack>	 mgt_sigint does:
[16:45:04] <bblack>	         REPORT0(LOG_ERR, "Manager got SIGINT");
[16:45:05] <bblack>	         (void)fflush(stdout);
[16:45:05] <bblack>	         if (child_pid >= 0)
[16:45:05] <bblack>	                 mgt_stop_child();
[16:45:39] <bblack>	 I only went looking because the behaviors seemed odd in practice on v3 when testing
[16:47:20] <ema>	 oh so it should do the same thing
[16:47:24] <bblack>	 I'm not 100% sure, even though that sounds convincing
[16:48:16] <bblack>	 the first thing I ran into is that stop didn't seem to be acting synchronous
[16:48:27] <bblack>	 but it could just be fast too, I don't know
[16:48:45] <bblack>	 hmmmm
[16:49:53] <bblack>	 I don't think it's worth worrying about in any case
[16:50:18] <bblack>	 when we stop varnish backends in practice, it's either isolated, or it's carefully spaced out to minimize impact and/or the frontends are up taking the bulk load
[16:50:52] <bblack>	 in the ugly scenarios where persistence "saves" us, it's because all but the last silo are still ok after an unclean shut (e.g. powerfail), so execstop never even runs
[16:51:53] <bblack>	 what may be confusing me, is that in the default scenario systemd may be sending sigint to both the master and the child in rapid succession
[16:52:00] <bblack>	 so the master never gets to cleanly shutdown the child
[16:52:14] <bblack>	 or sigterm
[16:52:43] <ema>	 sounds likely yeah
[16:52:58] <ema>	 otherwise service stop should take longer than varnishadm stop
[16:53:05] <bblack>	 right
[16:54:35] <bblack>	 I'll mess around on cp1008 a bit more during the meeting and see if I can make ExecStop + ExecStopTimeout or whatever actually wait a bit to do the signals after a varnishadm stop
[16:54:56] <bblack>	 or figure out whatever's going on with races/signals there
[17:55:35] <bblack>	 ema: can you try your reproduction, re: persistent+stop, with adding "KillMode=process" to the [Service] section of the unit file (and then don't forget systemctl daemon-reload)
[17:55:56] <bblack>	 I *think* that might make "service varnish restart" work and not drop your cache entry
[17:56:42] <bblack>	 it seems to DTRT in some limited testing on v3 anyways
[18:02:01] <ema>	 bblack: it does the right thing!
[18:04:08] <ema>	 I'm off for today, see you tomorrow for our first maps node upgrade to v4 :)
[18:16:34] <elukey>	 going off too, will read in here for updates on vk.. Tomorrow I have time to work on it if needed!
[18:22:37] <bblack>	 ok, see you :)
[19:17:59] <wikibugs>	 Traffic, DNS, Fundraising-Backlog, Operations, fundraising-tech-ops: Updating DNS records for Major Gifts subdomain (benefactors.wikimedia.org) - https://phabricator.wikimedia.org/T130937#2158625 (DStrine) a:Jgreen
[19:51:43] <bblack>	 AFAICS, we've never used varnishkafka's "format.key" config, and thus never use FMT_CONF_KEY and all that entails for the code
[19:52:19] <bblack>	 quite a bit of looping/indirection/etc can be killed by removing that feature
[20:46:01] <wikibugs>	 Traffic, Varnish, Operations: Improve varnish stop for backend instances - https://phabricator.wikimedia.org/T131163#2158870 (BBlack) Open>Resolved a:BBlack https://gerrit.wikimedia.org/r/#/c/280268/
[21:57:25] <bblack>	 really the most suspicious part of vk for any kind of persistent runtime-loop memory leak is all the lp scratch/tmpbuf allocation stuff
[21:57:43] <bblack>	 I mean on the surface it looks "right", but it's overly-complex
[21:58:05] <bblack>	 (in some attempt to keep all of the allocation of 1x "lp" as a contiguous memory block as much as possible, I guess)
[21:58:30] <bblack>	 I'm thinking about maybe just simplifying all of that, it's probably premature optimization
[22:00:16] <bblack>	 lp->match is actually allocated as part of lp, even though it's a separate array in C terms.  lp is allocated to include its size on the end, and then lp->match is set to (lp + some offset off the end)
[22:00:40] <bblack>	 and then similarly lp->scratch is an array that runs off the end and is allocated with lp (before the lp->match memory)
[22:01:00] <bblack>	 and then separately there's lp->tmpbufs which is an array of overflow scratch that's malloc'd on demand and they all get freed after every line anyways
[22:01:34] <bblack>	 and then I'm pretty sure the string contents of lp->match end up being pointers into tmpbufs and/or scratch
[22:08:12] <bblack>	 when you think about it, how much data can there really be?
[22:08:34] <bblack>	 scratch_size is 4K by default (we don't override), tmpbufs are dynamically allocated to fit
[22:08:46] <bblack>	 we only have 1x "lp" anymore anyways since we're not caching multiple loglines
[22:08:55] <bblack>	 (for the life of the daemon)
[22:09:12] <bblack>	 we could just bump the default scratch size to something like 1MB and get rid of tmpbufs at least
[22:09:25] <bblack>	 who cares if a daemon takes 1MB of extra static memory from the get-go?
[22:09:45] <bblack>	 (and surely the formatted logs of one request, even a crazy one, can't go past that)
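The single-allocation layout bblack describes above (the struct, then a scratch area, then the match array, all in one block) can be sketched roughly as follows. This is an illustration under stated assumptions, not the varnishkafka code: `struct logline`, `struct match`, and every function name here are hypothetical stand-ins, and the overflow tmpbufs are deliberately left out, in line with the suggestion to use one generously sized scratch area instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the real varnishkafka types. */
struct match {
    const char *ptr;   /* points into the scratch area */
    size_t      len;
};

struct logline {
    size_t        scratch_size;
    size_t        scratch_used;
    size_t        match_cnt;
    char         *scratch;  /* points just past the struct itself */
    struct match *match;    /* points past the scratch area */
};

/* Round n up to a multiple of a. */
static size_t align_up(size_t n, size_t a)
{
    return (n + a - 1) / a * a;
}

/* One allocation holds: [struct logline][scratch bytes][padding][match array]. */
struct logline *logline_new(size_t scratch_size, size_t match_cnt)
{
    /* sizeof(T) is a multiple of T's alignment, so this keeps match[] aligned. */
    size_t match_off = align_up(sizeof(struct logline) + scratch_size,
                                sizeof(struct match));
    size_t total = match_off + match_cnt * sizeof(struct match);
    struct logline *lp = calloc(1, total);

    if (lp == NULL)
        return NULL;
    lp->scratch_size = scratch_size;
    lp->match_cnt = match_cnt;
    lp->scratch = (char *)lp + sizeof(struct logline);
    lp->match = (struct match *)((char *)lp + match_off);
    return lp;
}

/* Copy len bytes into scratch and record them as match[idx].
 * Returns 0 on success, -1 if the scratch area has no room left. */
int logline_set_match(struct logline *lp, size_t idx,
                      const char *src, size_t len)
{
    char *dst;

    assert(idx < lp->match_cnt);
    if (len > lp->scratch_size - lp->scratch_used)
        return -1;  /* this is where an overflow buffer would be needed */
    dst = lp->scratch + lp->scratch_used;
    memcpy(dst, src, len);
    lp->scratch_used += len;
    lp->match[idx].ptr = dst;
    lp->match[idx].len = len;
    return 0;
}

/* Reuse the same block for the next log line; nothing is freed. */
void logline_reset(struct logline *lp)
{
    size_t i;

    lp->scratch_used = 0;
    for (i = 0; i < lp->match_cnt; i++) {
        lp->match[i].ptr = NULL;
        lp->match[i].len = 0;
    }
}
```

With this shape a single `free(lp)` releases everything, and per-line cleanup is just a reset, which is the simplification being argued for; the cost is that an oversized line fails instead of spilling into a malloc'd buffer.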
[22:45:38] <wikibugs>	 HTTPS, Traffic, Operations: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2152741 (Dzahn) That redirect only exists because it used to be an "It works!" Apache site in the past and i thought it was ugly so redirected it to that meta page a long time ago. So it's...
[23:19:05] <wikibugs>	 Traffic, domains, Operations: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2159456 (Dzahn) We could do this, but minus the redirect. That would mean nlwikipedia.org would simply be not found, like for example http://www.wikipedia.es/  but it would still be...