[01:30:08] HTTPS, Traffic, Wikimedia-Shop: Canonical URL in Store points to HTTP address, should be HTTPS - https://phabricator.wikimedia.org/T131131#2156832 (Volker_E)
[02:50:56] Traffic, Operations, Performance-Team, Patch-For-Review: Update CP cookie VCL once HTTP/2 support lands - https://phabricator.wikimedia.org/T118892#2156942 (BBlack) Open>Resolved a:BBlack
[02:51:00] Traffic, Operations, Performance-Team, Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2156944 (BBlack)
[03:29:13] Traffic, Operations, ops-eqiad: investigate radon crash - https://phabricator.wikimedia.org/T131053#2157006 (BBlack) Still up, turning traffic back on for now...
[10:55:23] update for varnish-kafka: while testing the last change for the cache removal in a Vagrant image with Jessie I found that RES and %MEM were increasing in top (for the vk process)
[10:55:50] so I thought I had a leak, and started a valgrind run to figure out where it was
[10:57:51] valgrind didn't come back with straight leaks, but only 9K of memory that is 'probably' a leak. Turned out to be one-time allocations that were also present in vk3. After a chat with ema I created another VM, with wikimedia's repos and pinning for varnish 3. The leak is still there.
[10:58:36] this is probably something related to VirtualBox or possibly I am missing something big and stupid.
[10:58:52] but cp1052 doesn't show a big memory consumption for vk
[11:06:34] valgrind extended out: https://dpaste.de/p2f1
[11:07:51] there is only one single "blocks are possibly lost" entry but AFAIK it is code that I haven't touched
[11:30:11] and finally, testing mediawiki-vagrant leads to much lower memory pressure
[11:30:38] (but I can see a steady increase)
[11:32:44] (that ends up in a stable position, ~0.8/1% MEM used..)
[11:33:04] all right, my testing environments are probably busted, waiting for suggestions :)
[12:53:48] good morning :)
[12:54:05] taking a peek at your valgrind output, sometimes I can track these things down looking at the source, on a good day :)
[13:13:01] bblack: gooood morning :)
[13:13:24] https://gerrit.wikimedia.org/r/#/c/276439 has been updated with the new commit but I am not sure if I did the right thing
[13:13:32] the topic contains both changes though
[13:14:10] I think the VirtualBox's hypervisor is somehow tricking me
[13:17:31] I donno
[13:17:43] I think the vk code is pretty bad at cleanly managing memory too :)
[13:20:13] I'm starting from 276439, I think I may make some commits on top that clean up the code and make it easier to understand
[13:20:21] we'll see how that plays out
[13:22:52] sure, let me know if I can help
[13:26:28] what are you guys using for a basic build environment for vk?
[13:26:38] I used to have a labs vm for it with all the pre-reqs installed
[13:28:05] bblack: as a build environment I'm using git-pbuilder on my workstation or copper
[13:28:11] ok
[13:28:44] APT_USE_BUILT=yes GIT_PBUILDER_OUTPUT_DIR=/var/cache/pbuilder/result/jessie-amd64 ARCH=amd64 DIST=jessie WIKIMEDIA=yes git-buildpackage -j8 -us -uc -sa --git-builder=git-pbuilder
[13:28:52] yeah I guess I meant more playground. I use git-pbuilder on copper to build packages, but my workstation isn't right for this at all
[13:29:05] oh I see
[13:29:13] like, what's the easy path to get a shell somewhere, where I can manually build and patch and try compiler flags and valgrind and blah blah
[13:29:33] I have a self-hosted puppetmaster for that
[13:29:37] I can figure out something, but it's interesting to see if you guys have found easier ways since you've just gotten set up here :)
[13:29:46] in labs?
[13:29:53] or as a VM on your own box?
[13:29:57] in labs
[13:30:03] ok
[13:30:07] I can give you access to it if you like
[13:30:29] nah that's ok I'll just set another one up, so we don't mess up setups
[13:30:33] I set up VMs on my laptop.. mediawiki-vagrant was handy to check kafka
[13:30:47] I've had a few over time, but I tend to delete them when I'm not using them much
[13:32:27] oh wait, I still have my last one :)
[13:32:34] I guess it never got deleted
[13:32:44] good!
[13:38:36] * elukey is reading about git-pbuilder
[13:39:20] and copper too.. site.pp always has an unexplored corner
[13:40:10] elukey: https://phab.wmfusercontent.org/file/data/lc3hsm76j6fvudrytno7/PHID-FILE-zq2be526yd3rxbnp74vs/vrgxsvad4tcctg2f/README.md
[13:43:25] yeah this code needs some cleanup, probably errors and leaks will become apparent during that...
[13:43:56] I built it on my labs VM with my usual pedantic gcc warnings flags, that I usually aim to get mostly-clean, and there's 260 warnings :P
[13:45:33] bblack: what flags do you use? I used -Werror -Wall
[13:46:36] well I copied them from gdnsd's configure.ac, but basically:
[13:46:40] -Wall -Wextra -Wbad-function-cast -Wcast-align -Wcast-qual -Wendif-labels -Wfloat-equal -Wfloat-conversion -Wformat=2 -Winit-self -Wlogical-op -Wmissing-declarations -Wmissing-include-dirs -Wmissing-prototypes -Wold-style-definition -Wpointer-arith -Wredundant-decls -Wshadow -Wsign-conversion -Wstrict-overflow=5 -Wstrict-prototypes -Wswitch-default -Wundef -Wunused -Wwrite-strings
[13:46:56] * elukey hides
[13:47:16] and I haven't updated those in a while, that's just from the last time I ran through the compiler documentation and tested out which ones are useful vs annoying
[13:47:56] technically you can violate them all and your code can run fine, but I tend to think code that doesn't trip them up is cleaner and easier to reason about :)
[13:48:22] if I can get those close to clean, next step is static analysis...
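(editor's note) To illustrate the kind of issue a flag like -Wsign-conversion from the list above is meant to surface, here is a minimal sketch. It is not varnishkafka code: `copy_bounded` and its parameters are hypothetical names invented for the example. The assumption is a length that arrives as a signed `ssize_t` (as a read()-style return value would) and has to be used as a `size_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h> /* ssize_t */

/*
 * Hypothetical helper (not from varnishkafka): copy at most dst_size bytes
 * of src into dst. Returns the number of bytes copied, or -1 on bad input.
 *
 * Writing "size_t want = src_len;" directly would compile under -Wall, but
 * -Wsign-conversion flags it: a negative src_len would silently become a
 * huge unsigned value. Checking the sign first and casting explicitly keeps
 * the conversion visible and safe.
 */
ssize_t copy_bounded(char *dst, size_t dst_size, const char *src, ssize_t src_len)
{
    if (dst == NULL || src == NULL || src_len < 0)
        return -1;

    size_t want = (size_t)src_len;                /* safe: src_len >= 0 here */
    size_t n = want < dst_size ? want : dst_size; /* clamp to the destination */

    memcpy(dst, src, n);
    return (ssize_t)n;                            /* n <= src_len, so it fits */
}
```

Calling `copy_bounded(buf, sizeof buf, "abcdef", 6)` with a 4-byte `buf` copies 4 bytes, while passing a negative length returns -1 instead of overrunning the buffer.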
[13:50:10] and once you've squashed these all, gcc 6 comes along :-) https://gnu.wildebeest.org/blog/mjw/2016/02/15/looking-forward-to-gcc6-many-new-warnings/
[13:50:43] :)
[14:02:05] in other news it is now possible to upgrade nodes to v4 (and downgrade back to v3) without anything major exploding
[14:02:31] 1) toggle varnish_version4 2) service varnish-frontend stop ; service varnish stop ; apt-get -y remove libvarnishapi1 ; puppet agent -tv
[14:10:07] nice!
[14:10:09] \0/
[14:11:54] \o/ nice job!
[14:39:42] meh the cleanliness problems run deep. I'm not sure I want to get into making such dramatic changes
[14:40:06] I think I'll keep going with local cleanup commits, but only to eventually reach the point where I find a real problem, then rewind and backport that to the existing code :P
[14:40:45] there's a lot of slop with integer data types that makes it confusing. mixing up appropriate uses of int vs ssize_t vs size_t
[14:40:57] but fixing them all could break some other existing fragile code, too
[14:41:48] yet another reason to eventually replace this with python :)
[14:44:37] yep we all agree :)
[14:44:56] we could also explore something in Go
[14:45:39] but from what we saw python should be good enough with ctypes
[16:02:41] bblack: are we stopping varnish simply with 'service varnish stop' or do we also use the stop cli command?
[16:02:48] https://www.varnish-cache.org/trac/ticket/819
[16:03:25] while testing persistent storage I've confirmed that objects are not persisted unless you do varnishadm -S /etc/varnish/secret -T 127.0.0.1:6083 stop
[16:03:55] in my limited testing segments don't get full of course :)
[16:08:05] but yeah I guess we don't care about losing objects that are not persisted because their segment is not full yet
[16:14:55] right
[16:15:01] we don't use cli stop
[16:15:15] but persistent storage has silos. we expect to lose the latest silo (the one still open)
[16:15:33] does cli stop actually save all of the latest silo too?
[16:15:43] it looks like, yeah
[16:17:01] I wonder if that makes stop much slower in practice, though (enough to care)
[16:17:40] we could test it at some point, make a task about improving stop for persistent
[16:18:13] (also, usually varnishadm doesn't need the other args, just "varnishadm X" or "varnishadm -n frontend X"
[16:18:55] )
[16:20:07] cli stop takes about 1s longer than service stop, but that's with an empty silo basically
[16:20:27] without much of anything really, perhaps 2 objects? :P
[16:22:49] yeah
[16:23:02] I just tested the effect via systemd on exit codes and all that on cp1008, it seems sane
[16:23:13] just need to see how bad the timing is and maybe add a timeout parameter to systemd too
[16:23:40] alright
[16:23:43] I'll depool a ulsfo upload node and see what happens there, they have pretty full/busy storage
[16:37:08] Traffic, Varnish, Operations: Improve varnish stop for backend instances - https://phabricator.wikimedia.org/T131163#2157924 (ema)
[16:41:46] ema: I think that bug is outdated, or our reading of what he writes there is incorrect, at least in v3-land
[16:42:06] I think sending SIGINT to the master process or doing cli stop first both end up doing the same thing
[16:42:24] sending SIGINT/TERM directly to the *child* process might make things unclean-er
[16:42:45] bblack: I've reproduced the following: 1) backend miss 2) backend hit 3) service varnish restart 4) backend miss
[16:42:56] on v3 or v4?
[16:42:58] v4
[16:43:15] https://news.ycombinator.com/item?id=11379985
[16:43:18] replacing 3) with varnishadm stop ; service varnish restart I get a hit in 4)
[16:43:39] so in v3:
[16:43:42] bin/varnishd/mgt_cli.c: { CLI_SERVER_STOP, "", mcf_server_startstop, cli_proto },
[16:43:52] ^ cli stop -> invoke mcf_server_startstop
[16:44:21] and in that case (invoked from that cli callback), mcf_server_startstop does:
[16:44:24] mgt_stop_child();
[16:44:53] elsewhere in the same file as mcf_server_startstop ( bin/varnishd/mgt_child.c ):
[16:45:04] mgt_sigint does:
[16:45:04] REPORT0(LOG_ERR, "Manager got SIGINT");
[16:45:05] (void)fflush(stdout);
[16:45:05] if (child_pid >= 0)
[16:45:05] mgt_stop_child();
[16:45:39] I only went looking because the behaviors seemed odd in practice on v3 when testing
[16:47:20] oh so it should do the same thing
[16:47:24] I'm not 100% sure, even though that sounds convincing
[16:48:16] the first thing I ran into is that stop didn't seem to be acting synchronous
[16:48:27] but it could just be fast too, I don't know
[16:48:45] hmmmm
[16:49:53] I don't think it's worth worrying about in any case
[16:50:18] when we stop varnish backends in practice, it's either isolated, or it's carefully spaced out to minimize impact and/or the frontends are up taking the bulk load
[16:50:52] in the ugly scenarios where persistence "saves" us, it's because all but the last silo are still ok after an unclean shut (e.g. powerfail), so execstop never even runs
[16:51:53] what may be confusing me, is that in the default scenario systemd may be sending sigint to both the master and the child in rapid succession
[16:52:00] so the master never gets to cleanly shutdown the child
[16:52:14] or sigterm
[16:52:43] sounds likely yeah
[16:52:58] otherwise service stop should take longer than varnishadm stop
[16:53:05] right
[16:54:35] I'll mess around on cp1008 a bit more during the meeting and see if I can make ExecStop + ExecStopTimeout or whatever actually wait a bit to do the signals after a varnishadm stop
[16:54:56] or figure out whatever's going on with races/signals there
[17:55:35] ema: can you try your reproduction, re: persistent+stop, with adding "KillMode=process" to the [Service] section of the unit file (and then don't forget systemctl daemon-reload)
[17:55:56] I *think* that might make "service varnish restart" work and not drop your cache entry
[17:56:42] it seems to DTRT in some limited testing on v3 anyways
[18:02:01] bblack: it does the right thing!
[18:04:08] I'm off for today, see you tomorrow for our first maps node upgrade to v4 :)
[18:16:34] going off too, will read in here for updates on vk.. Tomorrow I have time to work on it if needed!
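(editor's note) The "KillMode=process" suggestion from 17:55 could be sketched as a systemd drop-in like the one below. The drop-in path is an assumption made for illustration; the log only says "the [Service] section of the unit file", and the change that was eventually merged may look different.

```ini
# /etc/systemd/system/varnish.service.d/killmode.conf  (hypothetical path)
[Service]
# On stop, signal only the unit's main process (the varnish manager)
# instead of every process in the unit's control group.
KillMode=process
```

With this setting the child no longer receives the stop signal directly from systemd, which would fit the theory above that the manager needs a chance to shut the child down itself. As noted in the log, `systemctl daemon-reload` is needed before the change takes effect.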
[18:22:37] ok, see you :)
[19:17:59] Traffic, DNS, Fundraising-Backlog, Operations, fundraising-tech-ops: Updating DNS records for Major Gifts subdomain (benefactors.wikimedia.org) - https://phabricator.wikimedia.org/T130937#2158625 (DStrine) a:Jgreen
[19:51:43] AFAICS, we've never used varnishkafka's "format.key" config, and thus never use FMT_CONF_KEY and all that entails for the code
[19:52:19] quite a bit of looping/indirection/etc can be killed by removing that feature
[20:46:01] Traffic, Varnish, Operations: Improve varnish stop for backend instances - https://phabricator.wikimedia.org/T131163#2158870 (BBlack) Open>Resolved a:BBlack https://gerrit.wikimedia.org/r/#/c/280268/
[21:57:25] really the most suspicious part of vk for any kind of persistent runtime-loop memory leak is all the lp scratch/tmpbuf allocation stuff
[21:57:43] I mean on the surface it looks "right", but it's overly-complex
[21:58:05] (in some attempt to keep all of the allocation of 1x "lp" as a contiguous memory block as much as possible, I guess)
[21:58:30] I'm thinking about maybe just simplifying all of that, it's probably premature optimization
[22:00:16] lp->match is actually allocated as part of lp, even though it's a separate array in C terms. lp is allocated to include its size on the end, and then lp->match is set to (lp + some offset off the end)
[22:00:40] and then similarly lp->scratch is an array that runs off the end and is allocated with lp (before the lp->match memory)
[22:01:00] and then separately there's lp->tmpbufs which is an array of overflow scratch that's malloc'd on demand and they all get freed after every line anyways
[22:01:34] and then I'm pretty sure the string contents of lp->match end up being pointers into tmpbufs and/or scratch
[22:08:12] when you think about it, how much data can there really be?
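(editor's note) The single-allocation layout described above (scratch running off the end of lp, with the match array placed after it in the same block) can be sketched roughly as follows. This is an illustration only, not the varnishkafka source: the `_sketch` type names, the field set, and the alignment rounding are assumptions, and overflow checks on the size arithmetic are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one matched field of a log line. */
struct match_sketch {
    const char *ptr;
    size_t len;
};

/* Hypothetical stand-in for "lp": header, then scratch, then match[]. */
struct logline_sketch {
    size_t scratch_size;
    size_t match_cnt;
    struct match_sketch *match; /* points into this same allocation */
    char scratch[];             /* flexible array member: runs off the end */
};

/* Allocate header + scratch + match array as one contiguous block. */
struct logline_sketch *logline_sketch_new(size_t scratch_size, size_t match_cnt)
{
    size_t align = _Alignof(struct match_sketch);
    size_t match_off = offsetof(struct logline_sketch, scratch) + scratch_size;

    /* Round up so the match array starts at a properly aligned offset. */
    match_off = (match_off + align - 1) / align * align;

    size_t total = match_off + match_cnt * sizeof(struct match_sketch);
    struct logline_sketch *lp = calloc(1, total);
    if (lp == NULL)
        return NULL;

    lp->scratch_size = scratch_size;
    lp->match_cnt = match_cnt;
    lp->match = (struct match_sketch *)((char *)lp + match_off);
    return lp;
}
```

A single `free(lp)` releases the header, the scratch area and the match array together. That is the appeal of the layout, but the offset arithmetic is also what makes it harder to read than three separate allocations.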
[22:08:34] scratch_size is 4K by default (we don't override), tmpbufs are dynamically allocated to fit
[22:08:46] we only have 1x "lp" anymore anyways since we're not caching multiple loglines
[22:08:55] (for the life of the daemon)
[22:09:12] we could just bump the default scratch size to something like 1MB and get rid of tmpbufs at least
[22:09:25] who cares if a daemon takes 1MB of extra static memory from the get-go?
[22:09:45] (and surely the formatted logs of one request, even a crazy one, can't go past that)
[22:45:38] HTTPS, Traffic, Operations: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2152741 (Dzahn) That redirect only exists because it used to be an "It works!" Apache site in the past and i thought it was ugly so redirected it to that meta page a long time ago. So it's...
[23:19:05] Traffic, domains, Operations: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2159456 (Dzahn) We could do this, but minus the redirect. That would mean nlwikipedia.org would simply be not found, like for example http://www.wikipedia.es/ but it would still be...