[04:00:06] bblack: gotcha, ok, will do that next time, thx
[07:06:51] 10netops, 06Operations: Interface errors on cr2-eqiad:xe-4/3/1 - https://phabricator.wikimedia.org/T163542#3216758 (10ayounsi) p:05Triage>03Lowest
[07:33:44] <_joe_> any chance we can move that switchover dynamic for traffic to etcd sooner than later?
[07:49:29] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3216866 (10ayounsi)
[08:30:10] bblack: 4.1.5-1wm3 now supports -p exp_thread_rt / exp_lck_inherit, false by default. I've added a hiera flag to enable the experiment too, cache::exp_thread_rt, and set it on cp2024
[08:36:53] hello people, we received some alerts for hadoop data loss consistency checks (~1%) for upload-2017-4-26-12 and upload-2017-4-26-13
[08:37:39] IIRC yesterday there were some issues with upload, just wondering if these alerts are due to that or if I need to worry about other things :D
[08:37:47] elukey: "some" :)
[08:38:02] what's the 12 and the 13 in there?
[08:38:08] UTC hours
[08:38:21] yeah, that's yesterday's fire
[08:38:29] ahhh okok
[08:38:33] good :)
[08:38:40] thanks ema!
[08:39:56] elukey: thank you, here is my ugly-but-useful dashboard. See 'Failed Fetches' https://grafana-admin.wikimedia.org/dashboard/db/varnish-swift-connections?orgId=1&from=now-24h&to=now
[11:22:50] looks like we created up to ~18k worker threads yesterday when the codfw upload issues happened
[11:23:00] see the bottom of https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=1493208679934&to=1493235830432&var-server=cp2024&var-datasource=codfw%20prometheus%2Fops
[11:24:21] also, notice the difference between busy_sleep and busy_wakeup (requests sent to sleep without a worker thread because they found a busy object / requests taken off the busy object sleep list and rescheduled)
[11:26:51] my current theory, discussed with godog earlier this morning: by giving RT priority to the expiry thread it could happen that workers don't get a chance to acquire locks, new workers are created to cope with incoming requests, hence backend connections pile up (but the number backend requests doesn't increase)
[11:27:07] *number of
[11:41:35] also interesting: n_waitinglist
[11:45:05] again according to the theory: a certain worker needs to nuke, tries to access expiry mailbox but doesn't manage to acquire the lock, other requests come in for that object and waiting list grows
[11:47:49] > When the lookup operation does not match any hash, it creates an object with a busy flag and inserts it in cache. Then, the request is sent to the vcl_miss subroutine. The busy flag is removed once the request is handled, and the object is updated with the response from the backend.
[11:47:54] http://book.varnish-software.com/4.0/chapters/VCL_Subroutines.html#vcl-vcl-hash
[11:50:30] > Subsequent similar requests that hit busy flagged objects are sent into a waiting list.
[12:03:02] yeah that's too bad
[12:03:56] my working theory for the mutex acquisition queue was that since the expiry thread only operates on one item per lock (and then does CPU work outside the lock), that the worst case under pressure would be an alternation of expthread / random_workthread / expthread / random_workthread / ...
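
A rough sketch of what the -p exp_thread_rt / exp_lck_inherit experiment plausibly does (this is an illustration only, not the actual 4.1.5-1wm3 patch; the function names are made up): put the expiry thread into a real-time scheduling class, and create the contended mutex with priority inheritance so a worker holding it briefly runs at the waiter's priority instead of being preempted.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <string.h>

/* exp_thread_rt: put the calling (expiry) thread into a real-time class */
static int
exp_thread_set_rt(void)
{
	struct sched_param sp;

	memset(&sp, 0, sizeof sp);
	sp.sched_priority = sched_get_priority_min(SCHED_RR);
	/* needs CAP_SYS_NICE (or a suitable RLIMIT_RTPRIO) to succeed */
	return (pthread_setschedparam(pthread_self(), SCHED_RR, &sp));
}

/* exp_lck_inherit: create the expiry/mailbox mutex with priority inheritance */
static int
exp_mtx_init_inherit(pthread_mutex_t *mtx)
{
	pthread_mutexattr_t attr;
	int ret;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	ret = pthread_mutex_init(mtx, &attr);
	pthread_mutexattr_destroy(&attr);
	return (ret);
}

If the patch works roughly along these lines, it fits the theory above: SCHED_RR lets the expiry thread win the CPU (and often the lock) far more consistently than the ordinary worker threads.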
[12:04:39] but perhaps given its RT scheduling priority, it manages to schedule itself multiple times in a row, too
[12:05:08] (it would be nice if it could be given lock priority without giving it such heavy CPU priority at the same time)
[12:05:51] (but of course it would be nicer if people just didn't write software with this ancient invalidated mindset of thousands of threads contending on complex mutex schemes)
[12:12:00] well, even if the locking worked ideally like that (alternation), it could still be the case that the expthread takes too long per iteration and slows down the queue of pending workthread lockers too much
[12:13:11] which means it can't really keep up with things in general. So the RT scheme just squeezes the problem into a different shape: flatten the mailbox lag and workthread queueing pops up.
[12:14:54] if you can assume consistent decisions about object:bin mapping, you could split up the mailbox and expiry binheap per-storage-file (like the LRU already is)
[12:15:03] and have 1x expiry thread per storage file
[12:15:27] but probably varnish in general can't assume that. really I'm not sure we can either, but it depends on your definition of object identity
[12:34:37] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3217281 (10BBlack) Some general updates on bin-sizing estimates: Based on the graph data for available bytes in each bin and comparing how fast they initially re...
[12:47:21] bblack: +1 to storage wipe. Also, I think I've read once that using dd to allocate the files reduces FS fragmentation and is in general recommended
[12:49:10] yeah back when I first got here, when a lot of this was *way* more manual and tricky than it is today, I often did a dd-recreation anytime we reinstalled a node or wiped the storage FS
[12:49:26] but fallocate() is supposed to fix that problem more-automagically
[12:49:46] (we used to patch in fallocate() for that reason, but now upstream does it)
[12:50:11] http://www.varnish-cache.org/docs/trunk/users-guide/storage-backends.html#file
[12:50:19] > Note that if varnishd has to create or expand the file, it will not pre-allocate the added space, leading to fragmentation, which may adversely impact performance on rotating hard drives. Pre-creating the storage file using dd(1) will reduce fragmentation to a minimum.
[12:51:14] so maybe they've forgotten to update the docs after introducing fallocate?
[12:56:33] well, more like their fallocate isn't general-purpose
[12:56:43] the code seems to do it iff linux && ext4
[12:57:20] it's interesting they ran into the same XFS issue we did with our old patch and the comment is similar I think heh
[12:59:39] our old one for persistent: https://github.com/wikimedia/operations-debs-varnish/blob/3.0.6-plus-wm/debian/patches/0010-varnishd-persistent-fallocate.patch
[12:59:53] our old one for file: https://github.com/wikimedia/operations-debs-varnish/blob/3.0.6-plus-wm/debian/patches/0010-varnishd-file-fallocate.patch
[13:00:14] we still ship the patch for persistent apparently debian/patches/0002-varnishd-persistent-fallocate.patch
[13:00:42] their new one for file (but not persist): https://github.com/wikimedia/operations-debs-varnish4/blob/master/lib/libvarnish/vfil.c#L193
[13:01:29] EXT4_SUPER_MAGIC is a cool constant name
[13:01:50] lol
[13:02:12] well SUPER in this context means superblock, but yeah it also sounds like "super magical!" :)
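
For illustration, a simplified sketch of the kind of filesystem check discussed above (the real code in upstream's vfil.c is more involved; maybe_preallocate is a made-up name): only preallocate with fallocate() when the file sits on ext4, and skip it elsewhere, in which case pre-creating the file with dd(1) remains the way to avoid fragmentation.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <sys/vfs.h>
#include <linux/magic.h>	/* EXT4_SUPER_MAGIC */

static int
maybe_preallocate(int fd, off_t size)
{
	struct statfs sfs;

	if (fstatfs(fd, &sfs) != 0)
		return (-1);
	if (sfs.f_type == EXT4_SUPER_MAGIC)
		return (fallocate(fd, 0, 0, size));	/* real preallocation */
	return (0);	/* e.g. XFS: skip, rely on dd-style pre-creation */
}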
[13:02:21] exactly :)
[13:02:56] luckily, we happen to use ext4 :)
[13:05:06] of course, technically neither dd nor fallocate guarantees the allocated space is contiguous I guess
[13:05:28] but the theory is if you start with an empty fs and fallocate or dd the files sequentially it should come out that way
[13:09:54] I wonder if using posix_fadvise (or posix_madvise on the mmap, or just madvise) to indicate random access would lower our i/o load a bit
[13:10:33] the default uses a "normal" readahead strategy, if you advise sequential it reads ahead more aggressively, and if you advise random it disables readahead
[13:11:49] oh varnish4 has an madvise() call in the file storage layer, but it's unclear at a glance at what scope it's used
[13:12:41] ah whole-file, unless varnish has to split it
[13:13:35] that's interesting, varnish splitting...
[13:14:17] heh
[13:14:42] so varnish file storage, it splits up a storage file into what it considers reasonable "chunks" that are mmap'd separately
[13:15:20] if your total file size is small enough it's all one chunk, but if it's "too big" it cuts your storage size in half and tries to allocate 2x half-sized chunks (and recurses until it can get small-enough chunks)
[13:15:47] when is it "too big", invoking the splitting behavior?
[13:16:10] 1) if mmap() fails with MAP_FAILED, it will try splitting and retrying the mmap of the splits (ok)
[13:16:27] 2) if the requested size is >= SSIZE_MAX, it will split
[13:16:57] 3) it tracks sizes that failed before due to whatever above, and anything > "fail size" will auto-split (I guess attempt to detect that the first time and not re-probe)
[13:17:10] the "fail size" initializes to 1 << 30
[13:18:33] so on x86_64, SSIZE_MAX is the expected 64-bit-ish value
[13:18:39] but 1 << 30 becomes the limiter
[13:19:05] pmap -x $varnish-be-child-pid seems interesting
[13:19:24] so it's splitting up our files into 1GB chunks by default
[13:19:51] now I wonder what the negative impact of over-chunking huge storage files is....
[13:21:33] off_t fail = 1 << 30; /* XXX: where is OFF_T_MAX ? */
[13:22:02] if there was an OFF_T_MAX, of course it would be like 2^63 for us, not 2^30 :P
[13:25:03] oh, does it assume we're on a 32 bit arch?
[13:26:01] basically
[13:26:20] so bin4 == 29GB, and pmap has 32 separate mappings for it
[13:26:34] so s/30/63/ and see what happens? :)
[13:26:47] bin2 == 320GB, and pmap has 512 mappings
[13:27:03] because it just halves the size until it fits, so that's why the per-chunk size is <1GB
[13:28:28] I suspect (but the code is hard to follow) that these separate mappings stay separate and then space is allocated/freed/merged within each
[13:28:43] meaning they increase our fragmentation and make it harder to allocate new space, etc...
[13:29:24] I'm not sure if 63 or 62 is the right value, but even 62 is way more than we'll have on disk in any reasonable future
[13:29:40] but now I'm hunting down why there isn't an OFF_T_MAX to do it properly, or what the standards intend there
[13:31:40] there's COMPAT_LOFF_T_MAX in arch/x86/include/asm/compat.h
[13:31:44] #define COMPAT_LOFF_T_MAX 0x7fffffffffffffffL
[13:32:14] I kind of like this for a general portable-ish answer within C99: http://stackoverflow.com/a/33643558
[13:32:17] which is 2^63 - 1
[13:32:42] I think for varnish we can shorten that down to just assuming either 32-bit or 64-bit though :)
[13:36:08] and skip the assert too, good enough
[13:36:52] const off_t OFF_T_MAX = sizeof(off_t) == sizeof(int32_t) ? INT32_MAX : INT64_MAX;
[13:39:58] (and include stdint.h)
[13:40:27] also since we're not using persistent, we should just dump those patches
[13:40:34] we can always resurrect them from git history if we need
[13:41:01] yep
[13:41:17] while we're at it, should we change vsm file perms for good?
[13:42:36] yeah there's no other sane way than via patching right?
[13:42:40] nope
[13:42:57] (void)umask(027); -> (void)umask(077); in mgt_vcc.c I guess
[13:43:56] ok
[13:44:26] so kill 3x persist patches, add vsm perms + off_t patches
[13:44:38] I donno about umask, but I guess that's how they're handling it....
[13:44:46] umask is going to affect the whole process and all future calls
[13:45:21] oh it's right before execl(), duh
[13:45:33] yeah, I'll dig a bit deeper and fix the perms too, the amount time spending chmodding ... :)
[13:45:37] s/spending/spent/
[13:46:00] nah fixing the umask there is fine
[13:46:17] execl() halts execution of varnish code in that process, basically
[13:48:21] but the change above is backwards re: 027 -> 077
[13:48:49] you want 022 instead
[13:49:02] right
[13:51:17] I imagine that off_t / 1 << 30 thing really hurts especially on the higher-numbered bins
[13:52:36] bin4's size range is 64MB -> 1GB objects, and its 29GB of storage is allocated in 3.6GB chunks :P
[13:54:29] maybe we should just see the effect of fixing this before playing with storage size tweaks
[13:54:54] it may dramatically alter at least the larger bins on how many total objects they fit
[13:55:35] also my math seems to be wrong somewhere. the code implies it will split storage until it gets chunks <= 1GB, but pmap is showing ones that are <= 4GB
[13:56:18] oh duh, I assumed the pmap output was in pages, but it's in KB
[13:57:03] so our 29GB storage file for 64MB -> 1GB objects is split into 922MB chunks heh
[13:57:23] nice!
[13:57:43] our 216GB storage file for 16MB -> 64MB objects is split into a crapload of 864MB chunks
[13:58:25] double-doh, the umask thing had nothing to do with vsm file perms, it's AZ(fchmod(fd, 0640)); in mgt_jail_unix.c instead
[13:58:37] I'll make it 0644 and we can live happy
[13:58:51] ok :)
[14:04:12] I should poke around storage_file some more and make sure there aren't other 32-bit assumptions that will break when this one issue is fixed
[14:07:58] nope, everything looks right
[14:08:25] just that one XXX comment about OFF_T_MAX and the assumption of 32-bit for lack of it
[14:15:08] https://github.com/wikimedia/operations-debs-varnish4/blob/master/bin/varnishd/storage/stevedore_utils.c#L198
[14:15:29] ^ apparently they already have code elsewhere to detect small off_t, but then don't re-use the result here heh
[14:15:56] well not that it quite has the right result, but still. you'd think they'd have handled this case similarly
[14:16:15] 2nd wm release for today :) https://gerrit.wikimedia.org/r/#/c/350561/
[14:20:00] crossing fingers that this profoundly improves our mailbox lag situation
[14:20:25] it should considerably reduce fragmentation and make it easier to allocate space with fewer ejected objects, anyways
[14:21:44] should really improve the whole nuke-searching thing too, where it repeatedly kills objects until it can find enough contiguous free space
[14:22:08] building
[14:22:20] I'm super amazed by the work you guys have been doing
[14:22:55] you should probably mail the varnish people or even blog about it or something when you're done
[14:27:20] thanks!
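
Roughly what the off_t fix above amounts to, as a sketch rather than the exact patch (initial_fail_size is a made-up name): derive OFF_T_MAX from the width of off_t and use it, still bounded by SSIZE_MAX for mmap, as the initial "fail size", so a 64-bit host only splits a storage file when mmap() of the whole thing genuinely fails, instead of pre-chopping it into <1GB chunks.

#include <stdint.h>
#include <limits.h>
#include <sys/types.h>

/* what the XXX comment in vfil.c is asking for, assuming 32- or 64-bit off_t */
static const off_t OFF_T_MAX =
    sizeof(off_t) == sizeof(int32_t) ? INT32_MAX : INT64_MAX;

/* initial "known to fail" mmap size: was a hardcoded 1 << 30 */
static off_t
initial_fail_size(void)
{
	off_t fail = OFF_T_MAX;

	if (fail > (off_t)SSIZE_MAX)	/* mmap() can't map more than this anyway */
		fail = (off_t)SSIZE_MAX;
	return (fail);
}

With that in place the halve-and-retry splitting logic stays the same, it just no longer kicks in below the 1GB mark on 64-bit.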
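
Separately, a minimal sketch of the posix_fadvise/madvise idea from 13:09 (untested here, just showing the calls involved; hint_random_access is a made-up name): hint random access on the storage fd and on each mmap'd chunk so the kernel disables readahead.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

static void
hint_random_access(int fd, void *map, size_t len)
{
	/* read()/write() path: disable readahead on the storage fd */
	(void)posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	/* page-fault path: same hint for the mmap'd chunk */
	(void)madvise(map, len, MADV_RANDOM);
}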
[14:27:36] ema: so yesterday afternoon/evening when I wanted to restart/downgrade all the codfw-upload nodes
[14:28:15] I used:
[14:28:18] sudo cumin -b 1 -s 1800 'R:class = role::cache::upload and cp*.codfw.wmnet' 'run-no-puppet varnish-backend-restart'
[14:28:37] put that in a screen session and it restarts 1 every 30 minutes over 5 hours
[14:28:43] nice and easy, and no real impact
[14:28:49] cool
[14:29:09] lol
[14:29:22] * volans discovers new usages of cumin
[14:32:10] meanwhile, cp1008 upgraded (but I forgot to run puppet agent afterwards, fixing)
[14:34:00] bblack: OK to upgrade upload-codfw to wm4 and then start the restarts?
[14:34:07] yeah +1
[14:53:21] first varnish-be restarted was cp2022, looks good
[14:53:33] let's see the pmapping
[14:53:56] pmap -x 45622|grep bin | wc -l
[14:53:56] 11
[14:53:59] ha!
[14:54:39] yeah "bin" matches the CLI at the top too
[14:54:47] so it's really 10, which is 2 lines per file, which is 1 mapping per file
[14:55:41] cp2002:
[14:55:42] root@cp2002:~# pmap -x 46829|grep bin|wc -l
[14:55:42] 2177
[14:55:47] yeah :)
[14:57:59] wow cp2022 in varnish-machine-stats has a lot of variance from our past norms
[14:58:16] I guess I should expand the timeframe a bit and compare to past restarts, but still...
[15:08:01] it usually takes ~5h to see the first allocator failures, let's see if that changes :)
[16:04:05] varnish 4.1.6 was released yesterday https://github.com/varnishcache/varnish-cache/blob/4.1/doc/changes.rst#varnish-cache-416-2017-04-26
[16:22:18] https://github.com/varnishcache/varnish-cache/commit/14ce48044a680032adc51244f58dac03c391cea1
[16:22:29] https://github.com/varnishcache/varnish-cache/commit/adb8de06f305aa21003a7e65aab0083e2216d89f
[16:22:34] https://github.com/varnishcache/varnish-cache/commit/5dd70aff3e968f173d3b747b1d1a2bb805be5f6f
[16:22:48] those 3 look particularly interesting, but all the bugfixes are of course useful
[16:23:03] go-cold one may be related to mem leaks on lots of reloads
[16:23:15] the age (to our old problems with ttl=0 and such...)
[16:23:45] and the client workspace / chunked one sounds plausible in some scenarios here too
[16:23:52] so I guess we should rebase soon :)
[16:50:41] for varnish I've been using gbp import-orig (yeah, we use got a different method for every package)
[16:50:46] https://wikitech.wikimedia.org/wiki/Varnish#Package_a_new_Varnish_release
[16:51:12] s/use got/use/
[16:52:07] the code review of course in this case is completely useless, it's more a sort of "I'm upgrading to the new upstream FYI"
[16:53:22] I'm out for a bit, bbl probably after you're gone :)
[16:53:40] (and I'll check up if cumin finished upgrades)
[16:54:01] 16:53:49 ema@neodymium.eqiad.wmnet:~
[16:54:02] $ screen -li
[16:54:02] There is a screen on: 18858.codfw-upgrade (04/27/2017 02:29:25 PM) (Detached)
[16:54:04] 1 Socket in /var/run/screen/S-ema.
[16:55:49] excuse the pastefail, but that's the screen session :)
[17:36:36] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3218491 (10ayounsi) All servers have been moved, confirmed no more interfaces are up on asw. and no more traffic (other than multicast) on the as...
[17:37:12] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3218493 (10ayounsi)
[18:10:08] bblack: they want the ORES switch reverted now. https://gerrit.wikimedia.org/r/#/c/350622/ should i do the "middle step"
[18:10:16] sorry, about that
[18:22:17] mutante: AFAIK the clean way is with 2 commits, going active/active first and then switching. It can be done with one commit like in switchdc disabling puppet and enabling+run it first in the dc_to and then in the dc_from, I have the commands in an etherpad if needed
[18:22:43] but first check with brandon that it's ok this way also for misc, there may be subtle differences that I ignore
[18:47:45] volans: ok, thanks. now i'm wondering if we can't just make it active/active and "the end"
[18:48:14] the idea was to be codfw only during the switchover
[18:48:22] 11:50 < Amir1> "ores (active/active)"
[18:48:32] and then active/active for all the services that support it
[18:48:48] ok, so i should wait until we switch back?
[18:50:08] I'd move it back to codfw only, but this is my opinion. Also I think eqiad is still depooled gdnsd wise, isn't it?
[18:50:10] first do active/active for 30 min, then codfw-only.. wait until switchdc happened again.. active-active and keep it like this
[18:50:19] (on mobile, cannot check now)
[18:50:50] ok, i'll just switch it back to how i found it before it broke after the deploy
[18:50:58] but in 2 steps
[18:51:48] sounds good to me, but again, I cannot speak for traffic folks :-P
[18:53:00] gotta go now, sorry
[19:04:07] me too. on the way to WMF office
[19:04:10] will do it from there
[19:27:53] so are we now on the 10G lvses?
[19:56:50] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3219063 (10Cmjohnson) @Ottomata @elukey Do you still want to spread out the nodes or okay to resolve this task?
[19:58:26] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3219076 (10Ottomata) I'd like hear what elukey thinks, but I think we can resolve this. This Kafka hardware is slated to be decommed anyway, bu...
[20:41:13] 10Traffic, 06Operations, 10ops-eqiad: cp1066.mgmt.eqiad.wmnet is unreachable - https://phabricator.wikimedia.org/T149217#3219303 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson cable was not seated correctly. Fixed
[21:29:46] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3219432 (10jcrespo) No more scheduled downtime? Can T162681 be closed?
[22:16:09] mutante: yeah two-step is safer, and yeah I'd agree ideally we go back to codfw-only for now (and then back to a/a when everything else does)
[22:16:36] paravoid: no, I've been waiting for the switch work to die down before I start picking up wherever we left off on the 10G lvs project however long ago
[22:19:08] ok, makes sense :)
[22:28:50] bblack: it's back to codfw-only and i did it in 2 steps