[08:58:41] <wikibugs>	 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2578360 (10Jhernandez) It would be interesting to know if query parameters end up being serialized in alphabetical order, which would ma...
[09:49:45] <wikibugs>	 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2578479 (10BBlack) My largest concern is still whether we can guarantee our application layer is insensitive to query parameter order, a...
[10:30:46] <wikibugs>	 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2578560 (10Gehel) 05Open>03Resolved a:03Gehel New caching headers deployed. Checked with chrome, cache-control headers are showing up.
[15:43:08] <_joe_>	 ema: we restarted the etcd cluster yesterday, so maybe check that conftool changes are picked up by pybal
[15:43:15] <_joe_>	 I did some testing, but one never knows
[15:43:40] <ema>	 _joe_: thanks, traffic is coming in so the changes have been picked up by pybal properly
[15:43:57] <_joe_>	 yeah pybal is designed to survive etcd failures
[15:44:29] <_joe_>	 like, until you keep it running, it won't crash if etcd is unavailable, and should reconnect once it's fixed
[15:45:00] <mark>	 twisted and callbacks are nice that way eh ;p
[15:45:14] <_joe_>	 mark: come on, it's not about callbacks :P
[15:45:22] <_joe_>	 it's about having a client that reconnects 
[15:45:32] <mark>	 which twisted does for you!
[15:45:46] <_joe_>	 and also, I'd like to use the same reconnection logic there is in python-etcd, which is much better *cough*
[15:51:41] <wikibugs>	 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2579264 (10Jhernandez) AFAIK I don't think that'd be a problem. Maybe @dr0ptp4kt has more insight.
[15:53:28] <wikibugs>	 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2579269 (10Dzahn) >>! In T143138#2561115, @Mjohnson_WMF wrote: > @MZMcBride, I've asked Katy Love about https://grants.wikimedia.org/....
[15:53:58] <bblack>	 yeah but the same twisted code has also done things like kill just the etcd-fetching thread while leaving pybal up and running, silently no longer responding to depools :P
[15:56:17] <mark>	 what, you didn't enable the watchdog? ;)
[16:04:38] <bblack>	 ah yes, twisted as the Windows of network programming.  The recourse of last resort is always to restart :)
[16:11:46] <bblack>	 ema: cp4005 vk died?
[16:12:35] <ema>	 bblack: yup
[16:12:41] <bblack>	 oh the vanrishd child died
[16:13:07] <ori>	 are there bugs in pybal's etcd integration?
[16:13:18] <ema>	 bblack: but got restarted by the parent
[16:13:32] <bblack>	 yeah, the death looks ugly though, and there were log overruns before it too
[16:13:57] <bblack>	 Aug 24 16:04:52 cp4005 varnishd[23996]: Child (23998) said Could not destroy object 1783971 in EXP_NukeOne
[16:14:02] <bblack>	 ^ x bajillions
[16:14:19] <ema>	 bblack: should we depool and look into the logs?
[16:14:53] <bblack>	 yeah probably
[16:15:14] <ema>	 done
[16:15:29] <bblack>	 it might be interesting if ganglia had good stats during that, to look at how storage was doing when the nuke problems started
[16:15:44] <bblack>	 if that was near the moment where it first filled the malloc storage and had to start evicting, I mean
[16:15:46] <ema>	 but why is storage involved at all?
[16:15:57] <bblack>	 (not disks)
[16:15:58] <ema>	 oh storage != file storage
[16:18:00] <ema>	 that's our patch logging that message BTW
[16:18:05] <ema>	 debian/patches/0003-varnishd-nukelru.patch
[16:19:01] <bblack>	 /wikipedia/commons/2/2d/Thomas_Keene_in_Ma
[16:19:06] <bblack>	 /wikipedia/commons/2/2d/Thomas_Keene_in_Ma
[16:19:07] <bblack>	 cbeth_1884_Wikipedia_crop.png
[16:19:16] <bblack>	 heh paste error, but that was the URL it was one
[16:19:34] <bblack>	 0x432a22: varnishd(ObjGetattr+0x92) [0x432a22]#012  0x433340: varnishd(ObjGetU32+0x20) [0x433340]#012  0x4
[16:19:37] <bblack>	 333d0: varnishd(ObjGetXID+0x20) [0x4333d0]#012  0x4222a9: varnishd(EXP_NukeOne+0x1d9) [0x4222a9]#012  0x4315a8: varnishd() [0x4315a8]#012  0x431f47: varnishd(ObjGetSpace+0xc7) [0x431f47]#012  0x425e61: varnishd(VFP_GetStorage+0x61)
[16:19:41] <bblack>	 ^ interesting bits of stacktrace in the crash
[16:20:15] <bblack>	 also notable is this was a do_stream+do_gunzip according to flags.  does that mean the png was thought to be gzipped? or is that flag just always set
[16:20:28] <ema>	 have we had similar problems in the past? debian/patches/0003-varnishd-nukelru.patch seems to be there to debug exactly this
[16:20:58] <bblack>	 yeah, but I think/thought that was persistent-specific
[16:21:08] <bblack>	 needs some digging
[16:21:21] <bblack>	 could just be that our own debug output is now both not necessary and causing problems heh
[16:22:41] <bblack>	 https://www.varnish-cache.org/trac/ticket/1670
[16:23:31] <bblack>	 https://varnish-cache.org/trac/ticket/634
[16:23:45] <bblack>	 our insane rate of purges could make it easier to trigger related rare races
[16:26:06] <wikibugs>	 10netops, 06Operations, 10ops-codfw: codfw: rack/setup/deploy puppetmaster200[12]switch configuration - https://phabricator.wikimedia.org/T143800#2579430 (10Papaul)
[16:27:36] <bblack>	 anyways, back to my earlier theory, it is notable that we have a crash in code we touch with a forward-ported old patch we don't necessarily understand well
[16:28:20] <wikibugs>	 10netops, 06Operations, 10ops-codfw: codfw: rack/setup/deploy puppetmaster200[12]switch configuration - https://phabricator.wikimedia.org/T143800#2579449 (10mark) 05Open>03Resolved Done, put in private vlans.
[16:28:33] <bblack>	 it just be that the condition we chose to log there is now common and a non-issue, and at our reqs and/or purges rates the debug log output can spam at insane rates, and attempting to emit those stdout or syslog lines at thousands/sec ends up slowing down a critical section of code, leading to crash
[16:28:45] <bblack>	 s/it just be/it could just be/
[16:29:05] <ema>	 so a possible approach would be dropping the patch and see if the crash happens again
[16:30:28] <bblack>	 yes
[16:30:36] <bblack>	 also the crash is clearly inside the arguments of our debug print heh
[16:30:54] <bblack>	 the stock code just does: (void)HSH_DerefObjCore(wrk, &oc);
[16:31:15] <bblack>	 we're the ones invoking ObjGetXID in the crash trace, in:
[16:31:17] <bblack>	 +       oc2 = oc;
[16:31:17] <bblack>	 +       if (HSH_DerefObjCore(wrk, &oc) != 0)
[16:31:18] <bblack>	 +               printf("Could not destroy object %u in EXP_NukeOne\n", ObjGetXID(wrk, oc2));
[16:32:10] <bblack>	 it would be interesting first to understand better the reason for the original patch, and what EXP_NukeOne really means in the modern code and whether it should be failing the deref regularly, and whether that implies failing to destroy a cache item to clear up space...
[16:32:38] * Platonides would do wrk ? ObjGetXID(wrk, oc2) : 0
[16:32:57] <Platonides>	 looks clearly as a null dereferencing
[16:33:15] <Platonides>	 not knowing at all how that function works
[16:33:30] <bblack>	 watch out, digging too far in varnish source code is known to cause brain damage :)
[16:33:34] <Platonides>	 xD
[16:33:43] <Platonides>	 it's probably oc2 the one being accessed
[16:33:47] <ema>	 don't do this at home!
[16:33:59] <Platonides>	 that 0x4 is clearly a member being accessed
[16:34:45] <bblack>	 oc is gauranteed non-null earlier
[16:34:50] <Platonides>	 wouldn't HSH_DerefObjCore also be affecting oc2?
[16:34:52] <bblack>	 and we copy to oc2 because deref can modify oc
[16:35:03] <Platonides>	 what's oc?
[16:35:09] <Platonides>	 a pointer to a struct?
[16:35:12] <bblack>	 an object core object
[16:35:14] <bblack>	 yeah
[16:35:35] <bblack>	 wrk seems to be validated non-null before we get here too
[16:35:37] <Platonides>	 HSH_DerefObjCore is probably destroying the underlying storage
[16:35:53] <bblack>	 but still, HSH_DerefObjCore could be modifying the oc object's contents in a way that make a later GetXID call invalid
[16:35:57] <bblack>	 right
[16:36:17] <Platonides>	 long xid =  ObjGetXID(wrk, oc);
[16:36:23] <Platonides>	 if (HSH_DerefObjCore(wrk, &oc) != 0)
[16:36:31] <Platonides>	    printf("Could not destroy object %u in EXP_NukeOne\n", xid);
[16:36:41] <bblack>	 also notable in the tail of the unmodified current source it's:
[16:36:42] <bblack>	         VSLb(wrk->vsl, SLT_ExpKill, "LRU x=%u", ObjGetXID(wrk, oc));
[16:36:43] <bblack>	         (void)HSH_DerefObjCore(wrk, &oc);
[16:36:43] <bblack>	         return (1);
[16:36:43] <bblack>	 }
[16:36:50] <Platonides>	 or unsigned int, whatever that is returned
[16:37:00] <bblack>	 so (a) they're already logging a broader condition than we are, if we turn on varnishlog filtering to catch it
[16:37:11] <bblack>	 and (b) they chose to log before not after the Deref :)
[16:37:30] <Platonides>	 yes, that's interesting
[16:37:41] <Platonides>	 copying the integer would be completely safe
[16:37:45] <bblack>	 yeah
[16:38:02] <bblack>	 but again, we don't really know for sure what that log emission even means in modern varnish4 source (much less v3!)
[16:38:13] <bblack>	 it may be meaningless noise that slows down a critical section
[16:38:39] <bblack>	 we could at least convert it to a VSLb() instead of a printf() to eliminate slowdown concerns if we want to keep it
[16:38:52] <bblack>	 (and also copy the xid like Platonides says)
[16:40:17] <bblack>	 basically change the printf in Platonides version to: VSLb(wrk->vsl, SLT_ExpKill, "Could not destroy object %u in EXP_NukeOne", xid)
[16:40:37] <bblack>	 I'd still like to understand better whether it's even worth keeping that patch/output at all
[16:42:16] <bblack>	 another interesting point about this, the description of EXP_NukeOne says:
[16:42:19] <bblack>	  * Attempt to make space by nuking the oldest object on the LRU list
[16:42:22] <bblack>	  * which isn't in use.
[16:42:24] <bblack>	  * Returns: 1: did, 0: didn't, -1: can't
[16:42:42] <bblack>	 there are failure return cases earlier in the function, but the stock bottom two lines where this patch goes are:
[16:42:44] <bblack>	         (void)HSH_DerefObjCore(wrk, &oc);
[16:42:47] <bblack>	         return (1);
[16:42:59] <bblack>	 which implies if we even reach that point, it *does* get nuked from storage effectively.
[16:43:32] <bblack>	 meaning "Could not destroy" may not be what's going on here at all, although maybe it was back in varnish3, whence that patch came from
[16:44:52] <ema>	 I've gotta go, leaving puppet enabled on cp4005 and all services depooled for now
[16:45:01] <ema>	 see you tomorrow :)
[16:46:03] <bblack>	 ema: ok, bye :)
[16:46:31] <bblack>	 looking at HSH_DerefObjCore itself, apparently the return value is the new refcount of the object, which is only decremented by one
[16:46:48] <bblack>	 so if an object has 3 open references, HSH_DerefObjCore isn't meant to actually destroy it, and will return "2" instead.
[16:48:13] <Platonides>	 it makes sense
[16:48:34] <bblack>	 also, both of the consumers of EXP_NukeOne use it like this:
[16:48:37] <bblack>	                 /* no luck; try to free some space and keep trying */
[16:48:37] <Platonides>	 and also means that when HSH_DerefObjCore() != 0, it should be safe to call ObjGetXID
[16:48:37] <bblack>	                 for (i = 0; j == 0 && i < cache_param->nuke_limit; i++) {
[16:48:40] <bblack>	                         if (EXP_NukeOne(wrk, stv->lru) == -1)
[16:48:42] <bblack>	                                 break;
[16:48:45] <bblack>	                         j = stv->allocobj(stv, oc, wsl);
[16:49:20] <bblack>	 so basically, it's ok that NukeOne doesn't always free space, that's what nuke_limit is for (how hard to keep search for one that's nukeable)
[18:07:39] <bblack>	 paravoid: that google per-country map is interesting.  I wonder if it would be worth doing a one-off config change to map parts of Africa to v4-only in gdnsd?
[18:08:43] <bblack>	 and peru maybe
[18:09:23] <bblack>	 a few other isolated cases
[18:09:34] <bblack>	 we'd have to keep up with it in the long run and slowly pull them back out of that list over time
[20:04:02] <wikibugs>	 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2580357 (10Dereckson) Poulpy from fr.wikipedia suggests to use the abbreviation pgc.wikimedia.org. That would let the other concurrent n...