[08:58:41] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2578360 (10Jhernandez) It would be interesting to know if query parameters end up being serialized in alphabetical order, which would ma... [09:49:45] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2578479 (10BBlack) My largest concern is still whether we can guarantee our application layer is insensitive to query parameter order, a... [10:30:46] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2578560 (10Gehel) 05Open>03Resolved a:03Gehel New caching headers deployed. Checked with chrome, cache-control headers are showing up. [15:43:08] <_joe_> ema: we restarted the etcd cluster yesterday, so maybe check that conftool changes are picked up by pybal [15:43:15] <_joe_> I did some testing, but one never knows [15:43:40] _joe_: thanks, traffic is coming in so the changes have been picked up by pybal properly [15:43:57] <_joe_> yeah pybal is designed to survive etcd failures [15:44:29] <_joe_> like, until you keep it running, it won't crash if etcd is unavailable, and should reconnect once it's fixed [15:45:00] twisted and callbacks are nice that way eh ;p [15:45:14] <_joe_> mark: come on, it's not about callbacks :P [15:45:22] <_joe_> it's about having a client that reconnects [15:45:32] which twisted does for you! [15:45:46] <_joe_> and also, I'd like to use the same reconnection logic there is in python-etcd, which is much better *cough* [15:51:41] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2579264 (10Jhernandez) AFAIK I don't think that'd be a problem. Maybe @dr0ptp4kt has more insight. [15:53:28] 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2579269 (10Dzahn) >>! In T143138#2561115, @Mjohnson_WMF wrote: > @MZMcBride, I've asked Katy Love about https://grants.wikimedia.org/.... [15:53:58] yeah but the same twisted code has also done things like kill just the etcd-fetching thread while leaving pybal up and running, silently no longer responding to depools :P [15:56:17] what, you didn't enable the watchdog? ;) [16:04:38] ah yes, twisted as the Windows of network programming. The recourse of last resort is always to restart :) [16:11:46] ema: cp4005 vk died? [16:12:35] bblack: yup [16:12:41] oh the vanrishd child died [16:13:07] are there bugs in pybal's etcd integration? [16:13:18] bblack: but got restarted by the parent [16:13:32] yeah, the death looks ugly though, and there were log overruns before it too [16:13:57] Aug 24 16:04:52 cp4005 varnishd[23996]: Child (23998) said Could not destroy object 1783971 in EXP_NukeOne [16:14:02] ^ x bajillions [16:14:19] bblack: should we depool and look into the logs? [16:14:53] yeah probably [16:15:14] done [16:15:29] it might be interesting if ganglia had good stats during that, to look at how storage was doing when the nuke problems started [16:15:44] if that was near the moment where it first filled the malloc storage and had to start evicting, I mean [16:15:46] but why is storage involved at all? [16:15:57] (not disks) [16:15:58] oh storage != file storage [16:18:00] that's our patch logging that message BTW [16:18:05] debian/patches/0003-varnishd-nukelru.patch [16:19:01] /wikipedia/commons/2/2d/Thomas_Keene_in_Ma [16:19:06] /wikipedia/commons/2/2d/Thomas_Keene_in_Ma [16:19:07] cbeth_1884_Wikipedia_crop.png [16:19:16] heh paste error, but that was the URL it was one [16:19:34] 0x432a22: varnishd(ObjGetattr+0x92) [0x432a22]#012 0x433340: varnishd(ObjGetU32+0x20) [0x433340]#012 0x4 [16:19:37] 333d0: varnishd(ObjGetXID+0x20) [0x4333d0]#012 0x4222a9: varnishd(EXP_NukeOne+0x1d9) [0x4222a9]#012 0x4315a8: varnishd() [0x4315a8]#012 0x431f47: varnishd(ObjGetSpace+0xc7) [0x431f47]#012 0x425e61: varnishd(VFP_GetStorage+0x61) [16:19:41] ^ interesting bits of stacktrace in the crash [16:20:15] also notable is this was a do_stream+do_gunzip according to flags. does that mean the png was thought to be gzipped? or is that flag just always set [16:20:28] have we had similar problems in the past? debian/patches/0003-varnishd-nukelru.patch seems to be there to debug exactly this [16:20:58] yeah, but I think/thought that was persistent-specific [16:21:08] needs some digging [16:21:21] could just be that our own debug output is now both not necessary and causing problems heh [16:22:41] https://www.varnish-cache.org/trac/ticket/1670 [16:23:31] https://varnish-cache.org/trac/ticket/634 [16:23:45] our insane rate of purges could make it easier to trigger related rare races [16:26:06] 10netops, 06Operations, 10ops-codfw: codfw: rack/setup/deploy puppetmaster200[12]switch configuration - https://phabricator.wikimedia.org/T143800#2579430 (10Papaul) [16:27:36] anyways, back to my earlier theory, it is notable that we have a crash in code we touch with a forward-ported old patch we don't necessarily understand well [16:28:20] 10netops, 06Operations, 10ops-codfw: codfw: rack/setup/deploy puppetmaster200[12]switch configuration - https://phabricator.wikimedia.org/T143800#2579449 (10mark) 05Open>03Resolved Done, put in private vlans. [16:28:33] it just be that the condition we chose to log there is now common and a non-issue, and at our reqs and/or purges rates the debug log output can spam at insane rates, and attempting to emit those stdout or syslog lines at thousands/sec ends up slowing down a critical section of code, leading to crash [16:28:45] s/it just be/it could just be/ [16:29:05] so a possible approach would be dropping the patch and see if the crash happens again [16:30:28] yes [16:30:36] also the crash is clearly inside the arguments of our debug print heh [16:30:54] the stock code just does: (void)HSH_DerefObjCore(wrk, &oc); [16:31:15] we're the ones invoking ObjGetXID in the crash trace, in: [16:31:17] + oc2 = oc; [16:31:17] + if (HSH_DerefObjCore(wrk, &oc) != 0) [16:31:18] + printf("Could not destroy object %u in EXP_NukeOne\n", ObjGetXID(wrk, oc2)); [16:32:10] it would be interesting first to understand better the reason for the original patch, and what EXP_NukeOne really means in the modern code and whether it should be failing the deref regularly, and whether that implies failing to destroy a cache item to clear up space... [16:32:38] * Platonides would do wrk ? ObjGetXID(wrk, oc2) : 0 [16:32:57] looks clearly as a null dereferencing [16:33:15] not knowing at all how that function works [16:33:30] watch out, digging too far in varnish source code is known to cause brain damage :) [16:33:34] xD [16:33:43] it's probably oc2 the one being accessed [16:33:47] don't do this at home! [16:33:59] that 0x4 is clearly a member being accessed [16:34:45] oc is gauranteed non-null earlier [16:34:50] wouldn't HSH_DerefObjCore also be affecting oc2? [16:34:52] and we copy to oc2 because deref can modify oc [16:35:03] what's oc? [16:35:09] a pointer to a struct? [16:35:12] an object core object [16:35:14] yeah [16:35:35] wrk seems to be validated non-null before we get here too [16:35:37] HSH_DerefObjCore is probably destroying the underlying storage [16:35:53] but still, HSH_DerefObjCore could be modifying the oc object's contents in a way that make a later GetXID call invalid [16:35:57] right [16:36:17] long xid = ObjGetXID(wrk, oc); [16:36:23] if (HSH_DerefObjCore(wrk, &oc) != 0) [16:36:31] printf("Could not destroy object %u in EXP_NukeOne\n", xid); [16:36:41] also notable in the tail of the unmodified current source it's: [16:36:42] VSLb(wrk->vsl, SLT_ExpKill, "LRU x=%u", ObjGetXID(wrk, oc)); [16:36:43] (void)HSH_DerefObjCore(wrk, &oc); [16:36:43] return (1); [16:36:43] } [16:36:50] or unsigned int, whatever that is returned [16:37:00] so (a) they're already logging a broader condition than we are, if we turn on varnishlog filtering to catch it [16:37:11] and (b) they chose to log before not after the Deref :) [16:37:30] yes, that's interesting [16:37:41] copying the integer would be completely safe [16:37:45] yeah [16:38:02] but again, we don't really know for sure what that log emission even means in modern varnish4 source (much less v3!) [16:38:13] it may be meaningless noise that slows down a critical section [16:38:39] we could at least convert it to a VSLb() instead of a printf() to eliminate slowdown concerns if we want to keep it [16:38:52] (and also copy the xid like Platonides says) [16:40:17] basically change the printf in Platonides version to: VSLb(wrk->vsl, SLT_ExpKill, "Could not destroy object %u in EXP_NukeOne", xid) [16:40:37] I'd still like to understand better whether it's even worth keeping that patch/output at all [16:42:16] another interesting point about this, the description of EXP_NukeOne says: [16:42:19] * Attempt to make space by nuking the oldest object on the LRU list [16:42:22] * which isn't in use. [16:42:24] * Returns: 1: did, 0: didn't, -1: can't [16:42:42] there are failure return cases earlier in the function, but the stock bottom two lines where this patch goes are: [16:42:44] (void)HSH_DerefObjCore(wrk, &oc); [16:42:47] return (1); [16:42:59] which implies if we even reach that point, it *does* get nuked from storage effectively. [16:43:32] meaning "Could not destroy" may not be what's going on here at all, although maybe it was back in varnish3, whence that patch came from [16:44:52] I've gotta go, leaving puppet enabled on cp4005 and all services depooled for now [16:45:01] see you tomorrow :) [16:46:03] ema: ok, bye :) [16:46:31] looking at HSH_DerefObjCore itself, apparently the return value is the new refcount of the object, which is only decremented by one [16:46:48] so if an object has 3 open references, HSH_DerefObjCore isn't meant to actually destroy it, and will return "2" instead. [16:48:13] it makes sense [16:48:34] also, both of the consumers of EXP_NukeOne use it like this: [16:48:37] /* no luck; try to free some space and keep trying */ [16:48:37] and also means that when HSH_DerefObjCore() != 0, it should be safe to call ObjGetXID [16:48:37] for (i = 0; j == 0 && i < cache_param->nuke_limit; i++) { [16:48:40] if (EXP_NukeOne(wrk, stv->lru) == -1) [16:48:42] break; [16:48:45] j = stv->allocobj(stv, oc, wsl); [16:49:20] so basically, it's ok that NukeOne doesn't always free space, that's what nuke_limit is for (how hard to keep search for one that's nukeable) [18:07:39] paravoid: that google per-country map is interesting. I wonder if it would be worth doing a one-off config change to map parts of Africa to v4-only in gdnsd? [18:08:43] and peru maybe [18:09:23] a few other isolated cases [18:09:34] we'd have to keep up with it in the long run and slowly pull them back out of that list over time [20:04:02] 10Domains, 10Traffic, 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2580357 (10Dereckson) Poulpy from fr.wikipedia suggests to use the abbreviation pgc.wikimedia.org. That would let the other concurrent n...