[02:15:48] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2581179 (10AndyRussG) Here's how things work in the proposed patch: - The `mw.centralNotice.ge...
[02:36:28] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2581207 (10AndyRussG) P.S. Thanks to @Krinkle for the idea of using config to name a RL module...
[10:06:42] repooled cp4005, running without the nukelru patch
[10:07:02] no crashes so far
[12:00:04] ema: awesome btw
[12:00:26] to finish up the thread from yesterday where Platonides left off with "and also means that when HSH_DerefObjCore() != 0, it should be safe to call ObjGetXID"....
[12:01:27] I think probably the cause of the crash is that there's a race when e.g. HSH_DerefObjCore() returns 1, but a concurrent thread drops the object's refcnt to zero and starts destroying the object before the subsequent ObjGetXID call based on the "1" response.
[12:01:58] there is no lock around the combination of the HSH_DerefObjCore -> ObjGetXID-for-printf
[12:03:12] if necc we still could revive a patch like that if we cared, but we'd need to get the xid before the deref (as an integer) and save it for deref-retval-based print afterwards (and then use VSLb() instead of printf() for efficiency)
[12:03:25] but it's probably not worth it
[12:53:00] bblack: varnishd[32500]: : Invalid conf pair: lg_chunk_size:17
[12:53:12] looks like it's been renamed to lg_chunk, according to jemalloc(3)
[12:53:37] "opt.lg_chunk" (size_t) r-
[12:53:37] Virtual memory chunk size (log base 2). If a chunk size outside the supported
[12:53:40] size range is specified, the size is silently clipped to the minimum/maximum
[12:53:43] supported size. The default chunk size is 4 MiB (2^22).
[13:05:50] nice
[13:06:29] so another v4 config bit flip?
[13:06:44] or is this happening on text too? they're running the same debian jemalloc lib I think...
[13:07:07] surely they haven't broken that during jessie? I'm sure my testing was within jessie
[13:08:04] or my testing was basically never testing that parameter because I never saw the output, that's a possibility too
[13:15:37] it's happening on cp1008 too
[13:17:32] everywhere, really
[13:17:52] yeah
[13:18:10] only question now is whether it changed since I tested it, or the testing was invalid (on that one of the params, anyways)
[13:19:55] it might be that the option never existed :)
[13:20:04] seems like it
[13:20:11] well, at least in versions modern enough to matter
[13:20:26] hmmm, or ever lol
[13:20:52] jemalloc is unchanged in jessie so far
[13:20:55] so, if that never had effect, we should be careful
[13:21:00] the default is 22
[13:21:03] I think
[13:21:13] so the <22 values are untested now
[13:22:38] but still, the tuning was assumed-reasonable based on our statistical estimates, the concern was about perf blowing up if our stats thoughts are off
[13:23:10] 2^22 = 4MB chunks. 2^17/16 in current config is 128K and 64K chunks
[13:23:54] so we should start back at something better-than-default, but not that extreme without re-validating that it doesn't cause negative perf fallout
[13:24:15] I'd propose 19 for text and 20 for the others?
[13:24:43] 512KB for text, 1MB for the others. it's still better than default 4MB for all, but not extreme.
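A self-contained illustration of the refcount race described at 12:01-12:03 above, and of the shape of the fix sketched there (read the xid while you still hold a reference, then decide what to log from the deref return value). This is a toy built on C11 atomics, not Varnish code; the real HSH_DerefObjCore()/ObjGetXID()/VSLb() interfaces are only paraphrased in the comments.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for an objcore and its transaction id (xid). */
    struct obj {
        atomic_int refcnt;
        unsigned   xid;
    };

    /* Drop one reference and return how many remain; once this returns 0
     * (in any thread) the object is freed and must not be touched again.
     * This mirrors the role HSH_DerefObjCore() plays in the discussion. */
    static int
    deref(struct obj *o)
    {
        int left = atomic_fetch_sub(&o->refcnt, 1) - 1;

        if (left == 0)
            free(o);
        return (left);
    }

    /* The crash pattern was: deref first, then fetch the xid for a log line
     * if the return value was nonzero.  Another thread can drop the last
     * reference and destroy the object in that window, making the later xid
     * read a use-after-free.  The safe shape: copy the xid out while our own
     * reference still pins the object, deref, and only then decide whether
     * to log it. */
    static void
    release_and_log(struct obj *o)
    {
        unsigned xid = o->xid;      /* safe: we still hold a reference */
        int left = deref(o);        /* o must not be dereferenced after this */

        if (left > 0)
            printf("still referenced: xid=%u refs-left=%d\n", xid, left);
    }

    int
    main(void)
    {
        struct obj *o = malloc(sizeof *o);

        atomic_init(&o->refcnt, 2);
        o->xid = 1001;
        release_and_log(o);   /* logs: one reference remains   */
        release_and_log(o);   /* last reference: frees, silent */
        return (0);
    }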
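On the lg_chunk confusion just above: a quick way to be sure a MALLOC_CONF value such as lg_chunk:19 actually took effect (rather than being rejected the way lg_chunk_size was, with the error easily missed) is to read the read-only "opt.lg_chunk" control back via mallctl(), which jemalloc(3) documents. A minimal sketch, assuming the stock jessie libjemalloc with the unprefixed API, built with -ljemalloc:

    #include <jemalloc/jemalloc.h>
    #include <stdio.h>

    /* Run as e.g.:  MALLOC_CONF=lg_chunk:19 ./check-lg-chunk
     * and expect:   opt.lg_chunk = 19 (chunk size 512 KiB)   */
    int
    main(void)
    {
        size_t lg_chunk, len = sizeof(lg_chunk);

        if (mallctl("opt.lg_chunk", &lg_chunk, &len, NULL, 0) != 0) {
            fprintf(stderr, "mallctl(opt.lg_chunk) failed\n");
            return (1);
        }
        printf("opt.lg_chunk = %zu (chunk size %zu KiB)\n",
            lg_chunk, ((size_t)1 << lg_chunk) / 1024);
        return (0);
    }

The other opt.* settings (for example the lg_dirty_mult mentioned at 13:27) can be read back the same way.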
[13:24:52] sounds good
[13:25:16] we can re-experiment over time (again) at lowering it
[13:27:14] I guess this explains why the testing was frustrating, and why over the very long term (longer than the tests) the mem values didn't end up much better than before
[13:27:26] although I'm sure the dirty_mult thing helped too
[13:28:04] at some future time after more longer-ish term tests, this might let us raise the frontend memory back up some more, but we'll see
[13:49:47] bblack: I'll prepare a patch for the lg_chunk thing
[13:53:11] ema: thanks :)
[13:54:51] bblack: in your comments you were using the right name for the option, I've just noticed
[13:54:54] https://phabricator.wikimedia.org/T135384#2328291
[13:55:35] were you testing by setting the env variable manually?
[13:57:10] well whatever way I was testing, I was doing things manually, on single machines
[13:57:29] but I don't recall well enough to be sure, we should just assume testing was invalid
[13:57:34] ok
[14:50:25] ema: cp4005 still ok?
[14:51:23] bblack: looks like it's still ok, yes
[14:54:22] hasn't crashed, nothing weird in the logs
[14:54:55] good :)
[14:55:21] the kernel occasionally informs us that Process accounting resumed, but that's unrelated and not new
[14:55:26] yeah
[14:55:38] so I'm kinda torn on our next best step
[14:55:45] either turn up v4 frontends in all of ulsfo
[14:55:55] or turn on v4 backend on cp4005
[14:56:01] or turn up the backend on cp4005 first, then follow up with all of ulsfo at both layers
[14:56:22] well, I guess we can't do the first option, that was silly
[14:56:35] right, some backends need to be there :)
[14:56:36] we can't co-install v3+v4, and splitting the cluster to do it would be messy
[14:57:29] so yeah I'd say pool up the backend next (maybe now?), either way we can still depool the whole machine, and in the very very worst case of something that infects consuming caches, we can wipe the other cp4 frontends with restarts post-depool.
[14:57:38] 10netops, 06Operations: configure port for frdb1001 - https://phabricator.wikimedia.org/T143248#2582634 (10Jgreen) a:05Jgreen>03None
[14:58:10] ok, I've built 4.1.3-1wm2 on copper and was about to upload it to carbon
[14:58:17] ok
[14:58:18] after that I'll pool the backend
[14:58:23] sounds great
[14:58:27] cool :)
[15:14:53] cp4005 pooled
[15:18:48] looks good
[15:21:01] some 501s for a specific UA: "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
[15:21:06] 98?
[15:23:55] heh
[15:24:00] 501?
[15:24:20] what method are they using?
[15:24:27] GET
[15:24:34] that's... odd
[15:25:23] http://www.httrack.com/
[15:25:55] I'd ignore it
[15:26:04] well I mean for "omg something's broke purposes"
[15:26:12] right
[15:26:14] it would be interesting to know why varnish is calling that a 501
[15:26:39] something strange about the request, maybe some malformation of the request line, or sending a body with the GET, or?
[15:27:07] I'll try to get a varnishlog of those
[15:27:27] TIL: there are real users on the modern internet using the AOL Browser (a build of IE7) on Windows XP, reading Wikipedia :P
[15:27:31] Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.1; AOLBuild 4334.5009; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
[15:28:44] see cp4005.ulsfo.wmnet:/root/501s.log
[15:33:02] http://www.varnish-cache.org/trac/wiki/HTTPFeatures
[15:33:09] return 501 and close the connection when client entity-bodies are received that squid doesn't understand the transfer-codings of.
[15:33:13] squid!
[15:39:29] I'm tracing through the v4 code, and I have yet to find the exact path that causes a 501
[15:39:38] it comes from somewhere deep, some error in request parsing
[15:40:12] in direct terms it comes from cache_req_fsm.c 's lines that say:
[15:40:13] if (req->err_code < 100 || req->err_code > 999)
[15:40:14] req->err_code = 501;
[15:40:25] yeah I've seen that too
[15:40:38] but then we're left searching everywhere else for how err_code can fall outside that range, which mostly seems to be about http parsing
[15:40:52] non-3-digit status codes
[15:44:20] somewhere within HTTP1_DissectRequest() is where it happens, but it's hard to follow
[15:44:37] as it calls into various other http1_* calls to parse various bits
[15:45:21] anyways, broken UA is broken
[16:19:52] 10Traffic, 06Operations: Push gdnsd metrics to graphite and create a grafana dashboard - https://phabricator.wikimedia.org/T141258#2582822 (10elukey) a:03elukey
[16:49:03] 10netops, 06Operations: configure port for frdb1001 - https://phabricator.wikimedia.org/T143248#2582916 (10faidon) 05Open>03Resolved a:03faidon Done!
[17:20:37] 10netops, 06Operations: Upgrade cr1-esams & cr2-knams to JunOS 13.3 - https://phabricator.wikimedia.org/T143913#2583037 (10faidon)
[17:22:25] 10netops, 06Operations: Upgrade cr1-ulsfo & cr2-ulsfo to JunOS 13.3 - https://phabricator.wikimedia.org/T143914#2583053 (10faidon)
[17:22:49] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2583073 (10Mholloway) >>! In T138093#2578360, @Jhernandez wrote: > I'm not sure how apps end up serializing parameters, going to ping @M...
[17:22:58] 10netops, 06Operations: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2405243 (10faidon) @elukey, ping?
[17:31:20] 10Traffic, 10netops, 06Operations: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2583091 (10faidon)
[17:31:51] bblack: ^
[17:38:31] 10Traffic, 10netops, 06Operations: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2583135 (10BBlack) Yes - I think in eqiad we only need to reshuffle git-ssh.wikimedia.org, ocg.svc.eqiad.wmnet, and our internal recdns IPs. I think it's likely in the other DCs the sit...
[18:00:28] thanks :)
[20:22:29] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2583763 (10Fjalapeno) On iOS parameters are constructed using a dictionary which is unordered. (There is no concept of an ordered dictio...
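For readers following the T138093 excerpts above and below: the caching-side point of query parameter normalization (the libvmod-boltsort / std.querysort() route comes up later, at 22:09) is that two URLs whose parameters differ only in ordering should hash to a single cache object. A rough self-contained sketch of that normalization, a stable sort of the name=value pairs by name; the example URL is made up and this is not the vmod code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Compare two "name[=value]" parameters by name only. */
    static int
    param_name_cmp(const char *a, const char *b)
    {
        size_t la = strcspn(a, "=");
        size_t lb = strcspn(b, "=");
        int c = strncmp(a, b, la < lb ? la : lb);

        if (c != 0)
            return (c);
        return ((int)la - (int)lb);
    }

    /* Return a newly allocated URL with its query parameters sorted by name. */
    static char *
    sort_query(const char *url)
    {
        char *buf = malloc(strlen(url) + 1);
        char *res = malloc(strlen(url) + 2);
        char *params[64];
        char *qs, *tok;
        int n = 0, i, j;

        strcpy(buf, url);
        qs = strchr(buf, '?');
        if (qs == NULL || qs[1] == '\0') {
            strcpy(res, url);       /* nothing to normalize */
            free(buf);
            return (res);
        }
        *qs++ = '\0';

        /* Split the query string on '&'. */
        for (tok = strtok(qs, "&"); tok != NULL && n < 64; tok = strtok(NULL, "&"))
            params[n++] = tok;

        /* Stable insertion sort by parameter name, so repeated names keep
         * their relative order. */
        for (i = 1; i < n; i++) {
            char *key = params[i];
            for (j = i - 1; j >= 0 && param_name_cmp(params[j], key) > 0; j--)
                params[j + 1] = params[j];
            params[j + 1] = key;
        }

        /* Reassemble path?name=value&name=value... */
        strcpy(res, buf);
        strcat(res, "?");
        for (i = 0; i < n; i++) {
            if (i > 0)
                strcat(res, "&");
            strcat(res, params[i]);
        }
        free(buf);
        return (res);
    }

    int
    main(void)
    {
        char *s = sort_query("/w/api.php?format=json&action=query&titles=Foo");
        printf("%s\n", s);   /* /w/api.php?action=query&format=json&titles=Foo */
        free(s);
        return (0);
    }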
[21:26:17] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2584068 (10dr0ptp4kt) I haven't come across a case where the name-value pair ordering in the URL or form data is material. @Anomie @tgr...
[21:54:11] it's annoying that the "varnish_modules" vsthrottle token bucket filter doesn't have a burst param, might be worth a pull req
[21:54:56] normally a TBF has two independent parameters, "rate" (N items per T time) and "burst" (capacity of the bucket).
[21:55:33] theirs has two parameters "count" and "time", which are effectively used to set rate=count/time and burst=count
[21:56:51] in the "bucket starts full" mental model that their code follows, a bucket starts with "count" tokens in it, a request steals 1 token (fails the filter if 0 remain), and then refills at count/time rate between requests.
[21:57:57] but the burst param is nice to have, it lets you say things like "I want a long-term rate of 5/sec, but I want an initial burst of up to 1000 requests to succeed before the ratelimit really starts mattering"
[21:58:36] it's all about the size of the bucket (and initial fill) being bigger than the count used in specifying the rate
[22:00:54] of course I guess you can get creative with your units to work around it, it's just non-intuitive.
[22:01:13] approximately work around it, anyways
[22:04:20] if you wanted rough 5/sec and a burst of 1000, you could specify it as 1000 per 83min interval :)
[22:04:36] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2584228 (10Tgr) The action API is not cached in Varnish unless the client sets the `smaxage` query parameter (in which case they probabl...
[22:09:52] ema: I'm thinking about pushing out a new v3 package with libvmod-boltsort in it for the query param stuff. skips over waiting ~4 months to finish up v4 on text to see how things play out with the query param normalization ticket.
[22:10:45] we can switch to vmod_std's querysort() for v4 easily
[22:19:46] bblack: alright!
[22:20:34] bblack: should we carry on with upgrading other upload nodes tomorrow?
[22:22:51] ema: if there are zero new concerns overnight after doing some staring at cp4005 logs and various graphs, then yeah I'd say maybe try to convert up to half of them if you have time.
[22:23:07] any more seems like asking for weekend trouble that's harder to fix
[22:23:26] (we can probably depool half and live)
[22:23:29] yup
[22:24:25] I guess really, we always have the option to dns-depool ulsfo, assuming we don't also concurrently lose another DC and such :)
[22:25:06] so maybe all of ulsfo is acceptable too, if (a) there's time to do them and watch them for a bit before you go and (b) you're willing to take that risk with your own weekend plans :P
[22:25:56] right, I'd go for upgrading only a few machines and not the whole cluster given that I'll be offline this weekend
[22:26:09] ok
[22:26:19] always assuming that cp4005 doesn't explode overnight, it looks pretty good at the moment
[22:27:04] honestly I haven't really looked at it much myself. at some point today I might try to stare at relevant data and see if anything looks markedly different from v3 behavior (e.g. in memory, load, hitrate, etc)
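On the vsthrottle discussion above (21:54-22:04): a minimal sketch of a token bucket with the two independent knobs described there, rate and burst, rather than the single count/time pair (which pins burst == count). On the workaround arithmetic: for a long-term rate of 5/sec with a burst of 1000, the equivalent count/time pair is 1000 per 200 seconds (about 3.3 minutes); 1000 per an 83-minute interval works out to roughly one request per 5 seconds instead. This is a single-threaded toy, not the vmod:

    #include <stdbool.h>
    #include <stdio.h>

    /* Token bucket with independent rate and burst:
     *   rate  = tokens added per second (long-term request rate)
     *   burst = bucket capacity, and also the initial fill
     * The count/time style collapses this to rate = count/time, burst = count. */
    struct tbf {
        double rate;
        double burst;
        double tokens;
        double last;        /* timestamp of the previous check, in seconds */
    };

    static void
    tbf_init(struct tbf *t, double rate, double burst, double now)
    {
        t->rate = rate;
        t->burst = burst;
        t->tokens = burst;  /* "bucket starts full" model */
        t->last = now;
    }

    /* Returns true if the request is allowed (a token was available). */
    static bool
    tbf_allow(struct tbf *t, double now)
    {
        t->tokens += (now - t->last) * t->rate;
        if (t->tokens > t->burst)
            t->tokens = t->burst;
        t->last = now;
        if (t->tokens < 1.0)
            return (false);
        t->tokens -= 1.0;
        return (true);
    }

    int
    main(void)
    {
        struct tbf t;
        int allowed = 0;

        /* 5 req/s long-term, but allow an initial burst of 1000. */
        tbf_init(&t, 5.0, 1000.0, 0.0);

        /* 2000 back-to-back requests at t=0: the first 1000 pass. */
        for (int i = 0; i < 2000; i++)
            if (tbf_allow(&t, 0.0))
                allowed++;
        printf("burst: %d of 2000 allowed\n", allowed);   /* 1000 */

        /* One second later only ~5 more tokens have accrued. */
        allowed = 0;
        for (int i = 0; i < 100; i++)
            if (tbf_allow(&t, 1.0))
                allowed++;
        printf("after 1s: %d of 100 allowed\n", allowed); /* 5 */
        return (0);
    }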
[22:27:39] but my baseline assumption is that if anything was horribly wrong, 1/6 of asia and california would already be complaining to us and it would've bubbled to phabricator or #ops by now
[22:28:54] yeah, I haven't compared any metrics with other v3 machines, but there have been no further crashes after removing the nukelru patch and varnishlog/ncsa looks fine
[22:30:35] I can see what looks like probably the expected impact of wiping one node's backend storage around 10:00 and again around 15:15 UTC today
[22:30:43] e.g. from depooling it or restarting it on -sfile
[22:31:09] s/depooling/pooling/
[22:32:05] we've pooled the backend at 15:13
[22:32:18] hmm I wonder what was at 10:00
[22:32:21] and the frontend at 10
[22:32:23] :)
[22:32:28] ah!
[22:32:44] I bet vslp hashes differently than chash, which is why it looks like fractional loss of backend storage
[22:33:18] probably reduces our efficiency in general so long as we're running mixed-chashing frontends heh
[22:33:26] true on the vslp from ulsfo->codfw for codfw's storage as well
[22:34:15] interesting
[22:34:54] assuming we space out the conversions at a steady/reasonable pace while going through a DC, it's not awful. say no faster than ~15-30 minutes between starting each one, which might be hard anyways with verification and whatnot.
[22:35:03] when thinking about it from 1x DC perspective
[22:35:16] the remote-DC perspective matters less anyways, as normally remote-hit is low rate
[22:35:40] it does make a case that it's not an awesome idea to switch half the cluster in one DC and let it sit like that for days, though.
[22:36:28] yeah I was thinking whether it would make sense to start converting the backends first, but it sounds messy
[22:36:34] it will basically cut our local-backend storage size in half, effectively, while sitting in that 50/50 split
[22:36:54] well not quite half, but close
[22:37:16] (n/2)+1/n, where n is 6 in ulsfo
[22:37:44] which is half
[22:37:55] -ish
[22:38:06] you get a 1/n chance of luckily-hashing to the same backend under both directors :)
[22:39:43] halfish :)
[22:42:00] anyways, it's late, worry about it in the morning :)
[22:59:11] good point! 'night
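A slightly more explicit version of the half-ish storage estimate at 22:36-22:38, under the simplifying assumption (not stated in the log) that every cached object keeps seeing requests through both the chash and the vslp frontends during the 50/50 split:

    expected backends holding a copy of an object = 2 - 1/n   (two directors, agreeing with probability 1/n)
    effective unique-object capacity ~= n / (2 - 1/n) = n^2 / (2n - 1)
    for n = 6: 36/11 ~= 3.3 backends' worth out of 6, i.e. ~55%

which lands in the same place as the quick (n/2)+1/n figure in the log: a bit more than half.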