[03:58:50] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3008159 (10Tgr) I think the third-party MediaWiki concerns are somewhat understated and are not actually third party as we use the same arrangement via Sw...
[08:17:21] _joe_: should we merge https://gerrit.wikimedia.org/r/#/c/335844/ and release 1.13.4?
[08:32:27] <_joe_> ema: probably both your patches should be merged
[08:33:04] yeah I've merged the other one already :)
[08:36:07] <_joe_> oh ok, +1
[13:17:47] so while cache_upload reboots I'm playing a bit with https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats
[13:17:59] I've added the request rate and hitrate
[13:55:10] that's the varnishstat hitrate?
[13:55:39] I've never really understood what it means, but I guess the code and/or docs should be able to elucidate
[13:55:53] that's cache_hit / (cache_hit+cache_miss)
[13:55:59] (whether or where it includes internal responses, passes, and especially misses that create hfp objects)
[13:56:44] our header-based stuff puts the latter into the pass category
[13:56:58] so there's another metric for hits for pass (MAIN.cache_hitpass)
[13:57:18] yeah but that's probably hitting a hitpass and performing a pass as a result
[13:57:40] missing -> creating a hitpass is different
[14:07:10] right I imagine that cache_hitpass means number of times a hfp object was hit, meaning that (probably?) cache_hit doesn't include hfp hits
[14:07:17] no idea about hfp creation
[14:08:33] probably in a sane world that would be a pass
[14:09:24] oh and there's a separate counter for passed requests: MAIN.s_pass
[14:25:48] right, assuming the author(s) aren't insane, given they have explicit metrics for hit, miss, pass, and hitpass
[14:26:35] the only two suspicious/questionable categories are: misses that create hit-for-pass objects (you could see them being counted as miss or hitpass, but I'm betting miss, which isn't what we want), and synthetics
[14:26:53] synthetics could be a rather thorny problem, as they could happen after a lookup results in hit or miss, etc.
[14:31:49] ema: you probably have a better feel for cache stats now than I do, so... I'm looking at making changes to cache hardware orders going forward...
[14:33:27] so for comparison, today the "standard" (ulsfo aside) for cache_text/upload is ~256G of ram (affects fe size) and 2x400G SSDs (affects be size)
[14:33:45] and 8x text machines and 10-12x upload machines
[14:34:03] roughly anyways, some have smaller mem, etc
[14:34:19] but that was roughly what our newest-gen ones were, e.g. all the esams text/upload
[14:34:54] the other part of sizing all of this is that machine count affects chash data dropout in the backends on crash/depool
[14:35:11] e.g. with 4 live nodes, when a node depools from chash we lose 25% of storage that remaps to elsewhere
[14:36:07] I've looked at our total network in/out globally across the cache boxes too, and basically it's not a limiting factor before other things are (limiting on how small we can make the clusters)
[14:37:27] my thinking is 6 cache boxes per cluster going forward serves the chash dropout needs
[14:38:08] it allows the scenario where 1 host dies (and stays dead for a little while pending hw fix), leaving 5 hosts in the cluster, and then 1/5 are routinely depooling for maintenance or restart crons, meaning 20% of storage dropping out routinely.
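For reference, the hitrate defined at 13:55:53 can be reproduced straight from varnishstat counters. A minimal Python sketch, assuming the Varnish 4.x `varnishstat -j` JSON layout (counters as top-level keys with a "value" field; newer releases nest them under "counters"); it inherits the same ambiguity about where hit-for-pass creation is counted:

```python
# Rough sketch, not the dashboard's actual query:
# hitrate = MAIN.cache_hit / (MAIN.cache_hit + MAIN.cache_miss)
import json
import subprocess


def varnishstat_counters(instance=None):
    """Return {counter: value} from `varnishstat -j`.

    Assumes the Varnish 4.x JSON layout (counters as top-level keys with a
    "value" field); newer releases nest them under "counters", handled below.
    """
    cmd = ["varnishstat", "-j"]
    if instance:
        cmd += ["-n", instance]  # -n selects a non-default varnishd instance, if any
    raw = json.loads(subprocess.check_output(cmd))
    counters = raw.get("counters", raw)
    return {name: c["value"] for name, c in counters.items()
            if isinstance(c, dict) and "value" in c}


def hitrate(c):
    # Whether misses that create hit-for-pass objects land in cache_miss is
    # exactly the open question above; this number inherits that ambiguity.
    hit, miss = c["MAIN.cache_hit"], c["MAIN.cache_miss"]
    return hit / (hit + miss) if (hit + miss) else 0.0


if __name__ == "__main__":
    c = varnishstat_counters()
    print("hitrate: {:.2%}  hitpass: {}  pass: {}".format(
        hitrate(c), c["MAIN.cache_hitpass"], c["MAIN.s_pass"]))
```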
[14:38:41] or varnishd explosions :)
[14:38:52] it's what we live with in ulsfo cache_text, and 20% of backend is "acceptable". worst case we have 2x hard fails and still ongoing maintenance, and we're losing 25% on backend dropouts. beyond that we'd probably depool the whole DC until hw is fixed.
[14:39:38] (and with 5x DCs soon, I think we're in a better position to accept DC depools too)
[14:39:47] right
[14:39:57] (kind of mentally assuming it should be ok to have 2/5 out of service and things aren't crazy)
[14:40:50] so, with that as a starting point and trying to minimize machine count (which drives cost more than anything else, ultimately), let's say the standard cache buildout is 6xText + 6xUpload machines
[14:41:19] for text, I think we can spec them like they are today, ~256G ram + 2x400 SSDs, because stats-wise it seems like we have more storage than we need there for the data
[14:41:34] (in terms of optimizing hitrate:cost ratio)
[14:42:01] and then for upload, trying to upsize the backend for sure. We could put sanely up to 4x800 SSDs in there for upload.
[14:42:37] so doing this conversion in say, esams, cache_text is basically losing two nodes permanently, and cache_upload loses half the node-count, but ends up with 4x the BE storage it had before in the net.
[14:42:51] because cache_upload seems like the BEs are under-sized
[14:42:56] (today)
[14:43:24] I think that all makes intuitive sense to me as a win on cost and perf
[14:44:03] but (a) you might think differently and know better at this point! and (b) I'm not sure whether it might be worth it to also bump ram on the cache_upload for a larger frontend cache, too (or is it maxed out for useful expansion given the size limits we place on objects there, etc)?
[14:44:15] we could potentially go +50% on the upload nodes memory, too
[14:45:12] cpu power seems to be the one thing we have more than enough of, and at 6x install nodes per cluster we have network port bandwidth to barely handle "all but one DC offline", so those don't seem like limiters
[14:45:30] CPU power is certainly enough at the moment
[14:46:01] yeah. having excess is good too, but I think our last orders all assumed way more would be wasted on SSL than what ended up actually happening
[14:46:17] more memory seems like a good idea unless it drives costs up considerably
[14:46:31] for upload, or for both?
[14:46:58] I mean, in theory more memory or more storage always adds some benefit, but there's a curve beyond which it's not reasonable to add more linearly, because the benefit is small
[14:47:23] it seems like cache_upload almost certainly benefits from expanding BE storage, and maybe benefits from expanding FE mem
[14:47:40] text, I kind of assume is already at a reasonable point on those curves, and it's not worth $$ to chase small gains
[14:48:59] anyways, things to think about as you stare at hitrates and object counts, etc :)
[14:49:36] but I have to put some configs together soon for asia + ulsfo refreshes, and I want to do some kind of change along the above lines and use it as the model for all future refreshes until things change again.
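Back-of-the-envelope arithmetic for the chash dropout scenarios above, assuming an even consistent-hash split across live backends (a rough sketch of the reasoning, not how the hashing is actually implemented):

```python
# Depooling one of N live backends remaps roughly 1/N of the cluster's
# backend storage; helper name is illustrative only.

def dropout_fraction(live_nodes: int) -> float:
    """Share of BE storage remapped when one of `live_nodes` depools."""
    return 1 / live_nodes


# 6-node cluster, all healthy, one node depooled for maintenance/restart: ~17%
# 6-node cluster with 1 host hard-down (5 live), one more depooled:        20%
# 6-node cluster with 2 hosts hard-down (4 live), one more depooled:       25%
for live in (6, 5, 4):
    print("{} live nodes -> {:.0%} remapped on a single depool".format(
        live, dropout_fraction(live)))
```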
[14:51:06] (it's tempting to add >6 machines per cluster for more tolerance of long-fix-times on hw fail or rapid hwfail... but I figure in core DCs even though they're more-important the fix times should be smaller with local staff, and in remote DCs, we always have the option of DC depool in the unlikely event of multiple hwfail per cluster)
[14:53:58] the 6x machines reasoning seems valid to me yeah
[14:54:19] I'll stare at stats to answer the other questions :)
[15:12:27] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?var-server=cp3048&var-datasource=esams%20prometheus%2Fops&from=1484385411683&to=1484433559798
[15:13:55] interesting stuff, our upload backends take ~7 hours to go from empty to the point where they start nuking (12:30 -> 19:30)
[15:15:03] and the nuking doesn't seem to have a significant impact on hitrate (at first glance)
[15:20:19] compare with a ulsfo machine where it takes less for the upload backend to start nuking, and the nuking rate is higher for the backend than for the frontend (whereas on esams it's the other way around)
[15:20:23] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?var-server=cp4013&var-datasource=ulsfo%20prometheus%2Fops&from=1485389281456&to=1485443132221
[15:21:23] well is "nuking" in this case from expiry, purge, space-eviction to make room, or some combination?
[15:21:40] space-eviction only I think
[15:32:35] ulsfo is special though
[15:32:51] ulsfo has less be storage (due to fewer nodes), but I think also has significantly less fe mem, too
[15:33:10] right, it does have less fe mem
[15:34:37] one of the things I was looking at was comparing fe/be hitrates too, how they stack up in the varnish-caching dashboard
[15:35:35] e.g. for text esams, peak-ish values tend to be something like "frontend: 90 local: 95.8 overall: 96.8"
[15:36:19] but esams upload is more like "frontend: 83.6 local: 96.6 overall 97.4"
[15:36:48] the bigger fe<->local spread on upload to me means something about how undersized the FE mem is there, relatively
[15:37:23] but then it may also just be saying something about the fact that we don't cache objects > X size in the FE at all as a rule
[15:39:53] (but we also have some stats/estimates on size breakdowns somewhere, from the storage split work)
[15:40:32] in a closed ticket I can't find :P
[15:41:26] ah
[15:41:27] T145661
[15:41:28] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661
[15:42:19] so our upload frontend hfp size limit is 256KB
[15:43:09] the ticket says something like 98.4% of requests should fall within that size limit, so 83.6% is well under that
[15:45:32] we also know from some of the data in that ticket, that to some approximation there's ~587GB of data in the 0-256K size range that's commonly accessed
[15:45:47] so that's a more-direct indicator that, yes, more FE mem would be useful if we can swing the cost
[15:47:55] well, all of those absolutes are from 1/1000 data though, over several days
[15:48:13] it might miss a bunch of less-common objects
[15:48:44] if we assumed the 1/1000 saw all of the useful objects, there's only 2.3TB of total unique data in the dataset in that ticket
[15:49:24] and esams has ~8.6TB of storage today, yet doesn't achieve any kind of perfect hitrate or anything
[15:50:02] I donno
[15:50:30] I could see the data arguing that bumping FE mem from 256 to 384 is probably worth something, and that maybe doubling instead of quadrupling the backend storage is more-sane for cost, too
[15:50:46] since we're already at a point where we only miss like 2.5% of reqs once the be storage is in play
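To make the 1/1000-sampling caveat concrete, here is a hypothetical sketch of how figures like the ~587GB / 2.3TB numbers from T145661 could be derived from sampled (url, size) records; the record format and helper names are assumptions for illustration, not the ticket's actual methodology:

```python
# Unique data volume at or under the 256KB frontend cutoff, estimated from a
# 1/1000 request sample.

FE_SIZE_CUTOFF = 256 * 1024  # bytes


def unique_data_volumes(sample):
    """sample: iterable of (url, size_bytes) tuples from the sampled log."""
    sizes = {}
    for url, size in sample:
        sizes[url] = size  # dedupe by URL, keep one size per object
    under = sum(s for s in sizes.values() if s <= FE_SIZE_CUTOFF)
    return under, sum(sizes.values())


# Caveat from the discussion: 1/1000 sampling scales request counts, not
# unique-object bytes -- an object seen once still contributes its full size,
# and objects too rare to appear in the sample are missed entirely, so these
# totals are a lower bound on the real working set.
```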
[15:55:44] the fe hit spread between ulsfo and esams is larger for text than for upload, which might suggest more memory would help text too?
[15:55:56] 30-day average fe hitrates:
[15:56:01] upload-esams 82.8%
[15:56:01] upload-ulsfo 79.9%
[15:56:01] upload-esams 82.8%
[15:56:01] upload-ulsfo 79.9%
[15:56:01] upload-esams 82.8%
[15:56:04] upload-ulsfo 79.9%
[15:56:06] pastefail
[15:56:15] text-esams 88.6%
[15:56:21] text-ulsfo 85.1%
[15:57:03] well ulsfo's memory is very small I think
[15:57:06] it's less than half of esams
[15:57:31] err no, not that bad, but still
[15:58:46] we malloc 76G in ulsfo, and 101G in esams, currently
[15:58:55] for the big clusters, I think
[15:59:07] 76/192 and 101/256, I think
[15:59:30] ~0.4
[15:59:54] a 384GB node under the same rules would malloc 153G
[16:00:17] (but we've talked about changing the sizing rule so that bigger-mem hosts get an even larger fraction. I wasn't entirely happy with my last commit on that front and never merged it)
[16:06:21] the current rule is mem*0.4, proposed commit was (mem-32)*0.6, another good option is (mem-64)*0.75
[16:06:56] we could get 211 or 240 out of 384 under the latter two rules
[16:08:34] but yeah, given upload does get ~97.5% hit using current BE storage, maybe quadrupling that isn't a useful way to spend money. there's only so much to gain there, and some miss rate from replacement/new underlying objects and such is unavoidable
[16:08:50] (and ttl expiry vs long-tail accesses)
[16:09:17] we could still double it, and shoot for going 384GB mem on both sets
[16:09:35] +1
[16:09:45] (I mean: double upload's total BE storage, and reduce text's total BE storage slightly, and go for 384GB on both)
[16:10:31] text's overall miss-rate is worse, but I don't think that's for lack of space. it's more like lots of crazy one-hit-wonders that are unpredictable, etc
[16:10:44] (API reqs with short lifetimes, etc)
[16:13:00] and I think, even with the mem/storage bumps, having 12 nodes per DC is going to save a ton of cost
[16:13:21] post-asia under that plan once everything warranty-refreshes, we'd have 60 total cache nodes in 5 DCs, instead of our current ~100 in 4
[16:13:54] and increasing mem/storage in one node is ~10-20% price bump, vs the cost of whole new machines of this sort
[16:16:51] pybal 1.13.4 with the logging changes seems to work fine on pybal-test2001, I've uploaded it to carbon and we can upgrade while rebooting the LVS hosts for T155401
[16:16:52] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401
[16:18:14] awesome
[16:29:45] uh we've still got varnish-be-rand stuff in etcd
[16:29:49] that shouldn't be the case right?
[16:31:45] yeah it's supposed to be deleted, and we're not accessing it
[16:31:59] but conftool-sync or whatever didn't delete them when they died from the puppet repo data
[16:32:35] for that matter I think we have other unused keys in etcd in general, because typos
[16:32:52] (you can typo various labels in depool/repool commands and it creates the pointless keys for you automagically)
[16:32:57] some cleanup is in order :)
[16:33:45] <_joe_> ema: do you still have the objects or just the directories?
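Picking up the frontend malloc sizing rules quoted at 16:06:21 above, a quick sketch that reproduces the numbers mentioned (function names are illustrative; the real rule presumably lives in puppet):

```python
# Frontend malloc sizing candidates, values in GB.

def current_rule(mem_gb):      # in use today
    return mem_gb * 0.4


def proposed_rule(mem_gb):     # the unmerged commit
    return (mem_gb - 32) * 0.6


def alternative_rule(mem_gb):  # the other option floated
    return (mem_gb - 64) * 0.75


for mem in (192, 256, 384):
    print("{}G host -> {:.0f} / {:.0f} / {:.0f}".format(
        mem, current_rule(mem), proposed_rule(mem), alternative_rule(mem)))
# 192G -> ~77 / 96 / 96     (ulsfo mallocs ~76G today)
# 256G -> ~102 / 134 / 144  (esams mallocs ~101G today)
# 384G -> ~154 / 211 / 240  (the 153/211/240 quoted above, modulo rounding)
```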
[16:34:38] _joe_: objects I guess, try confctl select service=varnish-be-rand get
[16:34:42] bblack@puppetmaster1001:~$ confctl select service=varnish-be-rand get 2>&1|wc -l
[16:34:46] 109
[16:35:11] <_joe_> heh, this should all be fixed easily when we update conftool next week
[16:35:26] we deleted varnish-be-rand from conftool-data and all our scripts that touch pooling, etc
[16:35:31] it's just still lingering in etcd
[16:35:42] <_joe_> bblack: yeah I remember the bug
[16:36:13] I really hate our varnish-fe and nginx naming too, but I feel like at this point renaming a live service would be very painful
[16:36:39] those should've been something else like "public-http" + "public-https" or something, since the underlying software will probably undergo multiple changes in the future
[16:37:00] <_joe_> bblack: I think we might be able to do a renaming
[16:37:27] <_joe_> and it shouldn't be too painful
[16:37:28] yeah but all the moving parts, with confd and pybal watching those names for live traffic while you rename, etc
[16:37:53] <_joe_> copy, update pybal/puppet, delete
[16:37:59] <_joe_> that's what we could do
[16:38:12] double-define backends in VCL through confd too?
[16:38:15] maybe
[16:38:29] oh right those particular names don't affect VCL anyways
[16:38:34] only backend stuff does
[16:38:40] just pybal
[16:38:51] <_joe_> we can just define the new service, change puppet accordingly, then remove the old one once we're done moving everything to the new name
[16:39:35] probably "varnish-fe" (port 80) will move to nginx (software) sometime in the next several months
[16:39:44] and probably both will become ATS sometime in the future
[16:39:53] at which point neither current label will be appropriate heh
[16:40:51] we've been holding up the nginx+port80 switch on the non-canonical redirect problem and stream.wm.o legacy
[16:41:08] but I think we could also just move the host-header matching logic to nginx config too and jump past that
[16:44:40] all cache nodes rebooted
[16:45:58] moritzm: ^
[16:46:11] \o/
[17:41:56] 07HTTPS, 10Traffic, 10Collection, 06Operations, 13Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3009925 (10Ckepper) We have installed letsencrypt/certbot. You can now start testing on https://tools.pediapress.com/
[17:51:14] 10netops, 06Analytics-Kanban, 06Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3010000 (10elukey) Completed the AQS work due to T157533 (under Brandon's supervision). I am going to keep working on this task during the next days to fix the remaining items. Caveat:...
[17:55:13] nice!
[18:35:27] 10netops, 06Operations: Add firewall exception to get to wdqs[12]003.(codfw|eqiad).wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T157593#3010219 (10Gehel)
[18:36:35] 10netops, 06Operations, 05Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#3010238 (10Cmjohnson)
[19:21:01] 07HTTPS, 10Traffic, 10Collection, 06Operations, 13Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3010423 (10Dzahn) @Ckepper very cool :) thank you looks good to me and gets A rating here https://www.ssllabs.com/ssltest/analyze.htm...
[21:18:12] mutante: feel free to approve, I removed the -2 on the affected patch
[21:19:59] Platonides: +1ed both of them, thanks
[21:20:09] i just leave the +2 to mw-deployers
[21:20:30] ok
[21:20:38] should we add it to a swat?
[21:20:39] np
[21:20:52] probably
[21:21:06] no special hurry, though
[21:21:08] ok, let me see
[21:23:18] Platonides: i'll put them on evening swat under my nick
[21:37:44] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3011094 (10GWicke) @Tgr, the concerns you raise are primarily about the implementation, and not really about the API. I think it is important to separate...
[23:55:04] 07HTTPS, 10Traffic, 10Collection, 06Operations, 13Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3011661 (10Dzahn) a:03Dzahn
[23:55:30] 07HTTPS, 10Traffic, 10Collection, 06Operations, 13Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3003811 (10Dzahn) https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1480106&oldid=1478688