[00:52:09] Somewhat related to traffic, I've been thinking lately about how cars could communicate with each other.
[00:52:38] Like some kind of SMS for nearby cars? So you could communicate "turn on your goddamn headlights" or "sorry for cutting you off" more easily.
[00:54:33] https://www.technologyreview.com/s/534981/car-to-car-communication/
[01:30:12] mutante, https://www.theguardian.com/technology/shortcuts/2017/sep/11/tesla-hurricane-irma-battery-capacity
[02:06:59] that's what car horns are for. the longer it beeps and the more they can see you punching it with your fist, the more negative the sentiment is :)
[08:42:19] 10Traffic, 10Operations, 10monitoring: prometheus -> grafana stats for per-numa-node meminfo - https://phabricator.wikimedia.org/T175636#3599588 (10ema) p:05Triage>03Normal
[08:57:24] hey ema, where are pybal stats!
[09:05:37] mark: :)
[09:06:05] * ema dodges the question with a swift smile
[09:08:33] * mark sets UBN
[10:15:50] 10Traffic, 10Operations, 10monitoring: prometheus -> grafana stats for per-numa-node meminfo - https://phabricator.wikimedia.org/T175636#3598644 (10fgiunchedi) AFAICT the `meminfo_numa` collector has been introduced in node_exporter 0.13 so it is already available across the fleet (we are running 0.14). It s...
[12:05:53] bblack: I've published a minor edit to https://gerrit.wikimedia.org/r/#/c/376751/, curly bracket indentation OCD
[12:08:16] ema: any thoughts on how to hook up the hiera that doesn't cause more (puppet, or logical for humans) scoping issues?
[12:08:37] every time I start to look at it, 5 minutes later I'm halfway ready to rewrite the entire varnish-related class hierarchy :P
[12:08:50] :)
[12:09:08] ideally we want a hieradata key we can set at a per-cluster+dc level (e.g. for exactly upload@ulsfo)
[12:09:19] which then takes effect in this backend-only VCL shared by all cluster types
[12:10:12] perhaps we should merge https://gerrit.wikimedia.org/r/#/c/376242/ first? What does it look like?
[12:10:27] I've had some thoughts on how to refactor everything along the way, but they always fall afoul of our puppet coding standards, too
[12:10:37] I think this may be a more-complicated edge case than they consider
[12:12:09] ema: does it come out as zero-diff in the compiler for various cases?
[12:12:20] well, zero functional diff on disk
[12:12:57] the scoping changes there are hard to follow on paper, I was worried whether some of your lines cause the singleton-ization of a file that's deployed for both instances or whatever
[12:13:27] so there are catalog changes of course, but no on-disk file changes http://puppet-compiler.wmflabs.org/7781/
[12:27:38] varnish maybe gets wrapped by some kind of "two-instance node" sort of abstraction (like role::cache::instances, sort-of)
[12:28:26] the notion of the full scope of a cpNNNN node would be some kind of "cacheproxy" class at some layer that brings together tlsproxy+"two-instance-node" above (or perhaps without that, a more complicated and lengthy combination of tlsproxy+varnish)
[12:29:05] and then you can break out clusters (text, upload) as variants of that
[12:29:19] so clusters are clearly roles, and the cacheproxy node concept is a profile
[12:29:34] but there's a need for more than just one layer beneath that profile somewhere
[12:29:54] (or else things are insufficiently-abstracted and it gets messy and redundant)
[12:30:20] (or in the tlsproxy case, we're missing the abstraction that makes tlsproxy useful outside of cpNNNN)
[12:31:31] complicating the above is there's a bunch of cross-cutting concerns that tie them all together deeply: the pair of varnish instances and the nginx/tlsproxy share some data. they also share some control structures in the sense of ordering startup units and strange first-run stuff to avoid port 80 conflicts, etc
[12:32:13] there's templates that are shared between the instancepair in two different ways: base templates used to template separate files for each, and files which are templated out to disk once but used by both.
[12:34:40] there's even singleton services that exist at the instancepair level of abstraction (like vhtcpd, which has per-cluster config and deploys once per service, and connects to both local instances)
[12:35:08] and when you start bringing analytics into it, there's conditionals that will vary from the outermost scope and change which classes are brought in
[12:35:34] e.g. text-vs-upload hieradata will determine somewhere a few layers deeper whether the varnishmedia stats daemon is provisioned
[12:36:45] and then all the config we currently pass around in vcl_config (and its temporary abstractions at certain layers as common_vcl_config, fe_vcl_config, be_vcl_config) is poorly-factored in terms of where things actually have functional effects, and ditto all the variables we currently pass down into ::instances + varnish::instance
[12:37:02] and the common and wikimedia vcl messes that your patch touches on
[12:39:34] I dunno, every time I run down this stuff and try to put together a mental layout of classes, I get hung up on something or other
[12:40:30] I think we have to accept that "modules" can call other modules. Some modules are generic software abstractions (like the "varnish" class should be, and the "nginx" class actually is), and some modules are going to be for building more-wmf-specific abstractions out of those, which are still components underneath profiles.
[12:56:58] re: misspass knob in hieradata, I imagine we'd use it mostly in case of DC repools, so do we really need to specify upload@eqiad? Perhaps setting the flag at the DC level would be good enough
[12:57:43] well, we do sometimes want the capability. admittedly it's rarer, but if we can provide both solutions easily, it's nice to have the tools.
[12:58:31] e.g. we might depool only one cluster and not the other in some DC, under attack or dysfunction of some kind. or sometimes it's just to reduce bandwidth usage (e.g. some scan saturating the esams link running through all the upload URLs)
[12:59:03] I don't think we have role-based hieradata at the per-DC-for-all-clusters level anyways, do we?
[12:59:20] (since the DC names don't compose part of a role name)
[13:00:16] but I guess we can set a generic hieradata key on the whole DC and only our cp code pays attention to it
[13:00:45] yeah I was just going through the hieradata hierarchy and I think we've got support for either setting stuff at the DC-level or for a certain cache-type
[13:00:51] e.g. cp_backend_warming: true in hieradata/eqiad.yaml ?
[13:01:18] we obviously have dc+cache-level controls in some other senses, but it's not through role-based lookups.
[13:02:27] oh, I think we have the capability, we just haven't ever created the files
[13:02:49] we've been putting per-cluster stuff in hieradata/role/common/cache/text.yaml
[13:03:04] but we can also set the same keys in hieradata/role/eqiad/cache/text.yaml
[13:03:13] oh cool
[13:03:20] or for a whole-DC repool, in hieradata/role/eqiad/cache/*.yaml I guess, x3
[13:03:42] I don't think the role-based hiera lookups use inherited roles, afaik, or we could use base there
[13:04:42] r::c::base is probably on its way out anyways
[13:05:16] something like r::c::base + r::c::instances sort of code is what belongs in the profile for "cacheproxy cluster", with the specific clusters being roles, I think
[13:06:00] r::c::upload + ::text + ::misc are already fairly close. in the last couple of refactors that was one of the goals, to get them close enough that they could be one class that varied on actual parameterization
[13:09:23] one thing that ends up being odd about putting text and upload way out at the role level, is the existence of per-cluster templates (e.g. text-frontend.inc.vcl.erb) and extra_vcl files that are cluster-specific
[13:09:40] given that way out at the role level, things should only be differentiating via hieradata really
[13:10:33] I kind of presume that means that a correctly-factored role wouldn't normally end up having files/ or templates/ at the role level
[13:11:46] (and it seems kinda hacky to push those VCLs down to modules/cacheproxy or profiles/cacheproxy as sets of per-cluster files switched by the clusters' names and/or hieradata)
[13:37:02] 10Traffic, 10Operations, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3600287 (10BBlack) We're still missing caps for the upload cluster, right? (well and misc, but that case isn't all that important here). I'm a little concerned about the inter...
[13:44:14] https://grafana.wikimedia.org/dashboard/db/varnish-transient-storage-usage?orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=varnish-upload&var-layer=frontend&from=1504832400616&to=1504834875705
[13:44:35] so some of the largest short spikes I find are actually in ulsfo upload
[13:44:53] that one in particular is crazy. it's short enough that you don't see it even as a max value on the 30d view, but you do in the 7d view
[13:45:08] in that zoom, we can see there's no correlation to the shortlived or uncacheable graphs
[13:47:15] it's cp4007 fe in that zoom that has the issue. it's fairly normal up until about 1:14, then ramps up to a huge value by ~1:19, then recovers from that plateau over about 1:23-1:27
[13:48:04] so about 13 minutes duration for the entire hill, or you can think of it as a 4 minute plateau, or a 9 minute period of growth->plateau before it starts shrinking again
[13:48:17] I still have a hard time fathoming what exactly could be happening there
[13:49:07] my best guess, given it's not shortlived, would be something like this:
[13:51:30] a ulsfo user requests a Very Large File (let's say ~12GB?), and they can only download it at about 125Mbps, so it takes 13 minutes to download
[13:52:13] but the ulsfo backend handling this file (really, at that size, it's streaming uncacheable all the way through from swift) streams it into the ulsfo frontend much more rapidly, causing it to spool up in transient there and only get de-allocated as the user downloads.
[13:53:37] while the shortlived rate graph proves this isn't a spike in count of shortlived's, the "uncacheable" graph is similar (the rate of uncacheable objects), and thus doesn't preclude that, in that noise, there's 1x huge uncacheable happening.
[13:54:27] if the "lack of pacing" argument above is the real situation, what we want is some kind of varnish tunable to limit how much we'll buffer when streaming an uncacheable response through
[13:54:52] thus pushing the pacing back down towards the real origin, instead of slurping as fast as possible and then buffering in transient
[13:56:19] anyways, the scenario could be vastly different. it could be we have smaller versions of that scenario going on all the time (unpaced buffering in the frontend for, say, ~100MB-1GB-sized objects), and these spikes represent statistical anomalies when several such things collide perfectly in time
[13:58:30] it could also be, rather than a pacing/buffering issue, a different sort of shortlived issue
[13:59:03] the shortlived time is 10s (if calculated obj.ttl of a cacheable object is <10s, it's "shortlived" and goes in transient storage instead of main storage)
[13:59:26] we don't intentionally have 10s TTLs, but maybe they're happening due to expiry on normal objects
[13:59:56] e.g. a 1d-TTL object in the backend is fetched by the frontend when it's in its last 7s of life, causing it to go through shortlived/transient instead of normal malloc storage with evictions.
[14:01:03] but we also have the frontends' 256KB object size limitation in play there too
[14:01:12] so either way, a large file isn't going into the malloc storage there
[14:03:44] I'm inclined to say we're safer setting some "reasonable" limit on upload transients, which captures most of the graph but avoids the tall tiny spike values
[14:04:14] then we're protected from oom situations, and worst case what would've been those spikes convert into some kind of 503. they're rare, and perhaps a user report on it will give us info about what causes them.
[14:06:38] everywhere but esams, 6G for upload-frontend seems a reasonable cutoff
[14:06:52] esams would need a value more like 10 or 12, ugh
[14:08:01] for the upload-backend case it's more like ~5G to cover them all, maybe do 6 just to be safer
[14:08:28] I wonder why esams suffers from higher (both avg + peak) upload-frontend transient than the rest
[14:08:40] more clients-per-frontend? higher latency to the next backend?
[14:15:23] hmmm, I have another hint, fetch_maxchunksize?
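(For reference, the arithmetic behind the pacing theory above, using the purely hypothetical figures from the discussion rather than any measured values:)

\[
t_{\mathrm{download}} \approx \frac{12\,\mathrm{GB} \times 8\,\mathrm{bit/B}}{125\,\mathrm{Mbit/s}} = \frac{96000\,\mathrm{Mbit}}{125\,\mathrm{Mbit/s}} = 768\,\mathrm{s} \approx 13\,\mathrm{min}
\]

If the internal swift->backend->frontend fetch completes much faster than that, up to roughly the whole object (~12GB) can sit in frontend transient while the client drains it, which is about the size of the observed spike.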
[14:15:56] apparently some were seeing crazy memory growth under early v4 (vs v3) due to a large default fetch_chunksize. but the default has by now (our builds) been reduced to 16k
[14:16:08] but there's still fetch_maxchunksize defaulting to 0.25GB
[14:16:57] I suspect in practice, in some scenarios, we're eating this much transient at minimum (regardless of other pacing concerns) per concurrent uncacheable (e.g. large) transfer through the frontend.
[14:17:12] that might make sense as a theory for the statistical spikes, and esams' overall higher level of transient
[14:17:38] a 12GB mem spike would only imply ~50 such transfers happening concurrently
[14:18:17] the varnish params advice also says "Making this too large may cause delays and storage fragmentation."
[14:18:41] I assume the penalty for making it too small is inefficiency in transferring/storing very large objects coming through very quickly
[14:20:43] if we assume it's that kind of effect, then without understanding a bunch of other details, we can be kind of handwavy and say "we're not going to kill efficiency much so long as that number isn't unfairly limiting BDP"
[14:21:02] and probably the edge-case BDP we're looking at there is a reasonable limit on a single fetch stream over our esams<->eqiad link
[14:24:41] so the BDP there for the whole link is somewhere in the ballpark of 100MB (10Gbit/s @ 83ms, if my math is right, is about 98MB?)
[14:26:05] 10Traffic, 10Operations, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3600514 (10ema) >>! In T164768#3600287, @BBlack wrote: > We're still missing caps for the upload cluster, right? (well and misc, but that case isn't all that important here)....
[14:27:18] the default fetch_maxchunksize is 256MB. so let's say we cut it to 64MB. By cutting down to 25% of the current value, if it was going to impact our transient graphs positively, you'd think that would be enough to show us.
[14:27:41] and it's safely not far off from the whole link's BDP, so I don't think at that level it becomes a limiting factor for be->be transfers of single streams, either.
[14:29:04] the note for the param within varnishadm says:
[14:29:04] NB: We do not know yet if it is a good idea to change this
[14:29:04] parameter, or if the default value is even sensible. Caution
[14:29:04] is advised, and feedback is most welcome.
[14:29:43] it's the frontend case we care about the most, and of course the BDP is very different there (to the local BE)
[14:31:19] the average rtt locally is ~0.060ms (that includes the localhost one that's faster, yes), in a quick test
[14:32:27] so the BDP there is closer to like 80KB
[14:32:42] so that case isn't gonna be hurt by this either of course...
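(Worked out explicitly, the two BDP ballparks quoted above, assuming a 10Gbit/s path and the RTTs mentioned in the discussion; the "about 98MB" figure is the same number expressed in MiB:)

\[
\mathrm{BDP}_{\mathrm{esams \leftrightarrow eqiad}} = 10\,\mathrm{Gbit/s} \times 83\,\mathrm{ms} = 830\,\mathrm{Mbit} \approx 104\,\mathrm{MB} \approx 99\,\mathrm{MiB}
\]
\[
\mathrm{BDP}_{\mathrm{fe \leftrightarrow local\ be}} = 10\,\mathrm{Gbit/s} \times 0.060\,\mathrm{ms} = 600\,\mathrm{kbit} = 75\,\mathrm{kB}
\]

So a 64MB cap sits a bit below the whole-link BDP but far above the local fe<->be BDP, which is the point being made above.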
[14:36:36] I'm gonna try dropping it to 64MB on some esams upload nodes using varnishadm, see what happens to graphs
[14:37:46] ok :)
[14:39:15] !log manually setting fetch_maxchunksize=64MB via varnishadm on cp3034 fe+be instances
[14:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:21] 10Traffic, 10Operations, 10TechCom-RfC, 10Wikimedia-Developer-Summit-2016, and 5 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3600637 (10Krinkle)
[14:53:10] initially the results look promising
[14:53:33] but even looking over the past 3h view, there's lots of random ups and downs, so the experiment could be a no-op and we're seeing randomly-positive variations :P
[14:54:15] fe transient did drop off, by ~10m after application
[14:54:34] but there've been similar dropoffs not far back in history, so meh
[14:55:09] I'm gonna let this stew a little while just to be sure there are no problem reports, and see the micro-trends develop a little more
[14:55:27] then maybe try it on all of esams-upload, so that we can hope to see a broader significant effect in the averages over the next day or so (or not)
[14:56:31] test case added to the miss2pass CR, works as advertised :)
[14:56:40] ema: the naming bikesheds in that patch could maybe use massaging too
[14:56:50] I like the X-MISS2PASS as it's functional and descriptive
[14:57:00] the overloading of X-Next-Is-Cache is kinda confusing
[14:57:21] technically, requests from fe->be (that we don't want to affect) are also requests for which the next origin is a cache
[14:57:32] the frontends just don't happen to set X-Next-Is-Cache, only backends do
[14:58:09] maybe X-Next-Is-Cache just needs a rename in general (which would have to go out in a couple of patches, to paper over the async rollout)
[14:58:33] X-Next-Is-Applayer?
[14:58:59] but then we have no signal that differentiates fe->be from be->be, they'd both lack that header
[14:59:04] (which is what this patch needs)
[14:59:45] really, all requests arriving at the be are cache<->cache :)
[14:59:54] the distinction is whether the client is an fe or a be
[15:00:29] and then we're also (already, before this patch) using X-Next-Is-Cache as an internal signal (within one daemon) for "next destination is another be or the applayer"
[15:01:37] so we could leave X-Next-Is-Cache as an internal-only signal within a backend (like it is now, we just happened to not unset it...)
[15:02:07] and have frontends set something like "X-Client-Is-Frontend" or whatever on their request to the local be, giving us the other cross-cache signal we need, separately
[15:02:47] or another more abstract and logical version would be to have the frontend set something like "X-Client-DC: esams" indicating which DC the request entered the front edge through.
[15:03:03] and make the conditional on whether X-Client-DC != current dc, in the backend when caring about miss->pass
[15:03:38] we already sort of have that data in the X-DCPath header too, if we regexed for the first entry
[15:04:59] e.g. in the backend code: if (X-DCPath !~ "^<%= @::site %>(,|$)") { this_is_remote_be_be_traffic; }
[15:05:50] that's just trying to be general and make it more future-adaptive, I guess really "^<%= @::site %>$" would do, that's what fe->be traffic looks like on entry
[15:06:21] oh, we don't set X-DCPath in frontends heh, only backends
[15:06:27] I guess to avoid the duplication :P
[15:06:58] so, another way to think of it is that in our current VCL the mere existence of X-DCPath on reception implies be->be traffic, too
[15:09:56] back on the other topic, an esams upload cache has somewhere around 80K "active" connections in LVS at any time (although some of those are surely idle/gone, LVS lingers them in its accounting a while past their natural life I think)
[15:10:57] maybe the varnish-level number is more reliable
[15:11:43] yeah so varnish "client connections" is more like half that value
[15:11:49] in established state, anyways
[15:13:03] the c_reqs number is more like 2.3K
[15:13:46] it would only take about 5MB per client at 2.3K parallel reqs, to cause a 12GB memory spike
[15:14:31] it's possible maybe fetch_maxchunksize has to be a lot smaller before it has a notable effect on this
[15:14:51] maybe we don't commonly use the whole max value presently anyways, as we know statistically most requests are smaller (sub-megabyte) by count.
[15:19:22] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3600771 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs1007.eqiad.wmnet'] ``` The log can be found in `/var/log/wm...
[15:21:51] fe nuked_objects rate seems to have gone down since tweaking fetch_maxchunksize (at 14:39)?
[15:21:58] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&panelId=20&fullscreen&var-server=cp3034&var-datasource=esams%20prometheus%2Fops&from=now-6h&to=now
[15:22:21] !log manually setting fetch_maxchunksize=16MB via varnishadm on cp3034 fe+be instances
[15:22:30] may as well go big to confirm whether this helps or not :P
[15:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:21] we're still in the ballpark of 1/8 link bdp, so even if that's a buffering limit for that, it's not the end of the world for a single stream
[16:41:58] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3601051 (10RobH) a:05RobH>03Cmjohnson Ok, now we are in a bad state. We are trying to remotely enter the Bios, and get the following when telling it to enter bios on vsp: ``` R...
[16:44:11] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3601088 (10BBlack) Also, the NIC firmware update only applied to ports 2+3, but not ports 0+1. I don't suspect NIC firmware level was a leading candidate for the fix anyways, but ha...
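(A minimal VCL sketch of the "X-Client-DC" idea floated above; the header name, the hook points and the X-MISS2PASS action are illustrative only, taken from the conversation rather than from the actual patch, and the <%= @::site %> ERB expansion is the one already used in the existing templates:)

# frontend VCL, sketch: tag the request with the DC it entered through,
# clearing any client-supplied value first
sub vcl_recv {
    unset req.http.X-Client-DC;
    set req.http.X-Client-DC = "<%= @::site %>";
}

# backend VCL, sketch: a request whose X-Client-DC differs from the local
# site did not come from a local frontend, so it is be->be traffic and
# eligible for the miss->pass treatment
sub vcl_recv {
    if (req.http.X-Client-DC && req.http.X-Client-DC != "<%= @::site %>") {
        set req.http.X-MISS2PASS = "1";
    }
}

(The alternative mentioned above, keying off the mere presence of X-DCPath on reception, would avoid adding a new header at all.)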
[17:13:38] ema: I don't think the fetch_maxchunksize is making any realistic difference, at least not down to 16M on 3034 as a single-point example
[17:36:30] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3601285 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs1007.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['lvs1007.eqiad.wmnet']) ```
[17:51:11] !log cp3034 - reset fetch_maxchunksize to 0.25G (default) for fe+be
[17:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:32] 10Traffic, 10Discovery, 10Maps, 10Maps-Sprint, 10Operations: Make maps active / active - https://phabricator.wikimedia.org/T162362#3601731 (10debt)
[19:23:51] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3601816 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs1007.eqiad.wmnet'] ``` The log can be found in `/var/log/wm...
[19:27:51] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3601825 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['lvs1007.eqiad.wmnet'] ``` The log can be found in `/var/log/wm...
[19:30:11] 10netops, 10DC-Ops, 10Operations, 10ops-eqiad: two switches have same serial in racktables - https://phabricator.wikimedia.org/T175737#3601841 (10RobH)
[20:14:54] 10Traffic, 10Analytics, 10Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3602063 (10Jan_Dittrich) Is there some progress on this issue? The last experiments I was in contact with (and responsible for thinking up) were still hack-ish. I wonder if packaging one of the hac...
[20:27:02] 10Traffic, 10Analytics, 10Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3602138 (10Nuria) We will not be doing more work towards this as statistically we did not find a more privacy-conscious way to do ab testing than the one event-logging-based experiments offer. You...