[04:36:26] 10Traffic, 10netops, 10SRE, 10Performance-Team (Radar): experiment with reenabling compression between applayer's TLS terminators and edge caches - https://phabricator.wikimedia.org/T263288 (10Krinkle)
[08:03:53] 10Traffic, 10SRE: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10ema)
[08:04:09] 10Traffic, 10SRE, 10Patch-For-Review: Add exp cache admission policy parameters to hiera - https://phabricator.wikimedia.org/T279533 (10ema) 05Open→03Resolved After changing `exp_policy_rate` and `exp_policy_base` in hiera for traffic-cache-atstext-buster, the rendered VCL now looks like this: ` +// Incl...
[09:29:49] 10netops, 10SRE, 10ops-codfw: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10ayounsi) Ok, because of this RTF RMA we're going to replace the switch with a spare. @Papaul Let's chat on IRC to figure out what time would work best for you, then we can notify services owne...
[10:46:57] 10netops, 10SRE, 10Patch-For-Review: automatically sample from all FPCs on core routers - https://phabricator.wikimedia.org/T257392 (10ayounsi) 05Open→03Resolved a:03ayounsi One more thing automated from Netbox.
[11:02:58] 10Traffic, 10SRE, 10User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (10ayounsi) Even with the current rate limiting, some crawlers are regularly causing issues, wasting precious SRE time. I'd like to revisit this task to be mor...
[11:24:14] bblack: there's some interesting C behavior I just ran into :)
[11:24:21] #define ADM_PARAM pow(0.0009765625, 0.1) / pow(2.0, -20.3)
[11:24:29] int main() { double x = ADM_PARAM; printf("ADM_PARAM=%f x=%f %f %f\n", ADM_PARAM, x, -40.0/ADM_PARAM, -40.0/x);
[11:24:32] }
[11:24:38] (pastefails aside)
[11:24:51] that code prints:
[11:25:02] ADM_PARAM=645474.242184 x=645474.242184 -103275878.749405 -0.000062
[11:26:24] so essentially just dividing -40.0/ADM_PARAM I get some weird result, but if I first declare double x = ADM_PARAM then the result of the division is correct
[11:29:31] like this: https://phabricator.wikimedia.org/P15260
[11:33:22] that's only true if ADM_PARAM is defined as the result of a computation though, there is no weirdness if I #define ADM_PARAM 645474.242184
[11:43:26] ema: what happens if you #define ADM_PARAM (pow(0.0009765625, 0.1) / pow(2.0, -20.3))
[11:43:29] ?
[11:44:37] cdanis: then it does the right thing
[11:44:52] the preprocessor is very literal :)
[11:45:15] :)
[11:45:47] cdanis: thank you!
[11:46:11] np! I've stumbled over that one myself at some point long ago
[11:46:41] the tricky part is that if you try to printf() the thing, you see what you want to see
[12:24:09] cdanis: it's probably smarter to declare the parameter as a global variable instead of as a define, now that I think about it, isn't it
[12:24:48] otherwise, leaving it as a define, we compute the parameter for each request for no valid reason
[12:25:38] yeah, C type rules are interesting (even for integers, but especially for floats)
[12:27:31] and printf is variadic, which adds another layer of trickiness
[12:28:29] there are still consequences of those rules that I'm learning by trial and error to this day
[12:30:47] e.g.
this one surprised me not long ago, and I bet it's not the first time I've been surprised by the same thing:
[12:32:39] unsigned short x = 16384; // or equivalently on x86_64 and most other platforms: "uint16_t x = 16384";
[12:33:24] unsigned y = x << 1; // undefined behavior, compiler may produce any result after aggressive optimizations or whatever
[12:34:10] because "int" on common platforms is 32 bits wide, the C rules for arithmetic promote the 16-bit unsigned int "x" to a 32-bit *signed* int for math purposes.
[12:34:24] and left shift on a signed int can produce undefined behavior
[12:34:52] (actually not quite in my simple example here, but if the "shift by one" was a shift by more bits and you were expecting it to work sanely, it won't always)
[12:35:57] I've run into some different versions of this a few times over the years, but it never really fully clicked until recently that you can't trust that an unsigned type remains unsigned if it's narrower than the platform's "int" definition, for some reason.
[12:38:20] ("for some reason" was re: it clicking for me. The reason in the C standard for the behavior is clear, of course)
[12:43:17] what I've finally reduced this to for me is now: you really shouldn't use narrower-than-int types in C code, except when they're temporarily useful for reading/writing network data from a buffer, or when you need to optimize a structure for space efficiency, and then be careful to bring them back to int+-sized variables in normal code.
[12:43:45] if you find yourself doing "math" with a narrow-typed variable directly, it's probably a code smell :)
[12:51:19] ema: agreed re: global variable, although the compiler is probably smart enough that there's no need to rely on it being
[12:53:01] bblack et al: I've enabled the exp admission policy on cp5001, just restarted varnish to start from a clean cache
[12:53:18] you can follow the probabilities of stuff being cached with:
[12:53:19] varnishncsa -b -n frontend -q 'BerespStatus eq 200 and BereqMethod eq "GET"' -F 'p=%{VCL_Log:Admission Probability}x r=%{VCL_Log:Admission Urand}x s=%b %{X-Cache-Int}o %s %r'
[12:54:08] which is surprisingly entertaining to do, e.g.: https://upload.wikimedia.org/wikipedia/commons/0/05/Sokolniki_District%2C_Moscow%2C_Russia_-_panoramio_%2827%29.jpg only had a 64% chance of getting cached
[12:56:35] I've used some fairly generous parameters, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/677820
[13:06:54] awesome :)
[13:41:30] cool
[13:49:08] 10Traffic, 10Performance-Team, 10serviceops: Decide on details of progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10Krinkle)
[16:19:02] 10Traffic, 10SRE, 10Patch-For-Review: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10ema) Today I've added [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/files/exp_policy.py | exp_policy.py...
[16:21:06] ema: nice writeup!
[16:21:23] I think wrt caching strategies on upload specifically, it would be interesting to look at both object hit rate and byte hit rate
[16:21:45] we do have enough data in Hive / Turnilo to reconstruct the bytes numbers
[16:34:23] 10Traffic, 10SRE, 10ops-eqiad: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10Cmjohnson) updated the BIOS and submitted Dell ticket You have successfully submitted request SR1056516502.
[17:03:48] yeah
[17:03:54] nice stuff!
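The `p=` (Admission Probability) and `r=` (Admission Urand) fields in that varnishncsa format string suggest the shape of the underlying gate: compare a uniform random draw against the per-object probability. A minimal sketch of such a gate (the function name and RNG choice are mine; the real check lives in VCL):

```c
#include <stdlib.h>
#include <stdbool.h>

/* Probabilistic admission gate: cache an object iff a uniform draw
 * u in [0,1) falls below its admission probability p.
 * p = 1.0 always admits; p = 0.0 never does. */
bool admit(double p)
{
    double u = (double)rand() / ((double)RAND_MAX + 1.0);
    return u < p;
}
```

In the log output above, an object was cached on that fetch exactly when `r` came out below `p`; the 64% panoramio example simply drew a lucky (or unlucky) `r`.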
[17:05:30] so: I think the slower-fill thing does make logical sense, and is ok
[17:06:25] the total-objects increase seems trickier to explain to myself at first glance (after all, we're allowing the caching of larger objects than we did before, so you'd think on net there would be fewer total objects fitting)
[17:08:10] but I think it does make sense in the end. it's the effect of the probability filter, basically. Some large objects will make it in if they're super-popular, but otherwise it's helping to suppress some others, some of the time, and in general it's reducing one-hit-wonders towards the upper end of the scale.
[17:08:49] but still, it's not a huge change from where we were before on the <=256KB cases, as those are at ~80% probability in the new exp settings
[17:11:23] I think we could experiment with params that don't ramp off to zero as fast (by that I mean: params that increase the odds for a 4MB file to something higher than 2%, and give more-reasonable odds for larger cases)
[17:11:34] because if you continue your script to higher values:
[17:11:36] 4096.0 KB 2.78%
[17:11:36] 8192.0 KB 0.0771%
[17:11:36] 16384.0 KB 5.95e-05%
[17:11:36] 32768.0 KB 3.54e-11%
[17:11:38] 65536.0 KB 1.25e-23%
[17:11:53] we may just need a gentler curve
[17:13:01] the meta-question is: what is our metric for when the results look "right"?
[17:14:05] you mentioned before in the ticket "acceptable hitrate penalty", and I think that's certainly one of the metrics (so far, we don't see any real hitrate penalty, so maybe that means we can still tune further in the direction of allowing larger objects more frequently)
[17:15:07] another constraint we can look at here is: if an object of size X became super-popular, how much impact is likely to happen on average before it breaks through the probability gate and gets cached?
[17:17:10] basically the math will say that for an object of size X with probability P, we can be 90% certain it will be cached after F fetches of the object, and F fetches of that object in pass/miss either has or doesn't have acceptable impact in terms of melting the cache software with the transient explosion or whatever.
[17:21:13] bblack: the reason I was thinking about looking at both bytes and objects is that byte hit rate is a good proxy for overall cache 'efficiency' on upload -- I think in general you care more about backhaul-to-core bps than rps
[17:31:25] cdanis: yeah - in general I agree, but in this particular case we're just looking at the frontend in isolation. There, we have less concern about backhaul bps, because it's just to another nearby cache on the same local network.
[17:31:51] but as a *whole*, with both layers together, then yeah the byte hit rate matters a lot for upload
[17:31:58] mmm true, true
[17:33:27] also missing from this conversational context (but e.ma and I had talked about it a bit since he's been back) was my misinterpretation of the likely leading causes of the meltdown in the image fetch incident.
[17:34:16] I was seeing it as likely to be because of the focus of all the frontends on the same backend and overwhelming it, but e.ma made a reasonable data-based point that really it was transient saturation in the frontends that caused the meltdown, not backend request overload.
[17:35:20] where the term "transient saturation" here means that varnish-fe was failing requests because it couldn't allocate space for the pass-mode requests in its transient storage pool.
[17:36:08] (on some level, it doesn't even make sense that it should be allocating transient storage for a pass-mode request, but for whatever Varnish Design Reasons it does, and there are probably some tradeoffs involved in them making that call...)