[00:57:27] netops, Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3196333 (faidon) This just got in: > The fix for PR 1238906 has been implemented in the main Junos release and would be available from 14.1X53-D43 onwards. This releas...
[06:30:29] Traffic, Operations, Patch-For-Review: Huge increase in cache_upload 404s due to buggy client-side code from graphiq.com - https://phabricator.wikimedia.org/T151444#3196558 (ema) Open→Resolved a: ema 404 rate [[https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-code...
[07:12:04] I've added transient storage usage at the bottom of varnish-machine-stats
[07:12:07] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3041&var-datasource=esams%20prometheus%2Fops&from=now-24h&to=now
[07:12:51] peaks of ~30G, both for frontend and for backend transient storage
[07:14:38] in other news, we've had two brief 503 spikes this morning in esams text: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&orgId=1&from=1492668677032&to=1492670910330&var-site=esams&var-cache_type=text&var-status_type=5
[07:15:31] which also seem to be due to the same old 'http first read error: EOF' issue
[07:16:05] I've grabbed a few of those in ~ema/2017-04-20-503.log on all text-esams hosts for later inspection
[07:18:43] cp3032 had a spike in backend connections to codfw at 6:30ish, which correlates with the first 503 spike
[07:20:14] and cp3043 had a similar spike but at 6:40ish, correlating timewise with the second 503 spike in the graph above
[07:22:03] so yeah, those two are the hosts that caused the two 503 spikes, and sure enough mailbox lag is involved
[07:22:30] we didn't get alerted though as it was too brief
[07:55:52] netops, Operations, ops-eqiad: Spread eqiad analytics Kafka nodes to multiple racks and rows - https://phabricator.wikimedia.org/T163002#3196663 (elukey)
[07:56:41] netops, Operations, ops-eqiad: Spread eqiad analytics Kafka nodes to multiple racks and rows - https://phabricator.wikimedia.org/T163002#3183025 (elukey) >>! In T163002#3193747, @ayounsi wrote: >> if possible to migrate kafka1022 > I believe you mean kafka1020 Definitely, fixed the task's description...
[08:31:15] Traffic, netops, DC-Ops, Operations, Patch-For-Review: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3193493 (akosiaris) Wrong patch above, please ignore.
[08:59:21] uuh, wait! we got the varnishstat output wrong yesterday wrt transient storage
[08:59:40] the first column is the sum of all allocations since startup, not the amount of bytes currently allocated
[09:00:04] so on cp2002:
[09:00:12] $ varnishstat -1 -n frontend | grep Transient.c_bytes
[09:00:12] SMA.Transient.c_bytes 9875791789343 14852154.76 Bytes allocated
[09:00:34] the second column is the per-second average since startup; the gauge for transient storage currently in use is g_bytes
[09:00:54] ~14M/s, in this case :)
[09:02:15] also, a few lines above, cp3041 clearly doesn't use 30G at peak but rather 30M
[09:04:15] and the idea of using random backend selection at the frontend layer to avoid serialization doesn't seem viable as we don't yet know the object CL when the request comes in
[09:04:51] given all that, I'd say we can safely use hfp (and monitor transient storage growth)
[09:44:03] Traffic, Operations, Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3197012 (ema) The vast majority of objects ending up in bin0 (<16K) are actually very small: most are < 1.5K. Thus, they might be causing contention/fragmentat...
[09:54:01] I've added our discussion to the phab ^ and prepped this using hfp: https://gerrit.wikimedia.org/r/#/c/349188/
[09:54:19] (but am willing to be convinced it's a bad idea) :)
[12:37:49] ema: makes sense. maybe also add a filter for beresp.status==200?
[12:38:34] yep!
[12:39:47] bblack: OK to revert the X-Orig-CT workaround, meanwhile? https://gerrit.wikimedia.org/r/#/c/348699/
[12:40:11] yeah
[13:16:13] 1/2 of the upload-codfw hosts are mailbox lagging, that's not great
[13:16:21] yeah
[13:17:16] FWIW codfw only has 10x hosts, eqiad 11, esams 12 (but currently 10)
[13:17:56] I think this is the time of day when even eqiad tended to have temporary lag that didn't ultimately grow out of control though
[13:18:26] (just due to load in US daytime)
[13:21:06] did we figure out what we need to ask remote hands for those two esams hosts?
[13:21:21] I can drop them an email now
[13:21:55] paravoid: they both need their power cords pulled for a min, to get a full idrac/bios reset
[13:22:04] at least, that's our first guess at fixing them
[13:22:11] wasn't one of them a broken DAC?
[13:22:16] paravoid: yup, that would be awesome, task is here: https://phabricator.wikimedia.org/T157537
[13:22:55] can't access both machines' mgmt interfaces
[13:23:32] ema: there's another complication with the pass-small-objects thing
[13:23:36] cp3003 is the one with the broken network: https://phabricator.wikimedia.org/T162132
[13:24:38] ema: the 4-hit-wonder code we're still using on upload-frontend parses the backend response looking for X-Cache-Int to be hit/4 or higher
[13:24:49] oh, so there are 3 systems failed right now/
[13:24:50] ?
[13:25:03] ema: so if BEs start always-missing these small objects, it will confuse the frontend into considering them perpetual one-hit-wonders and not cache them there either
[13:25:49] paravoid: yeah but cp3003 is low-prio and probably complicated (some kind of network interface fail on the host most likely)
[13:25:59] nod
[13:26:13] do we have any decom'ed 10G servers at esams?
[13:26:18] so that we can salvage their DAC
[13:26:38] I'm assuming at present we'll probably not even bother fixing 3003, as it and its cluster-mates will decom sometime in the next couple of months
[13:27:03] bblack: oh wow nice catch. Should we lie in X-CDIS?
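(As an aside, a rough sketch of the frontend logic just described — the hit/4 parsing, plus the size-class exemption proposed a few lines below. The header name, the hit/4 threshold and the 256-byte cutoff come from the log; the hit/N header format and everything else here are illustrative assumptions, not the real puppet VCL:)

    import re

    # Hypothetical sketch of the upload-frontend "4-hit-wonder" admission:
    # the FE only caches an object once the backend's X-Cache-Int header
    # reports hit/4 or higher.
    def frontend_should_cache(x_cache_int, size, exempt_below=256):
        if size < exempt_below:
            # proposed exemption: BEs now pass objects under the size
            # threshold, so they would never accumulate hit/4 -- admit
            # them in the FE regardless of the hit counter.
            return True
        m = re.search(r'hit/(\d+)', x_cache_int or '')
        return bool(m) and int(m.group(1)) >= 4

    assert frontend_should_cache('cp2002 hit/7', size=64 * 1024)
    assert not frontend_should_cache('cp2002 miss', size=64 * 1024)  # the catch
    assert frontend_should_cache('cp2002 miss', size=200)  # exemption applies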
[13:27:20] no, that would cause all kinds of nasty follow-on effects with stats
[13:27:38] mmh
[13:29:21] so we could add an exemption to the 4-hit-wonder code for the same object size class
[13:29:30] (in the FE 4-hit-wonder code itself)
[13:29:41] probably best for now
[13:30:02] longer-term it's better that we get rid of the 4-hit-wonder + 256K cutoff code in the FE and replace with the exp(-size/c) solution
[13:30:13] sounds good
[13:31:53] on that latter bit, I was thinking about how we actually patch that in. his example used non-hfp, but there's some thinking to do there again about serialization effects
[13:32:17] (and hfp would ruin its statistical perfection I think)
[13:34:18] ok, mail sent
[13:39:16] paravoid: thanks!
[14:03:53] maybe the answer to the exp(-size/c) hfp problem is to employ hfp only when the calculation results in a very low probability of entering cache
[14:04:38] if the calculation gives .33, it's kind of ok to stall till it enters, you won't stall much or for very long
[14:05:29] but if the calculation gives say .01, that might be a lot more stalling for an object that could be hot (keep in mind exp(-size/c) by its nature knows nothing about hotness)
[14:06:45] of course defining a floor percentage chance of cache entry for hfp (which more or less kills the statistical magic below that floor)...
[14:07:04] is really not much different than setting a size ceiling like we do today for >=256KB hfp
[14:07:51] it might be simpler to keep the size ceiling as it is and do the exp(-size/c) after it without hfp as a better replacement for 4-hit-wonder that still scales relatively to object size within its bounded range
[14:08:27] (and confirm our static upper size cutoff does work out such that all exp(-size/c) have reasonable odds of entering the cache before stalling becomes an issue)
[14:12:18] FWIW, given the suggested tuning of "c" relative to FE mem size, a 256K object has 0.08% chance of cache entry, which is pretty far off the mark for the above
[14:13:16] 128K object is ~3% odds, 64K object is ~17% odds
[14:13:41] 32K is 41%
[14:14:03] 16K is 64%
[14:14:26] so a cutoff of 64K might make more sense with that scenario than 256K
[14:15:04] or maybe "whatever is about 10%"
[14:16:38] (which is about 83K)
[14:23:04] or maybe there's other ways to break up this issue using hash_ignore_busy, I donno
[14:25:10] (I wonder how horrible it would be if we turned on hash_ignore_busy for all FE traffic? the effect would be that if N requests for the same object arrive at a single FE before the first completes (and caches or hfps or whatever), all N get forwarded to the backend as separate requests without coalescing)
[14:26:11] and then the local backend would coalesce them of course
[14:26:19] it's just a bunch more local fe->be traffic potentially
[14:33:50] netops, DC-Ops, Operations: Interface errors on pfw-codfw:xe-15/0/0 - https://phabricator.wikimedia.org/T163447#3197764 (ayounsi)
[14:34:43] I donno, it's all so complicated :P
[14:34:47] :)
[14:35:10] I'm not sure I understand the point about hfp ruining the exp magic
[14:35:41] if you hit a hfp object then that request already entered the cache, for some definition of entering
[14:35:55] Traffic, netops, DC-Ops, Operations, Patch-For-Review: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3197781 (ayounsi) a: ayounsi
[14:37:13] well no, forget that, I'm simply not getting the point about hfp ruining the magic :)
[14:38:55] perhaps that once the hfp object has been created, successive requests for that object aren't "taken into account" by the formula because it will just pass?
[14:39:34] yeah. the magic is that exp(-size/c) scales smoothly according to size and hotness statistically, even though an individual request isn't operating on any hotness information
[14:40:04] so e.g. if a 128K object has 3% odds of entering cache according to exp(-size/c):
[14:40:25] if it's a 1-2-hit-wonder, odds are pretty decent it never enters cache (without relying on magic like parsing hit/N from the backend)
[14:41:08] if that 128K object is getting spammed with a huge request load, let's say it gets 1000 requests in the space of a minute, odds become good of entering the cache fairly quickly
[14:41:17] we could come up with more magic for the hfp object's ttl
[14:41:42] to complicate things a bit more :)
[14:41:53] but if we're creating an hfp object when the odds are below some floor (for fear of stalling/serializing a hot object), we're killing that
[14:42:04] we see 3% and just go hfp right away, we effectively treat 3% as 0%
[14:42:20] (or similarly, we see 128K+ and hfp right away, effectively treating as 0%)
[14:43:32] and yeah, we could tune the hfp ttl
[14:43:59] but we're still just tuning the problem, and mucking with the elegantly mathematically-tuned parametric behavior of the original exp(-size/c)
[14:44:37] all of which we could avoid with hash_ignore_busy on the frontend
[14:45:13] right, then there's no need for hfp objects (their only purpose is to enable coalescing to still work in the general case)
[14:45:34] but we're taking the tradeoff of never coalescing for fe->be, which might be a pretty big penalty
[14:46:55] unless...
[14:47:18] so another way we could make this all work, is if there's a way to convert hfp-passes back into miss->enter-cache
[14:48:01] the idea there would be on a truly-new miss, we apply exp(-size/c) odds of cache-entry. if cache-entry fails, we create an hfp.
[14:49:10] but when a request comes in and hits the hfp object and does an uncoalesced pass, we hook into it and still do the same exp(-size/c) magic again. if the random chance succeeds, we destroy the hfp object and cache the response as it comes through (effectively turning coalescing back on)
[14:49:49] then the algorithm's behavior is preserved over all the requests, and they temporarily hfp in parallel until some random one wins the cache-entering lottery and makes it a hit
[14:50:07] I think a way to do that would be changing the hashing for the object
[14:50:19] (which would also mess with the chashing though)
[14:50:21] I just don't know if we can hook into the response side of an hfp-hitting request and convert it back into miss->cache-entry sort of behavior on the fly
[14:50:45] (based on re-running the math alone, which we could do from any vcl sub)
[14:51:39] maybe it's possible? (without changing hashing)
[14:55:22] netops, DC-Ops, Operations: Interface errors on pfw-codfw:xe-15/0/0 - https://phabricator.wikimedia.org/T163447#3197830 (Jgreen) If we're just talking about a <1m traffic hiccup then it's fine to do anytime.
[14:57:06] hash_always_miss?
[14:58:19] yeah it's not possible
[14:58:40] (to do it like I was thinking above, at e.g. vcl_backend_response time on an hfp-hitting fetch)
[14:59:00] but you're right, it might be possible to conditionally hash_always_miss to evade the hfp
[14:59:13] well, maybe
[14:59:20] we don't have size info in vcl_recv
[14:59:38] Traffic, netops, DC-Ops, Operations, Patch-For-Review: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3197843 (ayounsi)
[14:59:52] indeed
[15:00:56] hmmm
[15:01:14] you said before hfp objects have headers? so they have content-length stored?
[15:01:42] :)
[15:01:56] maybe we do create hfp if exp(-size/c) fails cache entry, and then on the hfp-hitting path, when we get to vcl_pass we could look at obj.http.content-length to re-run the size-based statistical check
[15:02:27] with vtc I've confirmed that the size of a hfp object changes according to the headers
[15:02:40] and if we decide to convert it into a miss->cache-entry, we set req.hash_always_miss and return (restart), and also set req.http.X-WTF or whatever to tell us to assume we passed exp(-size/c) and not re-check it on that request
[15:02:41] netops, DC-Ops, Operations: Interface errors on pfw-codfw:xe-15/0/0 - https://phabricator.wikimedia.org/T163447#3197878 (ayounsi) To clarify, that's on Node1, which might be named pfw2-codfw, ( AJ5112AA0049 ) FPC0, PIC0.
[15:04:21] or in other words:
[15:06:55] sub vcl_backend_response { if (!bereq.http.X-Assume-Entry && drand48() > exp(-1 * beresp.http.Content-Length / N)) { set beresp.uncacheable = true; /* creates hfp */ } }
[15:08:31] sub vcl_pass { if (this_is_due_to_hfp && drand48() <= exp(-1 * obj.http.Content-Length / N)) { set req.http.X-Assume-Entry = 1; set req.hash_always_miss = true; return (restart); } } /* note the <=: reconvert to a miss with the same exp(-size/c) odds */
[15:08:51] something like that, assuming we can access obj.* in vcl_pass during an hfp, or even detect that we're in vcl_pass due to hfp
[15:09:15] and assuming that we can set req.hash_always_miss in vcl_pass
[15:10:10] (I've seen it being set only in vcl_recv, but it's entirely possible that it can be set elsewhere)
[15:10:27] however obj.* isn't available in vcl_pass, only req.
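(To make the statistics above concrete: the log never states "c", but back-solving exp(-256K/c) = 0.0008 from the quoted odds gives c ≈ 36K. A minimal Python sketch, under that assumption, of how the per-request lottery produces the quoted table — and of why re-rolling it on every hfp hit preserves the algorithm: a hot object is expected to convert to a cache entry after roughly 1/exp(-size/c) passes:)

    import math
    import random

    C = 36 * 1024  # assumed: back-solved from the quoted 0.08% odds at 256K

    def wins_entry_lottery(size):
        """One request's cache-admission roll, as in the VCL sketch above."""
        return random.random() <= math.exp(-size / C)

    # Reproduce the quoted odds (256K -> 0.08%, 128K -> ~3%, 64K -> ~17%,
    # 32K -> 41%, 16K -> 64%), plus the ~83K point where odds are ~10%.
    for kb in (256, 128, 83, 64, 32, 16):
        p = math.exp(-kb * 1024 / C)
        print(f"{kb:>3}K: {p:7.2%} entry odds, ~{1 / p:,.0f} requests to enter")

    # The hfp->miss reconversion idea: every request that hits the hfp
    # object re-rolls the same lottery; the first winner destroys the
    # hfp and caches the object, so hot objects convert quickly while
    # cold large ones keep getting passed.
    size, hfp_passes = 128 * 1024, 0
    while not wins_entry_lottery(size):
        hfp_passes += 1  # served as an uncoalesced pass in the meantime
    print(f"128K object entered cache after {hfp_passes} hfp passes")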
[15:12:23] merging the upload-backend-no-tiny-objects patch now
[15:13:03] yeah there's a lot of ifs in there :)
[15:13:23] I know req.hash_always_miss only applies at vcl_recv time, but I think we can effectively set it from vcl_pass
[15:13:57] if not as shown above through the restart, then we can just look for req.http.X-Assume-Entry or whatever that we set before (restart) and use that to conditionally set hash_always_miss during re-recv
[15:14:06] (as restart does not change any alterations made to the request object)
[15:14:23] (at least, not to headers!)
[15:14:51] there must be somewhere we can do early examination of obj.http.* on a hit-for-pass
[15:19:19] or, we can dump this insanely complicated conversation and say punt this level of fine-tuning for a post-varnish world when everything changes anyways :P
[15:19:28] haha :)
[15:25:30] Traffic, Operations, media-storage, Patch-For-Review, User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3197947 (ema) Open→Resolved a: ema The user-facing issue h...
[15:28:33] wat?
[15:28:34] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&to=now&var-server=cp3037&var-datasource=esams%20prometheus%2Fops&panelId=42&fullscreen&from=now-3h
[15:28:53] do we have two transient storages on frontends?
[15:30:54] or more likely, some prometheus-related problem
[15:31:20] the query there is: rate(varnish_sma_c_req{instance=~"$server:.*",layer="frontend"}[5m])
[15:31:33] legend format: {{layer}}
[15:32:36] yeah it's me being retarded, I forgot type=Transient in the query
[15:32:42] ok
[15:33:21] do we want a metric that tracks allocated bytes?
[15:33:27] (more like a meter than a counter)
[15:34:06] or I guess we don't get that from the varnishstat output we parse for prometheus since it's not 1-shot?
[15:34:48] currently we plot c_bytes/c_freed
[15:34:56] there's g_bytes too
[15:35:33] g_{bytes,alloc,space}
[15:36:30] Bytes outstanding / Allocations outstanding / Bytes available respectively
[15:37:39] so yeah we do get g_bytes from prometheus which seems to be interesting
[16:02:04] netops, Operations, ops-eqiad, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3198126 (akosiaris)
[16:02:07] netops, Operations, ops-eqiad, Patch-For-Review: switchover icinga.wikimedia.org from einsteinium to tegmen - https://phabricator.wikimedia.org/T163324#3198124 (akosiaris) Open→Resolved After a long day and a rabbithole, along with some unwanted pages, this is done.
[16:11:42] Traffic, netops, DC-Ops, Operations, Patch-For-Review: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3198174 (ayounsi) Open→Resolved papaul replaced the SFP on the switch side. Stress-testing done with bblack, no more interface errors.
[16:15:32] cp2017 has just been bursting a few 503s
[16:18:04] and again, I'll depool it for a bit
[16:19:14] what sucks is every time we do one of those preventative depools, it shifts 1/N objects' traffic to elsewhere in the chash ring, and the pattern disturbance thus exacerbates any borderline problem on the other ones
[16:20:00] any idea how the 256-byte thing is working out? notable bad statistical impact on miss/hit rates, etc?
[16:21:24] hitrate-wise I don't see big shifts (yet, perhaps it's still early?)
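(For reference, the corrected transient-storage queries from the exchange above, kept as strings: the first is the quoted query with the forgotten type="Transient" selector restored; the g_bytes metric name is an assumption based on the exporter's c_req/c_bytes naming, not confirmed in the log:)

    # Grafana/Prometheus queries for the transient panel.
    # ALLOC_RATE: the quoted rate query plus the missing type selector.
    ALLOC_RATE = ('rate(varnish_sma_c_req{type="Transient",'
                  'instance=~"$server:.*",layer="frontend"}[5m])')
    # IN_USE: bytes outstanding, i.e. transient storage currently in use
    # (assumes the gauge is exported as varnish_sma_g_bytes).
    IN_USE = ('varnish_sma_g_bytes{type="Transient",'
              'instance=~"$server:.*",layer="frontend"}')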
[16:22:54] my concern I guess is that the objects under the threshold are too numerous to stay in FE mem cache and thus we end up not caching them very effectively at all
[16:23:08] but yeah, I guess we'd have to ban them out of the backends first to even see if that's true, or wait for them to expire out
[16:23:23] or wait for a weekly restart :)
[16:23:51] that still only gives us 1/N from the set that each FE is chashing out to backends though
[16:24:10] right
[16:24:25] meanwhile we're kinda hurting on the mailbox thing, and we may not see any relief (if this works at all) till we ramp up to ~1K for the cutoff
[16:25:29] so it might be worth taking a guess and staring at some varnishlog traffic to semi-validate assumptions and then go ahead and Be Bold and turn it up to 1K and ban sub-1K objects off the BEs and see what happens
[16:29:55] oh OK, so going straight to 1K instead of increasing gradually
[16:31:55] yeah and not waiting around 3 days to see the impact of 256 either, if we can see from logs that it seems to be working reasonably well, and that there's not a deluge of sub-1K cache misses in the frontends in general due to lack of space + eviction
[16:32:07] I think those are reasonable assumptions anyways, it's more about validation than anything
[16:33:23] meanwhile lag on cp2017 finally started to go down
[16:33:47] will wait for lockops too before repooling
[16:34:37] and then I'll call it a day and have a beer 'cause I'm tired! :)
[16:34:57] ok :)
[16:35:49] let me know if you notice anything weird, otherwise tomorrow morning I'll semi-validate assumptions and bump to 1K
[16:36:04] sounds good
[16:37:33] bblack: I was looking at the switchdc puppet output and I think I've found something that we could grep without having to manually check the output
[16:38:36] for example, when running on the codfw hosts the diff in the puppet output included things like:
[16:38:43] + backend be_appservers_svc_codfw_wmnet {
[16:38:48] + appservers.add_backend(be_appservers_svc_codfw_wmnet, 100);
[16:39:18] and when running in eqiad the removal of the equivalent lines with s/codfw/eqiad/
[16:41:21] so we could add the matching, and if it matches it will continue automatically; if not, it will ask for operator confirmation like yesterday
[16:41:34] would something like this be ok for yuo?
[16:41:45] s/yuo/you/
[16:44:06] yeah I think things like this are possible
[16:44:39] but it's likely that between each iteration of switch-testing, the actual content changes as we continue to evolve the VCL and the mechanisms for controlling it
[16:45:11] oh for sure, but I'm counting that for the next switch we will not need a puppet deploy at all :D
[16:45:51] hah
[16:46:02] well then, there's little point worrying about this topic :)
[16:46:34] just for the switchback, it's simple enough to add, speeds things up and removes an error-prone step :)
[16:46:39] we wouldn't use the scripted method in a real outage anyways. We'd just immediately change the data and deploy it everywhere that's still alive and reachable
[16:46:44] ah switchback, sure
[16:47:12] as long as we're sure we've properly sequenced disable -> merge ahead of time so that we don't cause no-ops that screw up the checking
[16:48:17] sure, and the fallback is just to stop and wait for manual confirmation; I will print the puppet output anyway
[16:48:25] in case it needs to be manually checked
[16:48:57] yeah just keep the regex loose-ish I guess
[16:49:03] and it should work out ok
[16:49:43] cp2011's turn to go nuts :(
[16:49:56] agree, thanks
[16:55:10] bblack: big spike acoming, cp2026 is also on fire
[17:02:37] not much we can do I guess without messing up the other codfw backends
[17:16:20] hmmm
[17:18:38] ema: I'm assuming you haven't done much lately (you should be gone anyways and drinking!). seems like it resolved itself eventually
[17:19:00] well, almost anyways
[17:19:07] bblack: nothing except for staring at varnishncsa and varnish-aggregate-client-status-codes
[17:19:32] the rate has gone down, and interestingly the spike is similar in shape to the one that occurred yesterday at roughly this time
[17:21:26] well not really at this time, shifted by ~1h
[17:23:24] anyways, as it seems to have settled I'll follow the advice and leave :)
[20:17:05] Traffic, Operations: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3199538 (BBlack)
[21:53:07] hey cp3038 is back :)
[21:54:04] and cp3045 is too
[21:54:15] bblack: ^^^
[22:37:32] Traffic, Operations, media-storage, Patch-For-Review, User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3200051 (AlexRus) >>! In T162035#3197947, @ema wrote: > The user-faci...
[22:41:18] Traffic, Operations, media-storage, Patch-For-Review, User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3200057 (matmarex) @alexrus Can you give an example of a file that is...
[22:52:02] Traffic, Operations, media-storage, Patch-For-Review, User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3200084 (AlexRus) >>! In T162035#3200057, @matmarex wrote: > @alexrus...
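(Appendix to the switchdc discussion at 16:37-16:49: one way the loose-regex confirmation step could look. The backend/add_backend line shapes come from the quoted puppet diff; the function names and control flow are hypothetical, not the actual switchdc code:)

    import re

    def puppet_diff_matches(diff, new_dc, old_dc):
        """Loose check: the VCL diff adds the new DC's applayer backend
        and removes the old one, as seen in the quoted puppet output."""
        added = re.search(rf'^\+\s*backend be_appservers_svc_{new_dc}_wmnet',
                          diff, re.M)
        removed = re.search(rf'^-\s*backend be_appservers_svc_{old_dc}_wmnet',
                            diff, re.M)
        return bool(added and removed)

    def confirm_step(diff, new_dc, old_dc):
        print(diff)  # always print the puppet output, as agreed above
        if puppet_diff_matches(diff, new_dc, old_dc):
            return  # expected change: continue automatically
        # fallback: stop and wait for the operator, like yesterday
        input("diff didn't match the expected VCL change; enter to continue: ")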