[09:39:06] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Port remaining scripts depending on varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T131353#2733965 (10ema) 05Open>03Resolved
[09:39:09] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2733966 (10ema)
[10:31:57] 10Traffic, 06Operations: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2734045 (10BBlack)
[10:32:00] 10Traffic, 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review: Move logstash.wikimedia.org (kibana) to an LVS service - https://phabricator.wikimedia.org/T132458#2734043 (10BBlack) 05Open>03Resolved a:03Gehel
[10:43:35] bblack: when you get a minute, I have https://gerrit.wikimedia.org/r/316742 to start rolling out varnish-exporter
[10:43:56] ema: thanks!
[10:44:07] likely on monday I'll merge it
[10:44:36] it isn't going to do anything by itself unless asked for metrics over http
[10:45:45] godog: eventually I assume we'll move it to role::cache::base (or something similar and less ugly in future refactorings) once it's usable for everything
[10:46:33] bblack: yeah
[10:47:07] I don't know of good mechanisms for partial rollouts in puppet ATM, maybe with the RFC and feature flags
[10:48:28] if it works for now it works for now
[10:48:38] the cache puppetization is still a half-refactored mess :)
[10:50:21] heheh yeah many different use cases
[12:08:16] ema: ok I've got a strange case from #wikimedia-tech
[12:08:31] https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png
[12:08:40] ^ fails in various browsers, fetching from esams
[12:08:46] with a content encoding error
[12:08:52] ERR_CONTENT_DECODING_FAILED
[12:09:01] if I hack my dns to point at esams upload-lb, I can repro with curl (--compressed fails, uncompressed is fine)
[12:09:11] curl: (61) Error while processing content unencoding: invalid stored block lengths
[12:09:24] same cache hit in failing and non-failing cases
[12:09:33] so same cache object in frontend
[12:09:37] could this be related to the jenkins issue?
[12:10:01] that switch?
[12:10:01] I don't think so
[12:10:24] the jenkins issue ended up not being esams-specific, we can repro everywhere, but only with jenkins, and current theory is "blame jenkins"
[12:10:30] i can repro it too
[12:10:47] ah right, it wasn't esams-only in the end
[12:11:11] so, anyways, same cache object in both cases
[12:11:23] two interesting things stand out that make this odd:
[12:11:36] 1) why do we try to gzip png at all, even if the browser accepts gzip?
[12:12:05] 2) is it stored in the cache gzipped or ungzipped? (one of the two is getting transformed on the final outbound side in varnish-frontend)
[12:12:51] our do_gzip filter for VCL is:
[12:12:53] if (beresp.http.content-type ~ "json|text|html|script|xml|icon|ms-fontobject|ms-opentype|x-font")
[12:13:16] so I *think* varnish won't try to do_gzip this one, but swift might be gzipping it before it reaches caches?
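
(A quick shell-level sanity check on question 1: does the object's Content-Type even match the do_gzip whitelist quoted above? The curl/awk plumbing below is only an illustrative sketch, not anything from the production tooling.)

    # Does this object's Content-Type match the VCL do_gzip whitelist?
    url='https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png'
    ct=$(curl -sI "$url" | awk -F': ' 'tolower($1) == "content-type" {print $2}' | tr -d '\r')
    if printf '%s' "$ct" | grep -Eq 'json|text|html|script|xml|icon|ms-fontobject|ms-opentype|x-font'; then
        echo "do_gzip would apply to: $ct"
    else
        echo "do_gzip should not apply to: $ct"   # expected result for image/png
    fi
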
[12:13:18] bblack: it might be related to the issue that I am seeing with varnishkafka/varnishlog
[12:13:40] it will be interesting to trace the gzipness through the stack in any case
[12:14:27] there was the semi-related issue the other day where there was a temporary report of a cached image object having mime-type text/html on upload
[12:14:31] (which would be gzippable)
[12:15:06] it could be that bad gzip encoding is a fallout of some wrong mimetype issue infecting the cache in this case, too
[12:15:34] https://phabricator.wikimedia.org/T148497
[12:15:47] ^ the mimetype thing the other day, which self-resolved (probably on cache object eviction/expiry)
[12:15:55] [15:06] ah I have a local repro with curl, if I use esams!
[12:15:55] [15:08] Page can't be displayed on Windows in IE/Edge locally
[12:15:56] [15:09] But the image does display in both IE and Firefox 49 on a remote server I RDP to
[12:15:56] [15:13] The server I try this from is located in US though, so uses ulsfo and not esams
[12:16:31] so, not esams specific
[12:17:05] we can obviously try purging this image, but I'd rather keep the reproduction for now first
[12:17:14] yeah
[12:18:24] godog: ^ in case you have insight on either issue re: swift and mimetype or gzip-encoding issues
[12:21:09] I'm physically in IL, nearest dc is esams/europe with which the image doesn't display (assuming nearest dc is contacted by default) from the local machine, but from the remote US server which supposedly uses ulsfo the image does display and so doesn't have the problem
[12:21:40] so [15:16] so, not esams specific - how isn't it, if it apparently is?
[12:22:02] oh sorry, I misread
[12:22:08] so, still esams-specific for now
[12:22:23] most likely it's just that esams is the one that happened to cache a bad copy, though, as opposed to esams-network stuff
[12:23:14] what's the curl command you used?
[12:23:32] If the issue that I am seeing with Varnishlog/Varnishkafka is related, I have a lot of upload links in my text files. From my side, the issue is present in all the upload caches
[12:24:22] so one bit of evidence here: if I curl that url from Swift presently with --compressed, it doesn't try to compress the response
[12:24:35] so varnish (or I guess potentially, nginx?) is the compressor here
[12:27:27] so, it's almost certainly a varnish-fe problem...
[12:27:50] I can also do curl --compressed directly against the cp3037 backend (port 3128) and get it in uncompressed form and valid
[12:28:05] if I do so at port 3127 (varnish-fe, without nginx), I get it compressed and corrupted
[12:28:28] do we have a phab ticket for this yet?
[12:28:38] I'll create one otherwise
[12:28:45] not yet
[12:29:00] ok
[12:29:37] I thought our do_gzip rules were layer-consistent
[12:29:53] it seems odd to me the frontend compressed my response and the backend doesn't (it's the backend the hit came from originally, too)
[12:30:18] also, same result on a few other random esams frontends (they all cached it from cp3037 backend)
[12:31:41] another interesting datapoint: both the compressed and uncompressed results have the same content-length header
[12:31:51] compressed should be different (encoded length)
[12:32:31] if I had to guess, it's quite likely that the only problem with that response is the CE: gzip header, and it's still sending the normal uncompressed PNG data
[12:32:41] and PNG data of course fails to gunzip, hence CE error
[12:33:00] how to test that...
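
(One rough way to test that theory from the shell, sketched here; the --resolve hint and the cp3037 host/port details in the comments are assumptions based on the discussion above, not commands that were actually run.)

    # Fetch the object once with and once without Accept-Encoding: gzip, keeping the raw
    # bytes (no --compressed, so curl won't try to decode), then compare. If only the
    # CE:gzip header is bogus, both bodies should be identical, valid PNGs.
    url='https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png'
    # to force a particular DC, something like: --resolve upload.wikimedia.org:443:<esams upload-lb IP>
    curl -s -H 'Accept-Encoding: gzip' -D with_ae.hdr -o with_ae.bin "$url"
    curl -s -D without_ae.hdr -o without_ae.bin "$url"
    grep -i '^content-encoding' with_ae.hdr without_ae.hdr
    sha1sum with_ae.bin without_ae.bin    # same hash => the body was never actually gzipped
    file with_ae.bin                      # "PNG image data" vs "gzip compressed data"
    # A single cache layer can be queried directly the same way, e.g. (hostname assumed):
    #   curl -s --compressed -H 'Host: upload.wikimedia.org' -D - -o /dev/null \
    #     http://cp3037.esams.wmnet:3128/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png
    #   (:3128 being varnish-be, :3127 varnish-fe without nginx, per the ports above)
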
[12:36:01] wget seems to work fine
[12:38:08] no CE:gzip for wget, I'm guessing it's not asking for it
[12:40:02] oh I can get CE:gzip from wget
[12:40:17] but only at esams
[12:40:33] 10Traffic, 06Operations: ERR_CONTENT_DECODING_FAILED on certain png images from varnish-fe - https://phabricator.wikimedia.org/T148830#2734202 (10ema)
[12:41:05] if I hit eqiad with wget --header "Accept-Encoding: gzip", I get a valid and not-compressed result
[12:41:27] if I hit esams with the same, I get a CE:gzip response, which wget goes and saves to disk as a valid PNG, same sha1
[12:41:54] I'm guessing that supports my earlier theory: the png data is fine, the CE:gzip header coming out of esams frontends is wrong
[12:42:02] (which it only outputs in response to AE:gzip)
[12:42:38] so we've got two potential issues here I think:
[12:43:08] 1) Why did varnish ever try to compress this (or is trying to compress it in realtime) in response to AE:gzip, given it doesn't match mimetype for do_gzip
[12:43:11] 10Traffic, 06Operations: ERR_CONTENT_DECODING_FAILED on certain png images from varnish-fe - https://phabricator.wikimedia.org/T148830#2734215 (10ema)
[12:43:20] 2) Why is it sending a CE:gzip header with ungzipped content
[12:49:45] yes, it's almost certainly just a bad CE:gzip header on the output
[12:49:57] also, curl seems to output CE:gzip no matter what now, at least on some frontends, for that URL
[12:50:06] and why doesn't it happen on the backend?
[12:50:08] but with --compressed it tries and fails to decode it.
[12:50:23] without --compressed, it still sees a CE:gzip header and ignores it and stores the PNG data to disk fine
[12:50:48] ema, varnish-be or varnish-fe, there is an inconsistency between the task title and the description
[12:51:28] that makes it slightly less-confusing. I don't think varnish is varying its behavior in response to AE:gzip
[12:51:59] it simply has an otherwise-normal cache object that's not compressed, that it doesn't treat as compressed, but which happens to have an errant CE:gzip response header attached to it that shouldn't be there
[12:52:26] arseny92: the error happens when hitting varnish-fe, the object is stored (varnish-be-wise) on cp3037
[12:53:12] cp3037 isn't serving the errant CE:gzip header, and is still hitting the same object the FE layer originally got I think
[12:53:38] bblack: note that causing a miss seems to work now for me, appending ?x=banana
[12:53:40] based on Age:
[12:53:56] same age?
[12:54:00] no, of course not
[12:54:23] so, banana test means it normally works (it's not a persistent data-sensitive issue or anything)
[12:54:40] something random intermittent happened that caused them to get stored with CE:gzip at some past point in time...
[12:55:00] very interesting
[12:55:20] I was thinking of a temporary glitch in swift, but then the varnish-be copy should also be broken
[12:55:40] s/should/would/
[12:55:46] I've tried unique queryparams now both initially with AE:gzip and initially without AE:gzip, and neither one comes back with the borked CE:gzip on the first miss
[12:57:43] I can't induce it any other way with queryparams either.
I tried pushing both of those (the initially-AE:gzip and initially-not) through several times until they were cache hits, and then querying both as cache hits with or without AE:gzip
[12:57:47] still no repro
[12:59:18] mmh I get multiple frontend misses for the same unique query param until, after a while, I get a hit
[12:59:24] curl -v --compressed https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png?x=banana_joe -o /dev/null 2>&1|grep X-Cache: ; curl -v --compressed https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Relief_map_of_Serbia.png/272px-Relief_map_of_Serbia.png?x=banana_joe -o /dev/null 2>&1|grep X-Cache:
[12:59:27] yeah that's our N-hit-wonder code
[12:59:29] < X-Cache: cp1048 miss, cp3036 miss, cp3047 miss
[12:59:31] < X-Cache: cp1048 miss, cp3036 hit/1, cp3047 miss
[12:59:35] ah!
[12:59:38] nice
[13:00:06] so there's a few "obvious" pathways to how this went wrong:
[13:00:41] (given cp3037 appears to still hold the same object that the frontends pulled from, and serves it fine now...)
[13:01:18] Subject: DegradedArray event on /dev/md/0:cp3021
[13:01:23] 1. cp3037 temporarily emitted CE:gzip for this cache hit object at some earlier time, then later stopped doing so for the same object (seems crazy)
[13:01:26] I don't see any tasks for that
[13:01:36] oh sorry, you are in the middle of something else -- nevermind :)
[13:02:41] 2. cp3037 never emitted CE:gzip, but at some past point in time when the frontends were first pulling it through into their caches, *they* applied the CE:gzip header to their frontend objects, but the condition that caused that is gone now (wtf?)
[13:03:36] could it be that swift temporarily emitted CE:gzip, the object got cached on cp3037's varnish-be, then on the frontends, then it got evicted by cp3037 and re-pulled properly?
[13:03:57] possibly
[13:04:21] it would involve some crazy timing/eviction coincidences because the Age has remained consistent, but this is obviously a corner-case issue or we'd see more reports on more objects
[13:05:51] in any case, in the present-tense, the only thing that's wrong, I think, is the errant CE:gzip header attached to the object in a bunch of esams frontends, which can be purged and will probably reset fine
[13:06:05] we're trying to guess at the history of how that state came to be, but we can't yet repro the conditions
[13:07:13] also, the age on the frontend objects is ~6 days, and we know we cap TTL shorter than that
[13:07:16] (in frontend)
[13:07:33] so as a side-note, apparently age is more transitively-appropriate on v4 than we observed in v3
[13:07:45] (with Snorri's stuff on cache_text)
[13:08:08] true even on the new hits I created with query-params
[13:08:21] (that the age appeared to transit correctly)
[13:10:20] 10Traffic, 06Operations: ERR_CONTENT_DECODING_FAILED on certain png images from varnish-fe - https://phabricator.wikimedia.org/T148830#2734266 (10ema) p:05Triage>03Normal
[13:10:53] if it's truly some intermittent one-off that was driven by (a) a temporary varnish glitch or (b) a unique aspect of a client request (e.g. some crazy set of headers, especially AE:, from a specific client, that tripped up varnish gzip support)...
[13:11:17] it shouldn't have infected all of the varnish-fe's in esams from something like that in the frontend. we'd expect to see just one of them affected.
[13:11:47] so the glitch seems like it had to have happened in the cp3037 backend, or a further backend, or swift
[13:11:59] to end up shared so broadly in the frontends from natural clients that are source-hashed
[13:12:28] maybe the unique-crazy part has been cached in cp3037's varnish-be but only triggers the issue when fetching through another varnish as opposed to hitting -be directly with curl
[13:12:38] that it only happened in esams seems to point at cp3037, but then again global access patterns and all that. especially being serbia-related, this image's access patterns would be way different in esams than elsewhere
[13:12:58] ema: I can test that by manually purging it from a single esams frontend
[13:13:11] let's do that yeah
[13:13:37] ok I'll purge cp3037 since I have curl stuff up to hit that directly
[13:13:47] ok
[13:16:19] purge fixed it for cp3037
[13:16:31] well, fetching through :3127 (varnish-fe)
[13:16:36] let me do it again and go through nginx JIC
[13:16:50] why :3127 instead of :80?
[13:18:05] why not also, it's just that I always use :80 :)
[13:18:36] well in my head that's the new official port for varnish-fe. eventually nginx will own port 80, once all the other related blockers get done
[13:19:03] in any case, purge -> refetch even via nginx comes back right now
[13:19:10] for cp3037
[13:19:17] s/right/correctly/
[13:19:28] interesting
[13:19:54] so there's nothing funky currently stored on -be that triggers the issue when fetching from -fe
[13:19:55] using the cache hit on cp3037-be (which may or may not be the same cache hit that was there when the problem started. if it was evicted/purged since the problem started, we couldn't tell via Age: anyways)
[13:20:24] I wonder if age is transiting all the way from swift, or is set from eqiad varnish-be?
[13:20:45] that might tell us whether swift or something varnish did this
[13:21:31] ok, Swift sends no Age:, only < Last-Modified: Sun, 27 Oct 2013 23:17:31 GMT
[13:21:42] and Age: clearly isn't derived from that, so it must be the age of entry into the eqiad backend
[13:22:00] which means at least cp1072's cache hit is still the same one from when this problem started?
[13:22:03] ok, after the first fetch from swift age is 0, in other words
[13:22:05] since the age is still 6d+
[13:22:53] < Age: 525418
[13:22:53] < Via: 1.1 varnish-v4
[13:22:53] < X-Cache-Int: cp1072 hit/63
[13:22:59] ^ hitting cp1072 backend directly
[13:23:06] which is the same kind of ages we're seeing on the errant objects
[13:23:20] therefore, I think cp1072's output has been consistent throughout, and lacking CE:gzip
[13:23:54] so either cp1072 temporarily emitted CE:gzip with this object then later stopped doing so (for the same cache object)
[13:24:20] or cp3037 temporarily added CE:gzip on reception, then was later evicted/purged and cleaned itself up
[13:24:41] or cp3037 may or may not have ever evicted/purged since, but temporarily added CE:gzip to its response to several frontends
[13:25:05] or several frontends all added CE:gzip to their cache objects on reception from cp3037, and then whatever triggers that went away
[13:26:03] my gut feeling is that cp1072's temporary CE:gzip seems more likely
[13:26:13] the middle two seem the most likely (cp3037-be temporary fault, either on reception from cp1072, or on all its responses until the object was evicted/refreshed in cp3037-be, or on all its responses until some other unrelated triggering thing stopped)
[13:26:42] but multiple frontends got affected by the same issue at the same time then?
[13:27:00] well it seems unlikely multiple frontends faulted in the same way from any possible trigger
[13:27:12] right
[13:27:13] they're affected because cp3037 emitted bad output at some point
[13:27:28] the question is whether cp3037 or cp1072 was at fault, and how/why it happened and how/why it was temporary
[13:28:09] ah sorry, I misunderstood you. Yeah the most likely seems a varnish-be issue, we don't know which -be though
[13:28:11] but we know for sure (because of the Age: behavior) that cp1072 is still using the same unevicted/unexpired cache object it was using at the time the problem started, and doesn't have bad output now
[13:28:20] that seems to make cp3037-be more likely at fault
[13:28:51] and in general we've already seen multiple crazy issues with varnish<->varnish interaction
[13:28:56] unless cp1072's bug caused it to temporarily change the emitted headers for a cache object and then change back (still the same cache object)
[13:29:39] whereas with cp3037, it's possible the behavior was at least consistent: that it errantly added CE:gzip on reception/storage from cp1072, and cleared with a later purge/evict refresh, and we can't tell whether/when that refresh happened due to transitive ages
[13:30:54] both could have been at fault, but there's a broader range of believable causes with cp3037
[13:31:04] (or behaviors)
[13:44:20] digging through open and closed varnish issues related to gzip...
[13:48:33] bblack, ema: the new 4.4 kernel is running on a couple hundred systems as of now, can we roll this out on the cp hosts next week?
[13:48:38] also seems fine on cp1008
[13:50:11] 10Traffic, 06Operations: ERR_CONTENT_DECODING_FAILED on certain png images from varnish-fe - https://phabricator.wikimedia.org/T148830#2734326 (10ema) After further investigation we've noticed that the problem is not reproducible when forcing a cache miss by adding some random query parameters. Further, we've tried...
[13:51:42] moritzm: yes
[13:51:54] also LVS needs it, I can do at least the secondary LVSes today
[13:52:11] ok, great
[13:52:58] moritzm: nice, uname is now telling the truth! :)
[13:53:17] yeah, that ugly bug only happens after ABI changes
[13:54:52] (authdns is done already)
[14:04:51] 10Traffic, 06Operations: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#2734359 (10ema)
[14:04:54] nice
[14:06:02] 10Traffic, 06Operations: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#2734202 (10ema)
[14:08:54] ema: I'm trawling through 4.1 changes since our release now
[14:09:10] looks like 4.1.4 is beta1, but not yet released. a lot of it's just code-quality line noise, a few might be interesting fixes
[14:09:15] https://github.com/varnishcache/varnish-cache/commit/918e059ab08b13def7ebfb8c54717aea2825ce95
[14:10:27] ah nice one
[14:10:32] https://github.com/varnishcache/varnish-cache/commit/c3374b185e7db43db673590fdefc4424ca8de610
[14:11:15] https://github.com/varnishcache/varnish-cache/commit/0b0bf921f065ff746f38ff2c598f338f67f2cb02
[14:11:41] ^ this one in particular talks about leaking context data like gzip buffers...
[14:12:02] yeah
[14:12:26] https://github.com/varnishcache/varnish-cache/commit/9edb2b653f531b9c2456ba4afdf77de327b0153b
[14:12:32] ^ this one I think we looked at before, considered backporting once
[14:12:59] those are really the only bugfix commits that stand out
[14:13:24] a lot of the rest is docs updates, varnishtest fixups, minor code cleanup nits, or seemingly innocuous
[14:15:15] I could package the 4 useful fixes on Monday or would you rather wait for beta1 to be released?
[14:15:16] ema: in any case, I don't think we're getting debug value from the repro at this point, ok to just purge it everywhere?
[14:15:32] bblack: +1 for purging
[14:15:39] ema: we may as well wait, unless we can be sure there's no interaction between those fixups and the many many other minor commits since 4.1.3
[14:16:25] I think beta1 is out, maybe beta2 as well
[14:16:32] but I meant wait on 4.1.4 proper
[14:17:48] maybe moving to beta2 now is an ok path as well
[14:18:06] we have to reboot everything anyways heh, could install just before the reboot and kill two birds with one stone on cache wipes
[14:18:42] I donno, the timing on that is hard though
[14:18:59] we want to get through the reboot cycle on them all next week, that doesn't really leave time to progressively test beta2 on some clusters first, etc
[14:19:32] right
[14:19:44] yeah I'd say it's not really that urgent to upgrade
[14:19:50] (varnish)
[14:22:47] 10Traffic, 06Operations: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#2734393 (10BBlack) The specific repro URL for the Serbia map has been PURGEd now to clear up the issue for users, since we're not getting much debug v...
[14:32:17] 10Traffic, 06Operations: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#2734403 (10BBlack) To be clearer about what was debugged on IRC: this wasn't a case of actual bad gzip encoding. The object contents in all affected...
[14:33:20] I think the only way this could be related to the mime-type issue is if we have some kind of awful varnish code bug affecting backends that causes random leak of headers across from one request/object to another
[14:33:34] they're similar in the sense that a header was mysteriously-wrong, but that's about it
[14:33:56] (mysteriously-changed content-type from jpeg->html, mysteriously-added CE:gzip, both seemed transient but temporarily FE cached)
[14:35:24] and the leak, if any, happens very rarely (just like the mailbox stuff?)
:)
[14:35:47] you can imagine some hypothetical bug where, for example, a list of headers in request context rarely gets the header-count wrong and ends up reusing freed storage from a previous request's header or something
[14:37:38] for many such cases, those leaks would go unnoticed, since most headers that could move between requests would be relatively-innocuous
[14:37:49] only things like CL, CT, CE would really cause big issues
[14:47:33] perhaps you can somehow add a header at fetch time which encodes varnish's object id or whatever
[14:47:54] to detect that
[14:54:16] we could also just mirror them on the output side and try to find a glitch by comparison
[14:54:52] as in, have all backends emit "X-HeaderCheck: CL:X CE:Y CT:Z", and then on the frontend reception (or even output), compare those to the actual header values
[14:55:08] yes
[14:56:08] although there is the risk that it's just gonna stay consistent with the actual headers because all the headers are wrong? :)
[14:56:15] sure
[14:56:39] but if it's an actual rare header-leaking bug inside varnish, it's unlikely it manages to preserve correctness with that
[14:56:53] yeah i guess
[14:57:11] I guess we should be careful to apply it only on backend-most fetch, and check it at deliver time at all layers
[14:57:40] (except of course in the gzippable cases, it will change between backend-most fetch and storage/output from that server, from varnish gzipping)
[14:58:20] we could limit the whole thing to known-uncompressable files by their extension with a URI regex (avoiding potential mimetype issues)
[14:58:33] only \.(png|jpg|jpeg)$ or whatever
[15:50:08] oh circling back to the cp3021 cronspam
[15:50:17] it's a spare/decom, not an active cache
[17:42:35] 10netops, 06Operations, 05Goal, 13Patch-For-Review: Decomission palladium - https://phabricator.wikimedia.org/T147320#2734929 (10Dzahn) yep, "last heads up" sent to ops@ now
[19:34:57] 10netops, 06Operations, 10ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2735117 (10Cmjohnson) p:05Normal>03Unbreak! Moving this to higher priority on my workboard.
[21:07:06] 5.6 days for cp1049, 6.7 days for cp1072, 6.2 days for cp1074
[21:07:22] (uptimes ^) all 3x of those have large mailbox backlog
[21:07:43] there are others with longer, but not in eqiad
[21:07:54] I think eqiads have gone longer before without incident, too, just not this time around
[21:08:41] e.g. cp1063 just restarted ~6 hours ago after making it a full week
[21:09:00] so, 1-week restarts are not safely universally reliable
[21:09:31] I expect eqiad gets more churn than the rest, since it gets the access patterns of all 4x DCs, causing more churn/eviction than a storage set that only deals with one geographic region
[21:11:24] heh I guess that lacks context: there was a graphite 5xx threshold alert, turns out we had a growing 503 problem in cache_upload.
not big yet, but big enough to trip that alert
[21:13:01] it started ~18:00 and slowly ramped in by ~18:30 to a value of ~4 reqs/sec, stayed there for a good while, then started getting spiky around 19:54, with peaks up to 50/sec
[21:13:37] I've restarted those 3x backends with backlogs though, and it's dropped back off
[21:14:19] perhaps we should be thinking more like j.oe does with the hhvm issue
[21:14:52] have the cron fire rapidly (say every 10-15 mins), and actually check the mailbox backlog : total objects ratio, and restart if bad values are reached
[21:15:02] rather than restarting on fixed time intervals
[21:15:35] yeah, reactive is better than fixed time
[21:16:13] it would also be better if we could avoid them all restarting together, by distributing the crons or checking the cluster
[21:16:24] yeah that's tricky
[21:16:50] also, we want to protect the backlog/objects check to never fire if the process uptime is short. that kind of statistic is only valid or useful after it's been up a while
[21:17:49] the thing is, when it hits we want to react relatively-quickly. but yeah, even if it hits a few at once, we don't want them simul-restarting
[21:17:49] when it starts to grow, is it quick and you need to restart within minutes, or can it wait an hour?
[21:17:49] doesn't leave much window
[21:17:49] I think it grows slowly, so perhaps we can just set the threshold to catch it earlier on
[21:17:53] and then broaden the checks
[21:18:25] (but on the other hand, it's not necessarily a runaway until it's a runaway. we've seen them get up to fairly high backlogs and then recover themselves without ever causing a problem before, too)
[21:18:44] could you also check for the 503s?
[21:18:55] anyways, not something I'm going to engineer late friday afternoon. odds of this striking again this weekend are slim, after fixing these 3x
[21:18:59] so that you restart only when the queue is high and there are 503s
[21:19:12] ehehe, yeah, just talking :)
[21:19:16] volans: probably, but tricky to get that part right :)
[21:19:47] or we could deploy software that sucks less and not have to do all this duct-tape work around it :P
[21:19:53] eheheh
[21:20:01] true that!
[21:20:23] or we could have a magic automation framework that has visibility of the cluster and is able to do it safely and with all the info :-P
[21:20:26] * volans hides
[21:20:35] :)
[23:12:18] 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2639951 (10jeremyb) completed today?
[23:32:06] 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2735559 (10BBlack) >>! In T145758#2735539, @jeremyb wrote: > completed today? Yes, there was an unplanned incident with the codfw karthotherian se...
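
(Circling back to the reactive-restart idea from around 21:14-21:18: a minimal sketch of what such a check could look like. The instance flag, thresholds, ratio and restart command here are all assumptions, not the eventual production cron.)

    #!/bin/bash
    # Rough sketch: restart the backend varnishd only when the expiry mailbox backlog is
    # large relative to the stored object count, and only if the process has been up long
    # enough for the ratio to mean anything. All thresholds below are made up.
    INSTANCE=""            # e.g. "-n frontend" for a named instance; empty for the default one
    MIN_UPTIME=86400       # ignore the ratio during the first day of uptime
    MAX_LAG_PCT=10         # mailbox lag above 10% of stored objects => restart

    stat() { varnishstat -1 $INSTANCE -f "$1" | awk '{print $2}'; }

    uptime=$(stat MAIN.uptime)
    mailed=$(stat MAIN.exp_mailed)
    received=$(stat MAIN.exp_received)
    objects=$(stat MAIN.n_object)
    lag=$((mailed - received))

    [ "$uptime" -lt "$MIN_UPTIME" ] && exit 0
    [ "$objects" -eq 0 ] && exit 0

    if [ $((lag * 100 / objects)) -gt "$MAX_LAG_PCT" ]; then
        echo "mailbox lag ${lag} vs ${objects} objects, restarting varnish backend"
        systemctl restart varnish.service    # assumed unit name
    fi

(A randomized splay or some cluster-wide coordination would still be needed on top of this to avoid the simultaneous restarts mentioned above.)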