[01:14:18] 10Traffic, 10Operations, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4089968 (10Krinkle) 05Open>03declined Agreed. If anything, this may also very well fix issues for people writing new things with... [01:14:54] 10Traffic, 10Operations, 10User-zeljkofilipin: Figure out how Varnish errorpage was enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4089970 (10Krinkle) [01:15:12] 10Traffic, 10Operations, 10User-zeljkofilipin: Figure out how Varnish errorpage was enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4074032 (10Krinkle) 05declined>03Resolved a:03BBlack [01:18:47] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4089973 (10Krinkle) [06:47:33] good morning [06:47:43] there are a few 501s on cache_upload today [06:47:45] https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-aggregate-client-status-code?orgId=1&panelId=2&fullscreen&from=1522273059758&to=1522305989216&var-site=esams&var-site=eqiad&var-site=ulsfo&var-site=codfw&var-site=eqsin&var-cache_type=varnish-upload&var-status_type=5 [06:51:53] uh, the few I found are caused by a funny UA (that we've seen already in the past) [06:51:59] User-Agent: Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98) [06:58:16] oh interesting, those are unsatisfiable range requests [06:58:44] swift should respond with a 416 in those cases, not with 501 [07:02:38] this was already opened as T147162, reopening [07:02:39] T147162: upload.wikimedia.org returns HTTP 501 instead of 416 for non-satisfiable byte ranges - https://phabricator.wikimedia.org/T147162 [07:03:12] 10Traffic, 10Operations, 10media-storage: upload.wikimedia.org returns HTTP 501 instead of 416 for non-satisfiable byte ranges - https://phabricator.wikimedia.org/T147162#4090176 (10ema) p:05Triage>03Normal [07:16:13] 10Traffic, 10Operations, 10media-storage: upload.wikimedia.org returns HTTP 501 instead of 416 for non-satisfiable byte ranges - https://phabricator.wikimedia.org/T147162#2683206 (10ema) Reopening this bug as swift still returns `501` when it should return `416`. I have noticed [[https://grafana.wikimedia.o... 
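A quick way to reproduce the unsatisfiable-range behaviour discussed above (T147162) is to ask for a byte range that starts past the end of an object and compare the status code with the RFC 7233 expectation of 416. This is a minimal sketch, assuming the `requests` library and taking any existing upload.wikimedia.org object URL as a command-line argument; it is not tied to the specific requests that produced the 501s.

    # Probe how an unsatisfiable byte range is answered.
    # Expected per RFC 7233: 416 Range Not Satisfiable; T147162 is about
    # Swift's rewrite.py answering 501 instead.
    import sys
    import requests

    url = sys.argv[1]  # pass any existing object URL on upload.wikimedia.org

    head = requests.head(url, timeout=10, allow_redirects=True)
    size = int(head.headers.get("Content-Length", "0"))
    print(f"object size: {size} bytes")

    # Ask for a range that begins well past the end of the object.
    resp = requests.get(url, headers={"Range": f"bytes={size + 1000}-"}, timeout=10)
    print("status:", resp.status_code)                      # 416 expected, 501 observed
    print("Content-Range:", resp.headers.get("Content-Range"))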
[08:23:07] 10Traffic, 10Operations, 10monitoring: prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992#4090268 (10ema) [08:23:19] 10Traffic, 10Operations, 10monitoring: prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992#4090279 (10ema) p:05Triage>03Normal [08:27:29] 10Traffic, 10Analytics, 10Operations: Investigate and fix odd uri_host values - https://phabricator.wikimedia.org/T188804#4090285 (10ema) p:05Triage>03Normal [08:27:51] 10Traffic, 10DNS, 10Operations, 10Release-Engineering-Team, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4090287 (10ema) p:05Triage>03Normal [08:28:31] 10Traffic, 10Analytics, 10Operations: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4090289 (10ema) p:05Triage>03Normal [08:30:20] 10Traffic, 10Operations, 10Pybal: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993#4090290 (10Vgutierrez) p:05Triage>03Normal [08:33:33] 10Traffic, 10Operations, 10Pybal: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993#4090302 (10Vgutierrez) [08:33:52] *sigh*... wikibugs is too chatty sometimes [08:34:17] I like it nonetheless [08:34:21] our friendly bot [08:59:13] so looking at graphite's varnish caching vs the prometheus version: [08:59:18] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&panelId=8&fullscreen&orgId=1&var-cluster=All&var-site=eqiad [08:59:25] https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-caching?refresh=15m&panelId=8&fullscreen&orgId=1&var-cluster=All&var-site=eqiad [08:59:41] the prometheus version does not seem correct [09:00:36] I'm rebooting eqiad hosts and the overall eqiad hitrate is going up [09:01:01] ack [09:01:04] I'll check that later [09:01:18] I just updated cp* mtail to 3.0-rc5 [09:02:05] to be able to merge the varnishbackend ttfb change [10:02:39] 10Traffic, 10Operations, 10ops-ulsfo: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#4090724 (10fgiunchedi) FWIW I'm happy to assist with the Prometheus part, e.g. next week [10:14:39] 10Traffic, 10Operations, 10media-storage, 10User-fgiunchedi: upload.wikimedia.org returns HTTP 501 instead of 416 for non-satisfiable byte ranges - https://phabricator.wikimedia.org/T147162#4090776 (10fgiunchedi) a:05fgiunchedi>03None Quite possible! Initially I thought it might have to do with thumbna... [10:15:21] ema: changing npre_commands triggers the removal of the npre monitor? 
(https://gerrit.wikimedia.org/r/#/c/421338/3..4/modules/varnish/manifests/logging/xcps.pp) [10:17:49] vgutierrez: see latest comment on https://gerrit.wikimedia.org/r/#/c/422106/ [10:18:00] yup, just read that [10:18:08] (bot it's nice <3) [10:29:02] 10Traffic, 10Operations, 10media-storage, 10User-fgiunchedi: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902#4090814 (10fgiunchedi) [10:29:05] 10Traffic, 10Operations, 10media-storage, 10User-fgiunchedi: upload.wikimedia.org returns HTTP 501 instead of 416 for non-satisfiable byte ranges - https://phabricator.wikimedia.org/T147162#4090816 (10fgiunchedi) [10:30:13] 10Traffic, 10Operations, 10media-storage, 10User-fgiunchedi: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902#3866584 (10fgiunchedi) As per related {T147162} this is likely a bug in our usage of `webob` in `rewrite.py`. We should be using `swob` which is what swift itse... [10:54:50] bblack, ema: re prometheus smoothness | varnish caching graph not being right, https://snag.gy/2So0Aw.jpg --> Green: rate()[5m], Yellow: rate()[2m] [10:55:10] rate()[2] mimicks way better graphite behavior [10:55:21] *2m [10:59:03] godog ^^ [11:13:13] vgutierrez: ow! I didn't expect so much change, which dashboard / expressions are involved? [11:13:45] https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-caching VS https://grafana.wikimedia.org/dashboard/db/varnish-caching [11:13:57] also: https://gerrit.wikimedia.org/r/422910 [11:14:03] O:) [11:18:17] vgutierrez: I'm looking at this panel, which is the one you have in the screenshot it looks like https://grafana-admin.wikimedia.org/dashboard/db/prometheus-varnish-caching?refresh=15m&orgId=1&panelId=8&fullscreen&edit [11:19:03] the expressions in that one don't seem to use the :rate5m metric but :sum, at least in the current version of the dashboard [11:20:15] it doesn't have to end up as un-smooth as the original, that's not the goal (to get things so unsmooth) [11:20:41] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#1824561 (10Aklapper) [11:20:48] it would be nice if it were a little less smoothed than it looks at present (e.g. 7d view has very little detail), but some smoothing is ok [11:21:38] but either way, the more important thing is let's try to make sure whatever the pipeline is now (from source -> grafana), that it has some proper statistical meaning and isn't a meaningless avg-of-avg-of-whatever kind of situation [11:23:09] (or for bonus points, it would be nice if smoothness was a variable that could be picked from a dropdown, or automatically-scaled to a reasonable value based on the time-width of the graph) [11:23:38] godog: :facepalm: [11:23:40] sometimes the extra detail is annoying and you do want to see the smoothed-out changes (day-vs-day, week-vs-week). [11:23:50] sometimes you really want to see the per-hour detail in small steps [11:24:38] india going online for eqsin shortly [11:24:57] (big jump, but we're already done with most of their daytime too) [11:25:10] vgutierrez: heheh it happens! I was suspicious because changing the rate period usually (un)smooths things out but not by that much [11:26:15] bblack: awesome! 
agreed re: statistical meaning of the pipeline [11:26:20] going to lunch [11:51:28] fixed: https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-caching?refresh=15m&panelId=8&fullscreen&orgId=1 [11:55:24] vgutierrez: what's with the odd stats event that seems to happen ~ 2018-03-28 ~17:00 ? [11:55:46] is that artificial from fixing some earlier part of the pipeline in how it gathers things up? [11:55:52] vgutierrez: oh very interesting! [2m] vs [5m] [11:58:31] it's an interesting thing to focus on, I think something's not-quite-right about how DCs/sites are aggregated [11:58:40] or how clusters are aggregated, or something [11:59:24] to be clear, I'm talking about the rapidly-vanishing Pass slice of the pie in: [11:59:27] https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-caching?refresh=15m&panelId=7&fullscreen&orgId=1&from=1522254550792&to=1522258934813&var-cluster=All&var-site=All [11:59:49] if you limit to upload+misc, that effect is gone [12:00:16] (or either of upload or misc separately) [12:00:35] text is clearly the source of the wide pass-band, but then when you look at text in isolation... [12:00:59] there's a bumpy thing around 17:09, but it doesn't really dissappear like it does in the 3-cluster aggregate... [12:01:32] I wonder if something changed about aggregation (e.g. when 3 clusters are combined, they're considered equal parts, when they should be combined based on their reqs into a single whole) [12:01:50] (or something like that at the dc or site aggregation level) [12:02:34] the bumpy thing seems eqiad-specific [12:02:35] https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-caching?refresh=15m&panelId=7&fullscreen&orgId=1&from=1522254550792&to=1522258934813&var-cluster=cache_text&var-site=codfw&var-site=esams&var-site=ulsfo&var-site=eqsin [12:03:01] either way, the total aggregate graph has no such "loss of most of pass%" over that same time range in the non-prom version: [12:03:04] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&panelId=7&fullscreen&orgId=1&from=1522261626095&to=1522266057757 [12:13:53] ema: so thinking on https://gerrit.wikimedia.org/r/#/c/421542/ , a few things: [12:15:37] 1) There's a couple of ttl-conditions in the middle there, the >0 for dealing with keep, and the >60 for swizzling. We should maybe do to make that work out logically the same as before (to not apply to no-store kinds of objects) unless we can really prove to ourselves that such objects would already have ttl<=0 and the original "if no-store ttl=0" clause was redundant/useless. [12:16:33] (but looking in the varnish source, it seems the core C code only deals with CC:max-age and CC:s-maxage, I don't see where it ever pays attention to no-cache|no-store|private except in the default VCL) [12:17:31] for that matter, I'm not even sure what the standards interpretation (or validity) would be for something like "CC: no-store, no-cache, private, max-age=1000" or whatever [12:18:16] a zero max-age/s-maxage would seem to imply we can't usefully cache, but I don't know what it means to have non zero maxage(s) + otherwise-uncacheable headers. [12:19:41] the default vcl's vcl_backend_fetch is interesting to look at for comparison of what the expected/normal logic there is [12:19:50] err sorry, vcl_backend_response [12:20:53] 2) We should put some thought into how grace works with multi-layer, since this work has implications for that case. 
I don't think we've ever quite handled it in an ideal way (or at least, haven't understood fully whether we do or don't) [12:21:20] in the simple case of 1x varnish sitting between N clients and an applayer, it's easy to reason about how grace works out. [12:21:48] but how does it really work when an object is undergoing a true point-in-time expiry->refresh and we have two varnish layers in the pipeline? [12:23:58] picture just 2x varnishes stacked. You have 50 clients of the varnish-fe chugging along requesting /foo fairly frequently between them. It has decent cacheability, but the current hit-object (which is stored in both layers) is going to reach ttl=0 at 12:34:56 and has a 5-minute grace. [12:25:45] so the expiry-point passes, and then at 12:35:00 (4 seconds post-expiry), the first request for /foo comes in from a client, possibly followed by a few more clients before all of this stabilizes back into normal cached objects at both layers... [12:26:25] C1 requests /foo from varnish-fe, which serves it a stale object while kicking off a background fetch from varnish-be. [12:26:49] varnish-be in turn also serves a stale grace object to varnish-fe in response to the bgfetch, while doing its own bgfetch. [12:27:24] the stale grace object arrives at varnish-fe, but has say max-age:1000 + age:1005, implying TTL=-5 [12:28:34] (right there my first question is: what happens with that bgfetch result? does it get thrown out because there's no waiting client and no positive TTL? is this a case where varnish-fe's VCL needs some explicit behavior in the area we're editing?) [12:29:18] while varnish-be is waiting for its bgfetch to give a new live/fresh/correct object, which may take a while [12:29:24] several more clients request /foo from varnish-fe [12:30:18] I guess they keep getting stale responses. I kind of assume the bgfetches coalesce if they're rapid, but they're all returning uncacheable quick results so far, so it's still a mini-storm (but serialized by coalesce to one-at-a-time). [12:30:50] (but honestly, I don't know for sure if the grace bgfetches internally coalesce or not) [12:31:35] ...and then eventually the initial bgfetch from be->applayer succeeds, and everything repopulates and all is right in the world. [12:32:24] but what happens re varnish-fe->varnish-be traffic during that window while it's all waiting on the applayer to refresh, and clients are getting stale responses, is kind of a mystery to me as noted above. [12:33:24] 10Traffic, 10Operations: Unwanted service startups and their triggers - https://phabricator.wikimedia.org/T191017#4091056 (10ema) [12:33:44] 10Traffic, 10Operations: Unwanted service startups and their triggers - https://phabricator.wikimedia.org/T191017#4091066 (10ema) p:05Triage>03Normal [12:34:45] (but it seems like our handling of ttl<=0 in related code is going to affect how the varnish-fe deals with these already-expired responses it gets from the varnish-be grace results) [12:36:49] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4091075 (10Cwek) Can I ask something? How to measure where traffic should route through? latency? I suggest a website... [12:39:26] bblack: (1): we should probanly use something like `if ! CC:(private|no-cache|...)` as a guard. Interesting point about objects with CC:private and maxage>0, I (randomly) guess that anything indicating uncacheability should win over maxage/s-maxage? 
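To keep the two checks being discussed here straight, this is a rough model in Python (not the actual VCL): the Cache-Control "uncacheable" guard bblack suggests in point (1), decoupled from ttl<=0, and the arithmetic from the grace walkthrough above where a backend's stale response with max-age:1000 and Age:1005 arrives at the frontend with an effective TTL of -5. Function names and the exact regex grouping are illustrative.

    import re

    UNCACHEABLE_RE = re.compile(r"(?:private|no-cache|no-store|max-age=0|s-maxage=0)")

    def is_uncacheable(cache_control: str) -> bool:
        # Mirror of the proposed CC guard: uncacheability wins over any max-age.
        return bool(UNCACHEABLE_RE.search(cache_control or ""))

    def effective_ttl(cache_control: str, age: int) -> int:
        # TTL as a cache would compute it: s-maxage (or max-age) minus Age.
        m = re.search(r"s-maxage=(\d+)", cache_control or "")
        if not m:
            m = re.search(r"max-age=(\d+)", cache_control or "")
        max_age = int(m.group(1)) if m else 0
        return max_age - age

    print(is_uncacheable("no-store, max-age=1000"))    # True: guard wins over max-age
    print(effective_ttl("max-age=1000", age=1005))     # -5: already expired on arrival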
[12:40:00] [yeah for caching purposes I think so, but I wonder what the overall meaning is, and whether it's legitimate at all from a client pov or something] [12:41:17] surrogate-control will eventually clear up some of the client-vs-cache meanings in applayer outputs, but we're not there yet. [12:44:46] mildly imbecile passing thought on (2), how about we try to avoid grace mode kicking in at multiple layers at the same time? [12:45:41] for example with increasing TTLs at backend-most-1, backend-most-2, ..., frontend+1 [12:47:55] anyways, first we need to figure out exactly how grace works with multiple layers and whether we have a problem or not [12:53:21] right, the default behavior here may not be problematic [12:53:41] so long as varnish-fe bgfetches in this case coalesce and don't create much perf harm [12:54:22] I tried to look at the idea of changing TTLs a bit before for other reasons [12:54:41] https://gerrit.wikimedia.org/r/#/c/364606/ (which is fairly broken I think) [12:54:50] and the more-recent "random 5% reduction" thing [12:55:33] both of those efforts, and your example of increasing TTLs by-layer, suffer from the logical problem of thinking that TTLs are heuristic suggestions and we can always get a fresh-er copy with a longer TTL when we desire it. [12:56:22] when in fact the applayers internal behavior for an object may well be: Object /foo will have contents "XYZ" until 12:34:56, at which point it is recalculated for new contents. [12:56:56] and all requests before 12:34:56 returning ever-decreasing TTLs (either a fixed Expires at that future date, or a fixed max-age with an age approaching that value). [12:57:13] and no fresher / longer-term object available until you make a request *after* 12:34:56 [12:57:29] in which case all attempts to shorten up a TTL or refresh-early at various layers just cause pointless churn but don't solve the problem. [12:58:28] (you can only do things about this sort of problem after the true TTL expires and we're off in grace/stale-time) [13:01:04] we should perhaps address the topic of grace separatedly when it comes to hfp/hfm vs "normal objects". In the latter case, grace would have the consequence of delivering stale objects which may or may not be problematic, while we know that having excessively short grace periods for hfp/hfm causes actual issues due to unwanted request coalescing [13:01:06] this, I suspect, is why in general "grace/stale" is the general-case answer, rather than taking the approach of just refreshing things early when TTL gets small [13:02:48] (whereas in the equivalent case for DNS caches, the typical/assumed applayer (authdns) behavior is not fixed point-in-time expiries, and thus a "background refresh hot objects early" approach works for avoiding stalls) [13:04:10] complicating all of this is that we cap TTLs artifically of course [13:04:45] when mediawiki serves, say, a 14d TTL object, typically we're capping that to 1d, and refresh-early sorts of strategies actually work to refresh our locally-capped TTL. [13:05:06] but you eventually run into the edge case, if MW's output has a fixed point-in-time expiry that's 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4091129 (10kchapman) @Tgr we are just putting it in the Declined TechCom-RFC workboard, not in Phabricator as a whole. For reference, this is how we approa... 
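To make the fixed point-in-time expiry argument above (the 12:56 example) concrete, here is a small hypothetical simulation; all names and numbers are made up for the sketch. The origin recalculates /foo at a fixed instant, so any refresh before that instant only returns the same object with an even smaller TTL, and only a request after the cutover gets a fresh long-lived copy.

    # Hypothetical illustration: when the applayer pins an object to a fixed
    # expiry instant, refreshing early never buys a longer TTL.
    EXPIRY_AT = 1000          # origin recalculates /foo at t=1000
    ORIGIN_MAX_AGE = 1000     # TTL the origin hands out for a *fresh* object

    def origin_fetch(now: int) -> int:
        # TTL the origin would send for a request made at `now`.
        if now < EXPIRY_AT:
            return EXPIRY_AT - now      # ever-decreasing TTL before the cutover
        return ORIGIN_MAX_AGE           # fresh object only after the cutover

    for t in (0, 500, 900, 990, 999, 1000, 1001):
        print(f"refresh at t={t:4d} -> ttl={origin_fetch(t)}")
    # Early refreshes at t=990/999 still get ttl=10/1: pointless churn. Only a
    # request at t>=1000 yields the new object, which is why grace/stale is the
    # general-case answer rather than refreshing early when TTL gets small.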
[13:11:38] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4091163 (10Nemo_bis) >>! In T189252#4091075, @Cwek wrote: > Can I ask something? > How to measure where traffic shoul... [13:17:50] so given that we're decoupling setting ttl=0 from "this object is uncacheable", should we perhaps use an internal header to mark uncacheable objects? As in those with beresp.http.Cache-Control ~ "(private|no-cache|no-store|max-age=0|s-maxage=0)" [13:19:04] we could set it at the backend-most layer and it might be useful all the way up to the frontend, and of course we can strip it away on the frontends [13:50:53] maybe? [13:51:19] if x-uncacheable (or whatever) is equivalent to the CC regex, may as well use the CC regex [13:51:46] there's also of course beresp.uncacheable locally (not layer-transitive), which we set based on CC, but also varnish sets [13:52:16] the main reason to decouple "uncachable" from ttl<=0 is that cacheable objects can also have ttl<=0 at certain points in time [14:15:09] bblack: new iteration! https://gerrit.wikimedia.org/r/#/c/421542/6/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb [14:20:13] mmh even though the else part should also be applied only to responses with status < 500 [14:31:43] this seemed like an easy patch and now the commit log has sections [14:46:33] moritzm: getting there! https://grafana.wikimedia.org/dashboard/db/cache-hosts-software-versions?orgId=1&panelId=1&fullscreen&from=now-24h&to=now [14:47:27] \o/ [15:14:12] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4091592 (10ayounsi) >>! In T189252#4091075, @Cwek wrote: > How to measure where traffic should route through? latency... [15:28:41] ema: https://gerrit.wikimedia.org/r/#/c/421053/ is what i was referring to on monday in the SRE meeting [15:29:19] this removes the ambiguity of pybal's Server.pooled variable that has confused us in the past, I hope [15:29:33] and with the test cases that have been added lately it should be somewhat safe to do so now [15:41:24] 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4091656 (10Papaul) @BBlack @RobH I did the test on already 6 of the systems that are depooled and upgrade also the IDRAC and BIOS. You can see the... [16:02:59] 10Traffic, 10Commons, 10Operations: Caching problem with file description page on Commons - https://phabricator.wikimedia.org/T191028#4091702 (10zhuyifei1999) [16:16:29] 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4091737 (10RobH) @Papaul: The remaining systems will need to be depooled and repooled one at a time for work, please coordinate with either myself o... [16:24:25] 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4091743 (10BBlack) Well we should maybe pause at this point and ask if this test is doing any good? It seems odd that 3/6 tested had the SEL entrie... 
[16:27:17] it's going to be interesting, to keep an eye on overall cache hitrates in esams/ulsfo as eqsin comes on, too (keeping in mind other artificial disturbances of course) [16:27:47] so far the initial trend is that SG has great cache hitrates, and that moving traffic there may improve esams hitrate (by reducing the regional pattern-mixing in esams by taking e.g. IN and PK away from it) [16:28:21] ulsfo hitrates is at least for now on a downwards trend, but that may reflect the relatively-tiny population still hitting it and how much of that mix is non-human traffic from the bay area (FB aside), I donno [16:28:33] (or it may recover in ulsfo after a few more days and look better) [16:29:15] it's kind of a separate interesting fallout, aside from latency impacts, of building more regional caching points. [16:29:30] the more of them you build, the more-regionally-localized the cache contents of each are, improving hitrates [17:26:34] 10Traffic, 10Commons, 10Operations: Caching problem with file description page on Commons - https://phabricator.wikimedia.org/T191028#4092016 (10Aklapper) Thanks for reporting this. https://commons.wikimedia.org/wiki/File:Ariano_Irpino_ZI.jpeg shows an image with a cloud and in the "File History" section the... [17:29:43] 10Traffic, 10Commons, 10Operations: Caching problem with file description page on Commons - https://phabricator.wikimedia.org/T191028#4092032 (10Aklapper) 05Open>03stalled //If// this is about reverting to the non-cloud version and the preview on top of the `File:` page still shows the cloud version, you... [17:33:25] 10Traffic, 10Commons, 10Operations: Caching problem with file description page on Commons - https://phabricator.wikimedia.org/T191028#4092058 (10Aklapper) [17:35:18] bblack: so im not sure what to do on the ones that failed and memtest didnt show the failure [17:35:22] im a bit stumped. [17:35:38] ref: cp20XX testing [17:37:51] im discussing a bit with papaul [17:38:00] cp2022 hasnt had bios upgrade, so i asked him to test memory without upgrading bios [17:38:10] and we'll see if the error shows on older bios versions, and then upgrade memory and repeat if it does [17:38:16] if it doesnt im stil stumped. [17:38:37] the bios could conceivably affect low-level hw settings or the selftest itself, but wouldn't change whether the memory errors in general [17:39:01] it's possible the memtest we're doing now (whatever it is) simply isn't intense enough to catch a rare or specific type of error that does affect runtime, too [17:39:30] yeah, once we upgrade the bios on everything (required by dell for most hardware warranty) [17:39:42] would it be feasible to return them to service to check for more memory errors? [17:39:49] or does that cause issues for endusers? [17:40:12] we don't know, it depends since we're questioning the reliability of the things that detect such things :) [17:40:31] my gut feeling is it's probably a relatively-rare error, not a completely-dead area of memory or whatever. [17:40:40] just using them under load seemed to produce the memory error right? [17:41:15] no, we haven't caught a logged/detected error at runtime. when we reboot the hosts, during startup POST testing or whatever, they trip and complain about the memory error and pause the reboot process. [17:41:21] ohh [17:41:39] well, then we should be able to see that if we reboot them a few times or is it super rare? [17:41:39] so, since the SEL entries correspond to the reboot... 
[17:41:45] [don't know] [17:41:57] likely the latter then, heh ;P [17:42:10] but we're kind of assuming at this point that since the SEL happened at reboot time instead of runtime, it's the startup POST tests that are actually claiming to catch the error [17:42:26] well, then i think i'm going to err towards replacing the dimms in question on each error per system. [17:42:30] (which are generally very minimal compared even to these short memtest86 runs) [17:42:35] but not sure i can get dell to do a lot more than that [17:43:03] relatedly, there's supposed to be runtime stuff at the linux level that catches these kinds of issues and logs and/or panics, and I suspect that isn't working at all anymore. [17:43:09] i mean, we can try, but it seems unlikely they'll want to go around replacing all the dimms in the entire batch when they arent showing failure [17:43:09] (because software changes) [17:43:41] you might try multiple reboots on one of these that had the SEL entry, and see if you can get it to trigger the failure that stops booting up and creates a SEL entry. [17:44:00] yeah, that is what we shall try i think [17:44:08] seems the next step [17:44:09] another possible theory: [17:44:19] keep them coming because im stumped. [17:44:29] it could be the memory's fine, and there's a bug in the BIOS that runs the POST tests where it flags an error when it shouldn't, and the bios upgrade fixes the bug [17:45:17] (in which case you can probably repro on an un-upgraded bios with enough reboots, but it will go away after bios update) [17:45:51] you'd think that kind of bugfix would be noted somewhere in some bios revision changelog, though [17:46:46] (... and I'm sure dell is awesome about making comprehensive changelogs public and easy to find for all the bios revs from old->new :P) [17:46:50] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-notice: Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4092111 (10Krinkle) [17:49:48] bblack: confirm ill have the cp2022 have reboot testing [17:49:51] before we flash bios [17:49:59] we'll try to make it error, and then flash bios and try again [17:51:27] 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4092127 (10RobH) Please note I've asked @papaul to memtest86+ cp2022 WITHOUT flashing the bios/drac. Once we have that result, we'll also then star... [17:51:58] hopefully the error happens quickly pre-bios update [17:52:01] and not at all post [17:52:09] that would be the ideal ;D [17:52:33] these are the small potential lies i tell myself through hardware testing all day ;P [19:12:00] 10Traffic, 10Commons, 10Operations: Caching problem with file description page on Commons - https://phabricator.wikimedia.org/T191028#4092635 (10zhuyifei1999) 05stalled>03Open Look like it's thumbnails did not get purged. 'Original file' https://upload.wikimedia.org/wikipedia/commons/c/ca/Ariano_Irpino_Z... 
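Regarding bblack's remark above that the Linux-level runtime detection of these memory errors may not be working any more: a minimal sketch for checking the kernel EDAC counters on a host, assuming the standard EDAC sysfs interface is present (if no memory controllers show up under /sys/devices/system/edac, that in itself says runtime detection is not active).

    import glob
    import os

    def edac_counts():
        counts = {}
        for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
            entry = {}
            for name in ("ce_count", "ue_count"):   # correctable / uncorrectable
                path = os.path.join(mc, name)
                if os.path.exists(path):
                    with open(path) as f:
                        entry[name] = int(f.read().strip())
            counts[os.path.basename(mc)] = entry
        return counts

    if __name__ == "__main__":
        found = edac_counts()
        if not found:
            print("no EDAC memory controllers registered - runtime detection not active")
        for mc, entry in found.items():
            print(mc, entry)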
[19:14:00] 10Traffic, 10Commons, 10Operations: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4092640 (10zhuyifei1999) [20:09:19] cp2022 shutdown for mem testing and rebooting for hw error [20:09:23] i put in sal [20:09:43] since its upload and was pooled, we wont depool another upload in codfw for testing while this is down [20:11:57] 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4092735 (10Papaul) cp2022 SEL "Normal","Sat May 30 2015 03:52:02","Log cleared." "Warning","Wed Jun 01 2016 17:39:30","Correctable memory error rat... [20:34:12] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4092761 (10ayounsi) [21:18:03] 10Wikimedia-Apache-configuration: noc.wikimedia.org has broken link to Server Admin Log - https://phabricator.wikimedia.org/T191085#4092965 (10PrimeHunter) [21:25:50] 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4093023 (10Papaul) cp2022 SEL after test "Normal","Thu Mar 29 2018 20:11:58","Log cleared." "Warning","Thu Mar 29 2018 20:14:22","Fan 5A RPM is les... [22:25:33] 10Traffic, 10Commons, 10Operations: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4091369 (10BBlack) >>! In T191028#4092635, @zhuyifei1999 wrote: > AFAIK, client-side purging on upload.wm.o has been intentionally disabled... [22:27:02] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4093321 (10Volker_E) [22:28:53] 10Traffic, 10Commons, 10Operations: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093330 (10zhuyifei1999) >>! In T191028#4093308, @BBlack wrote: > So to purge an image from the upload.wikimedia.org caches, you purge via M... [22:31:58] 10Traffic, 10Commons, 10Operations: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093337 (10BBlack) It definitely does purge Varnish for the original file on upload.wikimedia.org. An in general, when a replacement file i... [22:40:04] 10Traffic, 10Commons, 10Operations: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093343 (10BBlack) To be sure of what I'm saying, and perhaps provide some trace data that may help pinpoint whatever the actual problem is,... [22:47:05] 10Traffic, 10Commons, 10Operations: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093349 (10zhuyifei1999) Interesting. I'm unaware of that. Thanks. I issued two more purges, but https://upload.wikimedia.org/wikipedia/com... [22:47:16] 10Traffic, 10Commons, 10Operations: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093350 (10BBlack) Hmmm, now I'm noting what is probably the critical discrepancy here.... 
When I visit https://commons.wikimedia.org/wiki/... [22:49:31] 10Traffic, 10Commons, 10Operations, 10Thumbor, 10media-storage: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093352 (10BBlack) [23:12:40] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4093427 (10BBlack) So, intersecting our info at the top ("Y" for eqsin as best site, not zero-blocked), the peering u...
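Circling back to the thumbnail-purge thread above (T191028): a minimal sketch of re-issuing a purge for the affected file page via the MediaWiki action API, using the `requests` library. Whether the server-side fan-out from this purge actually reaches every thumbnail URL cached on upload.wikimedia.org is exactly the open question in that task, so treat this as the trigger, not proof of the behaviour.

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    resp = requests.post(
        API,
        data={
            "action": "purge",
            "titles": "File:Ariano_Irpino_ZI.jpeg",
            "format": "json",
        },
        headers={"User-Agent": "purge-sketch/0.1 (example)"},
        timeout=30,
    )
    print(resp.json())   # expect a "purge" list entry marking the title as purged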