[00:42:08] <wikibugs>	 10Traffic, 10Operations, 10ops-ulsfo: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#4086871 (10Dzahn) Was just working on the Bastion related Wikitech pages due to Bast1001 being replaced and i noticed we have 2 bastions in ULSFO, 4001 and 4002.  stalled?
[00:45:29] <wikibugs>	 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4086872 (10ayounsi)
[08:28:25] <ema>	 so there's just been a bunch of fetch failures in codfw
[08:28:26] <ema>	 https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1522224487806&to=1522225335787&var-datasource=codfw%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[08:28:35] <ema>	 text@codfw, that is
[08:29:02] <ema>	 all hosts affected except for cp2010 which is depooled due to memory errors
[08:40:27] <ema>	 symptoms look different from the text@eqiad issues of last week: all hosts affected at once, no correlation with mbox lag, no significant backend connection pileup
[08:41:38] <vgutierrez>	 which varnish version is running in text@codfw?
[08:43:14] <volans>	 vgutierrez: 5.1.3-1wm3 (from cumin 'A:cp-text_codfw' 'dpkg -l | grep varnish' ) :D
[08:46:19] <ema>	 interestingly, some ulsfo hosts also are featured in varnishospital https://logstash.wikimedia.org/goto/7fda17b0224a535ac15168bcf697f74d
[08:47:21] <ema>	 namely 4030 and 4031
[08:49:16] <ema>	 which yeah, simply suggests that codfw was unhappy for a bit and ulsfo<->codfw backend probes also failed
[08:52:25] <ema>	 https://logstash.wikimedia.org/goto/d59a83a6a68fdce4fc70c99c34752ded
[08:53:12] <ema>	 nothing particularly exciting in slowlog ^, with the exception of a spike at 8:30ish (thus ~15 minutes after the text@codfw instability)
[09:23:53] <ema>	 https://gerrit.wikimedia.org/r/#/c/421542/ amended as WIP, with XXX comments here and there :)
[09:28:12] <ema>	 I'll go afk for a while o/
[10:37:54] <wikibugs>	 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4087564 (10Vgutierrez) >>! In T184942#4083351, @Krinkle wrote:  > The following counters are currently reported to StatsD from `ReqURL ^/w/load.php` ([va...
[12:29:33] <wikibugs>	 10Traffic, 10Operations, 10ops-ulsfo: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#4087858 (10BBlack) Was stalled on my lack of time dealing with the prometheus switchover and then switching peoples' SSH configs, otherwise it's ready for switchover.
[12:51:47] <bblack>	 re: the text@codfw issue, yes, the patterns all seem completely-different.  whatever it is, it's not the recurrent problems we've been looking at.
[12:52:35] <bblack>	 there was a correlated user-facing 503 spike circa 08:14 , which only hit the codfw side (codfw+ulsfo+eqsin, not eqiad+esams)
[12:52:59] <bblack>	 some small 503 rate spike was there for both codfw and eqsin, but the bulk of the impact was via ulsfo.
[12:53:32] <bblack>	 (could be indirect fallout of being a client of a temporarily-failing codfw, and higher rate could be due to higher traffic levels)
[13:12:00] <ema>	 bblack: hey :) So do we have any reason for capping 4xx ttl at 10m? We also have a comment saying that perhaps only 404 should be capped that way here https://gerrit.wikimedia.org/r/#/c/421542/4/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb
[13:13:54] <bblack>	 yes
[13:14:27] <bblack>	 because things that don't exist may get created, like new articles or files and such, and we don't want some braindead default TTL at various layers persisting a 404 for hours or days after something new is made :)
[13:15:16] <bblack>	 that's the only justification I know of, which is why my comment about only-404
[13:15:46] <ema>	 oh that makes sense
[13:15:58] <bblack>	 one wonders whether certain errors should be cached at all, though (probably aren't, somewhere internal in varnish, e.g. 400 bad req)?
[13:16:16] <bblack>	 certain 4xx errors I meant
[13:16:45] <bblack>	 404 is kinda "special" in that while it's signalling to a client "hey you request a thing that isn't a thing", it's obviously cacheable for any client requesting the same Thing
[13:16:56] <bblack>	 but many of the other possible 4xx, not so much... more likely to be client-specific
[13:18:09] <ema>	 right, 404 is a different kind of "client mistake" compared to, eg, 401
[13:19:10] <ema>	 I think we should use a ttl_404_cap vcl_config setting to make the distinction w/ ttl_cap more explicit 
[13:19:33] <ema>	 so that next time somebody comes and asks The Question (what's our ttl?!?) we remember the 404 exception
[13:21:06] <bblack>	 we probably had one at one time and I collapsed it down because we used the same setting everywhere :)
[13:21:25] <bblack>	 thinking about whether the other 4xx cases should be handled there, is one of those Things That Need Analyzing and Dealing With for a long time
[13:21:51] <bblack>	 (but I tend to think the other 4xx cases probably mostly fall into the bin of uncacheable type things, either internally in varnish or by default at any sane applayer)
[13:22:27] <ema>	 you mean an applayer using proper http status codes? :)
[13:25:59] <ema>	 so, rfc7234 explicitly mentions 404 as an example of negative results which are cacheable
[13:27:59] <ema>	 I assume other 4xxs could also be considered cacheable negative results? As in 405 for example: the method is not fine for this specific resource (but might be in the future?)
[13:41:29] <ema>	 https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-aggregate-client-status-code?orgId=1&panelId=2&fullscreen&var-site=eqiad&var-site=ulsfo&var-site=codfw&var-site=esams&var-cache_type=varnish-text&var-cache_type=varnish-misc&var-cache_type=varnish-upload&var-status_type=4
[13:42:01] <ema>	 interesting how at times we send more 429 per second than we do 404s
[13:43:18] <ema>	 anyways, the overwhelming majority is 404/429
[13:44:51] <bblack>	 technically 429 is an over-aggressive defense, someday we should rid ourselves of it intelligently.  but it's useful for now.
[13:49:05] <bblack>	 the PH/JP/HK move is much bigger btw, there's a more-notable stats dropoff in ulsfo
[13:49:09] <bblack>	 (->eqsin)
[13:49:58] <bblack>	 eqsin point-in-time jump over the 10m DNS TTL was ~3K/sec -> ~10K/sec
[13:51:34] <ema>	 yup and ulsfo went from ~19K/s to ~12K/s
[13:51:57] <bblack>	 at this rate, eqsin will easily overtake ulsfo in reqs/sec before we're done moving countries
[13:52:49] <bblack>	 (but if that jars with historical thoughts on the matter, keep in mind we offloaded facebook's huge traffic over from ulsfo->codfw a while back, which knocked down some ulsfo reqrate in general)
[13:53:30] <bblack>	 point-in-time isn't a great comparison though, we'll see how the averages look later
[13:54:32] <bblack>	 but we still have e.g. India to go, which is huge, and then a long tail of over populations that add up significantly
[13:54:52] <bblack>	 India and Pakistan will move from esams->eqsin rather than ulsfo->esams though, so different impacts
[13:55:03] <bblack>	 (any significant drop at esams is nice to have anyways)
[13:55:24] <bblack>	 err  "rather than ulsfo->eqsin" above :)
[13:55:30] <ema>	 still funny to me that India is one single entity for geodns in the same way as San Marino is 
[13:56:09] <bblack>	 well, we have coordinate data down to city-ish level for a lot of IPs in the world
[13:56:20] <bblack>	 but the more precise you try to get, the bigger the error margins are too
[13:56:45] <bblack>	 in most places "Country" -level is the reasonable compromise on accuracy for this kind of purpose
[13:56:51] <ema>	 plus, we have the granularity to route the Vatican to eqsin while keeping Italy routed to esams, which is cool :)
[13:57:02] <bblack>	 heh
[13:57:26] <bblack>	 if nothing else, governmental and business structures tend to align on countries, making IP blocks a bit more reliable at that level
[13:57:59] <bblack>	 but we do have the data and capability to split India if it makes sense
[13:58:17] <bblack>	 Russia is another interesting case in that regard
[13:58:36] <bblack>	 it's geographically HUGE, most of the population and infra is on the western side closer to EU/esams
[13:59:00] <bblack>	 but then there is population in the far-east of RU that's more-appropriately routed towards eqsin, probably. (e.g. Validivostok)
[13:59:06] <bblack>	 err however that's spelled
[13:59:11] <bblack>	 Vladivostok? :)
[13:59:49] <paravoid>	 I ran some tests a while back
[13:59:58] <paravoid>	 found IP ranges from Vladivostock
[14:00:05] <paravoid>	 they were all routed through Moscow etc.
[14:00:10] <paravoid>	 esams was still closer at the time
[14:00:17] <bblack>	 there's cables landing there from JP, etc though
[14:00:18] <bblack>	 seems odd
[14:00:41] <paravoid>	 yeah, but ISP's L3 topology doesn't necessarily follow that
[14:00:52] <paravoid>	 dunno, maybe it's diffferent now, or maybe I didn't find the right IP ranges
[14:00:55] <bblack>	 well no "etc", just from JP
[14:01:05] <paravoid>	 but the little I saw wasn't encouraging, 
[14:01:28] <bblack>	 https://www.submarinecablemap.com/#/submarine-cable/russia-japan-cable-network-rjcn
[14:02:07] <bblack>	 maybe it's only used for routing JP<->RU traffic between certain entities, and not more-broadly routeable via various transits
[14:07:09] <paravoid>	 volans just gave me catchpoint's IP in Vladivostok
[14:07:15] <paravoid>	 rostelecom IP
[14:07:29] <paravoid>	 ~150ms from esams, ~116ms from eqsin
[14:07:45] <paravoid>	 the former directly via our AMS-IX peering, the latter via PCCW
[14:07:56] <bblack>	 nice
[14:08:08] <paravoid>	 PCCW is asymmetrical though, who knows what the return path is
[14:08:11] <bblack>	 yeah
[14:08:23] <bblack>	 in general there are so many minor oddities
[14:08:27] <paravoid>	 seems there's a rostelecom<->PCCW peering in hong kong
[14:08:34] <bblack>	 as I was adding panels on the perf graph today for HK (etc)
[14:08:52] <bblack>	 I was noting that HK has had lower-than-possible "tcp start" latency to us since before we turned on any routing to eqsin
[14:08:54] <volans>	 paravoid: we could do some traceroute checks from their endpoints, one-shot things
[14:09:02] <bblack>	 so who knows wtf there, perhaps lots of TLS-terminating proxies :/
[14:09:30] <paravoid>	 :/
[14:09:38] <paravoid>	 rostelecom seems to be present in equinix HK, but not SG
[14:09:44] <paravoid>	 (and equinix tokyo)
[14:10:19] <bblack>	 e.g. during a low-point a week ago (several hours before initial geodns for SG->eqsin), HK "tcp start" metric was claiming 52ms (and that metric isn't just tcp actually, it's a browser metric that includes TLS negotiation)
[14:10:29] <bblack>	 no way 52ms TLS estab was possible to ulsfo :P
[14:10:40] <paravoid>	 is that 95p?
[14:10:45] <bblack>	 well...
[14:10:52] <paravoid>	 browser metrics can and have been buggy in the past
[14:11:12] <paravoid>	 sending supposed metrics for already established/reused connections and stuff like that
[14:11:16] <bblack>	 so sure there's the fact that there's some unreliability to browser metrics in the raw...
[14:11:56] <bblack>	 but also, in a related-but-bigger-scope sense that we can't solve right now in the midst of using these metrics, I'm almost certain the metrics we're using are way off in terms of statistical validity in general.
[14:12:27] <paravoid>	 averages of averages or averages of 95p and stuff like that you mean?
[14:12:31] <bblack>	 because they're being stored to graphite and then pulled up into grafana, and we just had this whole conversation the other day about how neither of graphite nor prometheus can store this type of data in a statistically-valid way for end-results
[14:12:40] <paravoid>	 prometheus can, I think
[14:13:27] <bblack>	 either way it's a basically a time-series database.  you can bin the data into a small number of bins, but you can't store per-event (e.g. per-http-request) raw numeric data over a wide range in a scalable way into graphite/prometheus that comes out the other side correctly-analyzable for averages/pNN over X time-spans, etc
[14:15:30] <bblack>	 for that, you need something more like the analytics->hive pipeline
[14:15:45] <bblack>	 which is where we should really be running calculations on ttfb-style per-request metrics
[14:16:51] <bblack>	 so, whatever it is we're looking at now for this stuff that runs via graphite->grafana, it has to be taken with a large grain of salt.  You can see things move and they move in expected directions, but it's probably not numerically-valid in any absolute sense, and the errors probably vary as you change time units and as traffic rates shift up and down, et
[14:16:56] <bblack>	 c
[14:20:16] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2006, cp2010, cp2017: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088153 (10ema) The memory error situation when it comes to codfw cache hosts is pretty bad. Besides cp2006, cp2010, and cp2017 (found rebooting), I've now checked SEL and the...
[14:20:59] <ema>	 so there are 4 other cache hosts with uncorrectable memory errors in codfw (2008, 2011, 2018, 2022)
[14:21:19] <paravoid>	 4?!
[14:21:21] <paravoid>	 seriously!?
[14:21:39] <ema>	 which means that 7/22 cache hosts in codfw have such errors :(
[14:23:23] <paravoid>	 warranty hasn't expired yet I see
[14:23:25] <paravoid>	 but expires soon
[14:23:45] <paravoid>	 (June 1st)
[14:23:53] <paravoid>	 so do file a task
[14:24:08] <paravoid>	 and mention that we have 7/22 cache hosts with those errors, ask rob to escalate this with dell
[14:24:20] <bblack>	 I'm writing some updates in general in the other task about lack of icinga reporting (or any action in general, automatically)
[14:24:48] <paravoid>	 I think we have a task about lack of icinga reporting for memory errors
[14:24:54] <paravoid>	 oh is that what you meant?
[14:24:59] <bblack>	 yeah
[14:25:04] <bblack>	 sec I'll finish writing there...
[14:25:13] <ema>	 the situation is partiularly bad for upload@codfw (4/10 hosts affected)
[14:25:41] <paravoid>	 file a task ASAP, Cc Rob
[14:26:14] <ema>	 yup
[14:26:57] <paravoid>	 bad batch probably, we could ask them to just replace all of those memories
[14:28:43] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088168 (10ema)
[14:29:02] <bblack>	 well we should characterize whether they're persistent or not too
[14:29:20] <bblack>	 that hosts have isolated SEL entries for some single-shot uncorrectable mem error months ago isn't necessarily a cause to replace
[14:29:26] <bblack>	 anyways... still writing
[14:29:38] <bblack>	 (and doing some research on linux edac along the way)
[14:30:20] <godog>	 FWIW yesterday I published https://gerrit.wikimedia.org/r/c/422110/ which should do part of that job
[14:31:08] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088172 (10faidon) These seem to be under warranty for another 2 months, so we should hurry up.  7 out of 22 identical hosts having memory errors so...
[14:31:09] <bblack>	 part of what I'm unclear on, is Linux used to have a sysctl for panicing when the kernel detected an uncorrectable error on memory that mattered, and I think that's changed more-recently.
[14:31:15] <bblack>	 (independent of userspace edac tools)
[14:32:55] <godog>	 yeah it'd be nice to be able to use sth like that
[14:34:59] <bblack>	 also it's notable that none of them currently show uncorrectables (or correctables) since last boot in codfw, in the kernel stats
[14:35:07] <bblack>	 maybe edac driver is currently broken for this hw?
[14:35:45] <bblack>	 the whole cp* fleet shows zero ue_count and zero ce_count
[14:37:27] <godog>	 ugh, and ipmi-sel shows more recent errors ?
[14:39:45] <ema>	 let's see
[14:40:34] <ema>	 so cp2006 for instance logged a UME in SEL on Mar-23-2018
[14:41:10] <wikibugs>	 10Traffic, 10DC-Ops, 10Operations, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4088202 (10BBlack) See updates in T190540 , quite a few codfw hosts have SEL entries for uncorrectable ECC errors that went by unnoticed (but we tend to notice on r...
[14:41:16] <ema>	 but I did reboot it 4 days ago (24th)
[14:41:33] <bblack>	 yeah I gave up on analyzing the whole thing before finishing my post above, and just left the various questions open, because it's a deep rabbithole :P
[14:41:36] <ema>	 23rd even
[14:41:55] <bblack>	 yeah it could be with all the upgrades (varnish+kernels) that none of them have been up long enough to show their past relatively-rare UEs
[14:44:23] <bblack>	 I would say the spike of it in codfw could be environmental rather than bad-batch, but
[14:44:55] <bblack>	 when we've lookat environmental issues in the past (e.g. the cpu temp issues), codfw tends to seem to have the best environmental conditions of our DCs
[14:45:25] <bblack>	 but what's really annoying is some of these errors first logged a very long time ago, and we just never knew and nothing ever panic'd either :P
[14:46:06] <ema>	 yeah
[14:46:12] <ema>	 OK to carry on rebooting the remaining codfw hosts without memory errors in SEL? 
[14:46:38] <bblack>	 well really, that's what we should think about
[14:46:54] <bblack>	 when it seemed like 1-2 isolated cases my reaction was depool for hwfix
[14:47:06] <bblack>	 (regardless of anything else about the situation, because why not we have plenty)
[14:47:25] <ema>	 right
[14:47:28] <bblack>	 with so many, we can bother to think harder about which of these we should depool or not
[14:47:43] <bblack>	 but either way, whether we chose to reboot a node or not, is orthogonal to whether it should be depooled for faulty memory
[14:47:44] <ema>	 depooling 40% of upload seems a bit extreme :) 
[14:48:29] <bblack>	 if the pattern (in SEL, apparently our only reliable record) is peristent, repeated, frequent, maybe the machine deserves depooling.
[14:49:08] <bblack>	 if the pattern in SEL is few entries, or maybe lots of history but it crops up ~ once every 3 months or whatever... it's still something we wish we had better data on before, and something that should be investigated for hwfix going forward, but maybe not worth depooling.
[14:51:37] <paravoid>	 bblack: if your investigation results into this not being a bad batch, update the task to stop robh from contacting Dell :)
[14:52:09] <paravoid>	 maybe I jumped the gun -- thanks for looking at this much more deeply :)
[14:53:04] <bblack>	 well either way, I think if we have multiple SEL entries for ECC uncorrectable on a bunch of machines in codfw that are about to exit warranty, we should work with Dell on replacing the indicated DIMMs
[14:53:23] <bblack>	 the rest is about what we do going forward for monitoring, and how critical it is that we depool various cases now.
[14:53:49] <paravoid>	 well, if we have suspicions that this is a bad batch and it's a matter of time until we get another 5 of those DIMMs fail, say a month after warranty expires
[14:53:59] <paravoid>	 then we should be proactive and ask Dell to just replace it all
[14:54:26] <bblack>	 will they do that on a bad-batch suspicion that we can back up with stats?
[14:54:38] <bblack>	 or will they say hey they didn't fail before warranty-time, not our problem financially?
[14:54:45] <paravoid>	 I don't know
[14:55:09] <paravoid>	 we can try :)
[14:55:47] <bblack>	 (plus who knows how batching actually works in this sense, Dell themselves may know better.  Maybe 7/22 hosts that are showing logs already all had DIMMs from Lot123, and the rest didn't.  server batches for our purchasing don't necessarily equate to mfg DIMM batches)
[14:59:48] <bblack>	 ema: back on the subject of depooling... at the end of the day what we're trying to avoid is unreasonable odds that an UE is going to silently corrupt cache data or code and cause bad results or crash.  If it's rare events on the timescale, I think it's reasonable to keep them pooled for now while we work the other angles.
[15:01:00] <bblack>	 https://xkcd.com/1737/ ! uncorrectable ecc error -> set dallas on fire and walk away! :)
[15:11:01] <ema>	 :)
[15:11:23] <ema>	 bblack: so repool the depooled ones and carry on with the remaining reboots?
[15:11:45] <robh>	 hrmm
[15:12:10] <robh>	 bblack: it would be ideal if we could roll through the entire order with memory testing though
[15:12:16] <bblack>	 ema: well looking at the cp2017 data as a guide:
[15:12:27] <bblack>	 ( https://phabricator.wikimedia.org/T190540#4085174 )
[15:12:45] <bblack>	 for cp2017 the rate there is ~3-4 times per year with one very recent
[15:13:06] <bblack>	 so, seems poolable.  but should check that others from the 7 are similar (as opposed to say a spam of recent events).
[15:13:39] <bblack>	 but now that I stare at that, I'm questioning what those SEL entries represent.  Could it be 1x SEL entry per reboot is when it notices/logs a past event? or did those SEL entries happen while the machine was live?
[15:14:03] <bblack>	 was the last one at the time of reboot (well given SEL's clock may be slightly off)?
[15:14:06] <bblack>	 50  | Mar-27-2018 | 16:12:53 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 02h
[15:14:34] <robh>	 Anything that shows any kind of SEL event in the last 6 months we should pull aside and run actual memory testing on
[15:14:47] <ema>	 reboot time was Mar 27 16:16:38
[15:15:34] <ema>	 (that's the first journald entry)
[15:16:08] <bblack>	 robh: well there's ~7 of those in codfw now.  the concern is 7/22 may indicate we're due for eventual problems on them all because of a bad batch of whatever.
[15:16:34] <robh>	 Understood, but I'm not sure how much Dell will be willing to work on it
[15:16:37] <bblack>	 (and it's suddenly-7 out of nowhere because apparently we're not good at noticing these via logs/alerts/icinga/whatever, so we've missed it piling up in the past)
[15:16:44] <robh>	 I mean, I doubt they'll proactivtely replace the memory across the entire order.
[15:17:00] <robh>	 we can ask, but it seems to be asking a lot
[15:17:36] <robh>	 They tend to ask for each system to be handled on its own in terms of warranty support.  We've only gone out and done batch replacements like this once prevoiusly
[15:17:44] <robh>	 and it was on systems 2 months old not at end of warranty
[15:18:30] <robh>	 bblack: can we go ahead and move through the entire fleet there of 26 cp systems from that order and run testing on them one by one?
[15:19:07] <robh>	 26 was the initial order on https://rt.wikimedia.org/Ticket/Display.html?id=9336, cp2001-cp2026
[15:19:51] <robh>	 It would be easier to present failed or passed memtests for a decision than hope Dell can look up which systems got which batches of memory when ordered (i doubt they can really check that effectively)
[15:20:39] <robh>	 (thought i realize that means taking cp systems in codfw offline)
[15:20:43] <robh>	 though even
[15:21:17] <bblack>	 how long does the memtesting take?
[15:21:49] <bblack>	 ah, the reason for the "7/22" vs 26-initial-order discrepancy
[15:22:02] <bblack>	 is we've decommed a 4-node cluster there and put those nodes on hold for future re-use (within Traffic)
[15:22:12] <bblack>	 so they're not actively in the live set, they're sitting as spare-roles right now
[15:22:56] <bblack>	 (if this situation takes several turns for the worse, we could check SEL on those and possibly swap some into service in place of the badmem ones.  but there's lots of reasons to avoid that if we can)
[15:23:20] <robh>	 Dependso n how much memory in a host, these are 256 so at least an hour
[15:23:29] <robh>	 i'd assume 1-3 hours per host closer to the 1 hour mark
[15:23:37] <bblack>	 ok
[15:23:54] <bblack>	 well, we can roll through them all doing that, in numeric order (since that alternates clusters, etc)
[15:24:22] <ema>	 notice that in one case (cp2010) the host had no memory errors in SEL prior to reboot (Mar-23-2018)
[15:24:31] <bblack>	 ema also has some varnishd upgrades and kernel reboot cycles ongoing though, we'll need to coordinate between all the ongoing things.  but in general taking out any given host via depool -> reboot is reasonable at a time.
[15:25:21] <ema>	 state of the currently depooled hosts:
[15:25:41] <robh>	 bblack: i'd like to have the testing happen across even failed hosts, since it may tun up more failed memory dimms
[15:25:45] <ema>	 cp2017 (upload) 11 errors between Sep-28-2015 and Mar-27-2018
[15:25:54] <ema>	 cp2006 (misc) 3 errors Feb-05-2016 Jun-01-2016 Mar-23-2018
[15:25:58] <robh>	 Dell is unlikely to simply send us memroy to replace 416 16GB Dimms
[15:25:58] <ema>	 cp2010 (text) 1 error Mar-23-2018
[15:26:08] <robh>	 which is what this batch has
[15:26:27] <bblack>	 see it seems odd to me the date alignments
[15:26:37] <bblack>	 I think the SEL entry is only getting created during a reboot, not during runtime
[15:27:01] <bblack>	 either there's some stored state from a past runtime UE that isn't recorded until reboot (seems unlikely...)
[15:27:01] <robh>	 I think we should create a google sheet for all 26 cp systems, and then note on it when memory testing takes place, and which dimms report bad at time of testing
[15:27:10] <robh>	 by we = i'll make it if it sounds reasonable
[15:27:11] <bblack>	 or it's that a minimal selftest happens on powerup and is catching these at that time
[15:27:17] <ema>	 or rather, POST checks spot the errors and log them into SEL?
[15:27:59] <bblack>	 robh: yes, sounds reasonable.
[15:27:59] <robh>	 We have seen memory errors show up on dells when they have not been rebooted recently in the past.
[15:28:32] <bblack>	 robh: (roll through all 26, coordinate on depool->reboot.  a few are already depooled and thus easy for now)
[15:28:38] <bblack>	 (or are currently spares)
[15:28:46] <robh>	 pefection, making sheet now
[15:28:51] <robh>	 perfection even
[15:29:41] <ema>	 I'm gonna keep cp2006, cp2010 and cp2017 depooled then
[15:31:42] <ema>	 the three hosts left to be rebooted are cp[2022-2024,2026] (all cache_upload)
[15:32:40] <bblack>	 ema: assuming they don't have an awful pattern in SEL already, ignore it for your purposes?
[15:32:44] <bblack>	 for those 3
[15:33:17] <robh>	 bblack: we can take down two at a time for testing right?  as long as ones odd and ones even?
[15:33:26] <robh>	 since it takes an hour to run tests, more efficient if two can go at once.
[15:33:35] <ema>	 the only host to be rebooted with memory errors in SEL is 2022 (one entry on Oct-06-2016 and another on Feb-08-2017)
[15:34:11] <bblack>	 robh: if you want to do 2x-at-a-time that's fine.  rather than odd/even, just go serially (e.g. cp2001+2002, cp2003+2004).  the way the clusters are laid out that makes it best.
[15:34:27] <wikibugs>	 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4088342 (10ayounsi)
[15:34:27] <robh>	 yep, im listin gout the steps will have you review before we escalate for papaul to do testing
[15:34:29] <bblack>	 (except of course the ones that are already spared or depooled, can do all of those whenever for now)
[15:34:33] <robh>	 indeed
[15:34:42] <robh>	 https://docs.google.com/spreadsheets/d/1zw8lXpqh9KxgjUpGKufnQ76_kramD8_is4abBPo3TU0/edit?usp=sharing
[15:35:03] <bblack>	 well I think, now I have to question myself (re: "serially")
[15:35:14] <bblack>	 the way they're laid out for the original clusters is to rotate clusters
[15:35:33] <bblack>	 hmmm
[15:36:04] <ema>	 robh, bblack: so yeah, given that SEL patterns don't seem exceedingly scary for 2022 and there are no errors for the other two I'd finish up with the codfw reboots today unless you guys have objections
[15:36:22] <bblack>	 +1
[15:37:04] <robh>	 So we can trigger the memtest remotely, but the serial update of it is terrible and not reliable
[15:37:11] <robh>	 so we're better allowing papaul to hook up a crash cart to do so
[15:37:18] <robh>	 or we run the risk of missing an error =P
[15:37:36] <ema>	 something like memtest86+ ?
[15:37:38] <robh>	 ie: it shows error, serial doensnt update reliably, i hit a key, error is lsot
[15:37:43] <robh>	 it is memtest86+ basically
[15:37:50] <robh>	 just run via chipset on dell mainboard
[15:38:20] <robh>	 can one of you update that sheet on which ones are already offline?
[15:38:47] <ema>	 sure
[15:39:03] <robh>	 also, each pair papaul offlines we'll need to coordinate with him since he cannot bring them back into service right?
[15:39:50] <robh>	 added to https://phabricator.wikimedia.org/T190540
[15:39:52] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088357 (10RobH)
[15:40:03] <ema>	 robh: sent you a request for write access to the spreadsheet
[15:40:06] <robh>	 'Memory Testing for cp2001-cp2026' section
[15:40:11] <robh>	 oh, it should be write for all wmf, fixing
[15:40:26] <robh>	 fixed!
[15:40:32] <robh>	 anyone in WMF with link can now edit it
[15:41:10] <robh>	 ema: try now?
[15:41:23] <robh>	 may have to close and reopen it
[15:41:51] <ema>	 robh: it works, thanks!
[15:41:55] <robh>	 Also I dont want it to seem like I'm avoiding dragging our Dell team in to fix this, I'm not so much avoiding as getting more info before I involve them =P
[15:42:09] <robh>	 I realize this has a very short period of warranty left ;]
[15:42:27] <bblack>	 it makes sense
[15:42:54] <robh>	 if we can show each of these failed hosts has multiple dimm failures, our demand to replace them all has far more support with a bunch of failed tests than some event log entires.
[15:43:27] <ema>	 I've marked the currently depooled hosts as such
[15:43:44] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088365 (10RobH)
[15:43:51] <robh>	 cluster is nice
[15:43:59] <robh>	 ie: papaul wont test more than one host per cluster
[15:44:17] <robh>	 (unless its arleady one of the depooled ones which dont matter)
[15:45:56] <ema>	 one thing I've noticed is that hosts which will ask for user input before rebooting show the following multiple times during POST:
[15:45:58] <ema>	 Running In-System Characterization...
[15:46:05] <ema>	 it just happened to cp2022
[15:46:25] <bblack>	 robh: added cluster column to avoid hitting 2x in one cluster
[15:46:34] <bblack>	 (and marked the perma-depooled spare 4x ones)
[15:48:18] <robh>	     start at cp2001, and work upwards.
[15:48:19] <robh>	         Systems that are already depooled can be tested at any time (suggest testing all depooled first, then starting at cp2001 and working upwards.)
[15:48:19] <robh>	         Systems that are perma-depooled as spare can be tested at any time.
[15:48:20] <robh>	     of the pooled systems, do not take down more than 1 server per pool (text/upload).
[15:48:22] <robh>	     systems will be automatically depooled by pybal/confctl, no need to manually depool.
[15:48:24] <robh>	     copy down any errors in the SEL before memory testing, the SEL has to be wiped prior to tests (or it will report on SEL errors.) (command: racadm getsel)
[15:48:26] <robh>	     clear out the SEL on drac (command: racadm delsel)
[15:48:28] <robh>	     run memory tests on system, monitor result and update this google sheet.
[15:48:30] <robh>	     if system passes memory testing, reboot it back into its OS and note it on the google sheet. coordinate in #wikimedia-traffic to bring servers on and offline.
[15:48:32] <robh>	 hrmm, that pasted ok...
[15:48:34] <robh>	 those are the steps i listed
[15:49:05] <robh>	 since papaul isnt taking them offline but we are
[15:49:13] <robh>	 i assume there is a nice way to shut these down? =]
[15:49:24] <robh>	 (recall he doesnt do root level stuff really)
[15:49:40] <ema>	 `shutdown -h now` should do  
[15:50:28] <ema>	 there's a systemd unit taking care of depooling on shutdown and repooling at startup
[15:51:03] <ema>	 but /var/lib/traffic-pool/pool-once needs to be created first IIRC for repool to happen
[15:53:09] <ema>	 so yeah:
[15:53:12] <ema>	 touch /var/lib/traffic-pool/pool-once
[15:53:17] <ema>	 shutdown -h now
[15:53:45] <ema>	 assuming we want auto-repool at startup time 
[15:55:26] <robh>	 well, when he brings one online before he starts the next one we'llh ave to check status
[15:55:43] <bblack>	 eh
[15:55:52] <bblack>	 let's skip the auto-repool and just coordinate that for manual?
[15:56:12] <ema>	 +1
[15:56:13] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088416 (10RobH)
[15:56:17] <robh>	 if system passes memory testing, reboot it back into its OS and note it on the google sheet. coordinate in #wikimedia-traffic to bring servers on and offline. Do not take down another system until we confirm the system just tested is back online.
[15:56:24] <bblack>	 I think in the past I've made the case we should really delete the autorepool functionality completely, just never gotten around to it.
[15:56:49] <bblack>	 it's problematic because you can't predict that a rebooted node will ever come back in reasonable time, and then it's going to auto-repool itself at a surprise future date.
[15:58:47] * volans remember to have already had this discussion
[15:58:58] <volans>	 we could check that the lock is not older than X hours
[15:59:58] <bblack>	 yeah, that too
[16:00:49] <volans>	 or just make the whole procedure a task in the soon-to-be-developed stiwtchdc-spinoff :-P 
[16:00:52] * volans hides
[16:01:20] <robh>	 so are the steps i list on https://phabricator.wikimedia.org/T190540 good
[16:01:27] <robh>	 and can i get papaul working on it =]
[16:04:47] <robh>	 ?
[16:05:11] <bblack>	 +1
[16:07:41] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088458 (10RobH) a:03Papaul After reviewing with traffic team, we're goign to test memory in all of these.  I've updated the task description with...
[16:33:36] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088571 (10Papaul) All those systems are running outdated IDRAC  and BIOS version. I will like to update the IDRAC and BIOS first before running the...
[16:55:20] <ema>	 all cache@codfw hosts rebooted into new kernel
[16:58:03] <wikibugs>	 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4088630 (10ayounsi)
[16:59:19] <bblack>	 ema: 2002 didn't come back?
[17:00:19] <ema>	 bblack: I haven't rebooted 2002
[17:01:19] <ema>	 today I've done 2022, 2023, 2024 and 2026
[17:03:31] <ema>	 bblack: but I see on the spreadsheet that 2002 had its bios updated by papaul
[17:04:42] <ema>	 I have to go now o/
[17:15:10] <robh>	 you guys may want to join #wikimedia-dcops
[17:15:15] <robh>	 while papaul is working on those systems
[17:15:18] <robh>	 just to follow along
[17:16:40] <bblack>	 ah ok, right
[17:17:05] <bblack>	 I'm not following things well in general, still endlessly typing a book into a ticket reply
[17:17:18] <robh>	 no worries, cp2002 was rebooted but he said he shouldnt have done so quite so quickly
[17:17:21] <robh>	 but it happened =P
[17:17:37] <robh>	 so he upgraded bios there, but yeah, directions state not to touch those until the others are done
[17:17:48] <robh>	 (test the perma depooled and other depooled hosts first)
[17:41:10] <wikibugs>	 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4088764 (10Tgr)
[18:43:05] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088993 (10Papaul) Cp2003 result   {F16367386}
[19:36:45] <wikibugs>	 10Traffic, 10Operations, 10Patch-For-Review: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742#4089162 (10BBlack) Looking at ntp::chrony now as I've noticed the above dns5001 switch.  There seems to be nothing in there about local peering, or about clock consistency...
[22:37:05] <wikibugs>	 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4089633 (10Krinkle) Thanks @Vgutierrez !
[22:53:50] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4089711 (10Krinkle) Forgot to say: The aforementioned workaround is not actually a workaround (sorry). The hostna...
[22:59:39] <mutante>	 i came here to ask about cp2006
[22:59:48] <mutante>	 saw backlog .. logging out again, heh
[23:01:28] <mutante>	 and of course everything recovers about 10s later :)