[11:59:16] akosiaris & godog: https://phabricator.wikimedia.org/T227139 is goign to happen today if you can do the items you list as needing drain? [11:59:26] we wont be doing it until 10am [11:59:34] but figured if i tell you now you can do what you gotta do =] [11:59:48] (dont have to this second, but if it can happen for 10am pdu swap it would rock) [12:00:07] robh: ok will do [12:00:15] thank you! [12:00:31] after we do A3, A5 or A7 is next for today if you wanted to look ahead [12:00:42] they are all linked off the parent task https://phabricator.wikimedia.org/T226778 [12:08:26] do you have a notion of the schedule (which ones which days) for these ones coming up? it might help folks to plan their own schedules a bit [12:25:08] apergos: i just have today for now [12:25:13] A3, then A5, A7 [12:25:20] after that i have to coordinate a lot with cloudteam [12:25:33] I have a list of row b that clears DBA master review [12:25:54] I have no clue who is on cloud team and awake now [12:26:04] i tend to talk to the cloud team folks in pacific time ;D [12:26:40] so yeah, i'll track someone down from there later today, and that will allow us to schedule row b racks for wednesday and thursday [12:26:59] robh: arturo might be able to help for the cloud related things on this timezone [12:27:09] marostegui: thank you for clearing up a3! [12:27:11] o/ [12:27:30] arturo: Heyas, I haven't documented the contents of each rack in row B, but it is in netbox. https://phabricator.wikimedia.org/T226778 [12:27:36] the short story is this [12:27:49] we are swapping PDUs in rows A and B, as the existing ones are 6+ years old, and not great [12:28:12] (they use fuses, cannot be hot changed, have a single tower chassis for both infeeds rather than two independent towers) [12:28:36] Wednesday and Thursday, we would like to swap out as many racks in row B as we can (b5 is already done) [12:29:16] according to dba team, they have no issues with b1, b2, b4, b5, b6, b7 all going, none have db masters [12:29:22] * arturo context-switching to this [12:29:39] so, if you can review cloud team items sometime today to give us the cloud team recommended order of rack swaps, I'd appreciate it [12:30:02] ok, reviewing things [12:30:31] ie: wednesday maybe drain all of b1, b2 into b3 and b5 [12:30:35] robh: what is the expectation for power loss? [12:30:40] robh: b2 as long as it is done after thursday 05:30AM UTC [12:30:48] marostegui: oh, yes, sorry! [12:31:06] arturo: please read marostegui's comment on https://phabricator.wikimedia.org/T226778 in relation to what racks we can do first [12:31:25] the expectation of power loss is 4 out of 10. [12:31:29] maybe 3 [12:31:31] 3.5? [12:31:39] servers? [12:31:43] We didn't lose anything unexpectedly in a4 yesterday [12:32:00] but, it does involve moving a live PDU tower around in the rack and hoping you dont unseat any power cables [12:32:15] with two of us, we move very deliberately and we had no power loss yesterday [12:32:19] but, it can happen. [12:32:25] ok [12:32:41] my initial thinking is that completely draining the servers is not realistic for us [12:32:42] robh: when do you expect to do b5? [12:32:52] marostegui: b5 is done [12:32:56] Ah cool [12:32:57] thanks [12:32:57] it was the test rack [12:32:59] =] [12:33:12] marostegui: so put anything you want into it from row b and you wont have interference from us =] [12:33:15] so our approach would be more like select specific "important" VMs in each affected rack and reallocate them previous to the operations. 
[12:33:24] robh: yep, thanks [12:33:26] arturo: that sounds reasonable to me [12:33:57] arturo: so if you want, we can plan to do half (higher half) on wednesday [12:34:12] b7, b6, b4 [12:34:23] or atleast b7, and b6 [12:34:38] robh: I will comment this plan with the cloud team today, I need to get andrew involved in the conversation, just in case this interfere with other work he is currently doing [12:34:41] but b4 would also be nice. then the only thing in b3 is wikitech master [12:34:52] arturo: ok, please also note i am only in eqiad to assist chris this week [12:35:01] and after this week, the new guy, who seems good, will be helping [12:35:06] but of course i think im better ;D [12:35:21] ie: I recommend you guys reshuffle whatever you can for this week [12:35:23] robh: for b3, as I said, it is up to cloud :) [12:35:37] arturo: and then we'de like to move b3, and b1 on thursday [12:35:42] it is "their" database, as in: it is only used by them and wikitech [12:35:45] and then friday b2 [12:36:07] marostegui: we would also take special care triple checking that systems power connections [12:36:09] I only see cloudvirt1027 in B3, the other are decom [12:36:41] arturo: cool, do i need to copy notes to https://phabricator.wikimedia.org/T226778 or did you want to summarize after you investigate? [12:36:55] im about to head out to home depot for a drill bit and then onto eqiad, so going afk for about 45 minutes [12:36:57] I will investigate and write a proposal from our side [12:39:02] thank you! [12:42:55] robh: ok will do, I'll start in ~1h [13:38:29] ? [13:38:36] oh buffer playback issue [13:38:37] nm [13:41:20] robh: are hosts going to be downtimed in icinga for the duration of the work? [13:41:36] only if you tell me to like you did the ms-be system [13:41:38] otherwise no [13:41:46] if something loses power we want it to immediately alert imo [13:41:54] the goal is zero power loss [13:42:21] agreed, ok I'm not going to downtime unless I'm powering off things [13:42:48] yeah yesterday I think netmon and kubestage powered off but that wasn't unplanned and we got notifications [13:47:11] ok, i have to go afk and stage the pdus for the isntallation, ill check back in here and will also admin log before i do anything in a3 [13:47:40] but yeah, should be good to start the swap at 10am (so in 15min) [14:05:44] Ok please note we are working on A3 right now https://phabricator.wikimedia.org/T227139 [14:14:06] !log a3-eqiad pdu swap taking place now via T227139 [14:14:06] robh: Not expecting to hear !log here [14:14:08] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 [14:14:14] bah [14:14:17] whatever, works [14:16:20] heya godog yt? [14:16:42] been trying X-Container-Read but not having much luck [14:16:49] ottomata: yep I'm here [14:16:58] am looking at https://docs.openstack.org/swift/latest/overview_acl.html [14:17:07] have tried doing swfit post ottotest0 to set --read-acl [14:17:13] have tried uploading with X-Container-Read [14:17:25] have also tried creating new container and uplaoding with X-Container-Read [14:17:43] all still seem give Unauthorized for unauthenticated GET [14:19:49] ottomata: mhh ok, under the analytics swift account ? [14:19:56] yes [14:20:07] oh [14:20:12] maybe i'm requesting the url wrong? [14:20:22] possibly, what's the url? [14:20:34] e.g. 
[14:20:41] curl https://ms-fe.svc.eqiad.wmnet/v1/AUTH_analytics/ottotest0 [14:20:51] but, maybe i shouldn't have th AUTH_analytics part in there [14:21:53] the prefix should be fine, though there you are requesting a listing [14:22:04] does it work if you try downloading a single object ? [14:22:19] hmmmm it does! [14:22:40] ok now am trying eric's prefix then [14:22:42] with e.g. ?prefi= [14:22:44] like [14:23:01] curl -v 'https://ms-fe.svc.eqiad.wmnet/v1/AUTH_analytics/ottotest0?prefix=location_test1' [14:23:13] then yeah you are missing the listings acl I believe, it is mentioned in the acl overview [14:23:22] ah mh [14:23:30] AHHHH [14:23:31] got it [14:23:32] ok [14:23:50] although I think going with single objects instead of listings might be desirable, see my comments on the cr [14:24:08] not feeling very strongly, I can see arguments for both approaches [14:24:10] i agree i think too, but i think having listing ability would be good [14:24:15] yeah. there might be a lot of files... [14:24:19] and [14:24:31] it might be hard to get the list of files from the upload job [14:24:36] hadoop usually abstracts the actual files away from you [14:24:41] since it splits them up [14:24:52] depends a bit on how they are being written [14:25:34] heh for that option I guess what we could do is list the container instead and generate kafka messages with the files found inside [14:26:17] to me having a simpler client seems preferrable, but I don't have a clear picture of all uses cases either [14:26:35] aye, either way i think allowing listing will be ok, at least for debugging purposes [14:27:04] *nod* yeah [14:27:20] hm ok, still having trouble, even with .rlistings [14:27:29] i tried modifying with swift post [14:27:35] but also now created a new container [14:27:36] ottotest2 [14:28:06] with an upload [14:28:06] e.g. [14:28:08] /usr/bin/swift upload --header 'X-Container-Read:.r:*,.rlistings' ... [14:29:07] AH ha ok [14:29:13] just got it with ottotest0 with swift post [14:29:50] ok that's good, but i need it to work with swift upload and X- Container-Read [14:29:52] hm [14:30:37] the acl need to be set only when creating a new container though, could you detect that and swift post accordingly? [14:30:52] it might be that swift upload doesn't touch the container directly like post does [14:30:59] hm [14:31:04] could [14:31:47] ya ok ca ndo [14:32:46] godog: in that case, should I only pass X-Storage-Policy when creating the container with swift post [14:32:51] and not in the swift upload? [14:32:59] ok both new pdu towers installed in a3 without issue. we are about to reroute the infeeds and kill infeed b [14:33:01] that can't be changed per object anyway, riht? 
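For reference, the container setup this exchange converges on is roughly the following sketch. Names are the test ones from the conversation, the storage policy value is a placeholder, object paths are illustrative, and swift credentials (ST_AUTH/ST_USER/ST_KEY or -A/-U/-K) are assumed to already be set:

  # set the public-read ACL (plus listings) and the storage policy once, at container creation;
  # swift post creates the container if it does not exist yet
  swift post --read-acl '.r:*,.rlistings' --header 'X-Storage-Policy: <policy-name>' ottotest0
  # later uploads then do not need to touch container headers at all
  swift upload ottotest0 location_test1/part-00000
  # unauthenticated GET of a single object only needs .r:*
  curl 'https://ms-fe.svc.eqiad.wmnet/v1/AUTH_analytics/ottotest0/location_test1/part-00000'
  # an unauthenticated listing (e.g. with ?prefix=) additionally needs .rlistings
  curl 'https://ms-fe.svc.eqiad.wmnet/v1/AUTH_analytics/ottotest0?prefix=location_test1'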
[14:33:11] but it could end up being a, and then we have to replug it in and wait a minute before moving to b [14:33:18] also one of these will cause all mgmt to flap [14:33:22] ottomata: that's right, yeah set the policy and acl at container creation time and you should be done [14:33:31] (i admin logged this, just echoing in here) [14:45:01] side b in a3-eqiad migrated [14:46:45] ok, swaping the side a towers, mgmt didnt blink on side b [14:46:48] but it WILL on this [14:55:22] robh: dbproxy1003 went down apparently [15:00:18] robh: and it is now back [15:04:19] yep, no bueno [15:04:27] atleast it was a non-in-use server [15:04:37] but we double checked everything, that flap was after we had the new pdus in [15:04:42] and re-shuffling power [15:04:49] so, noted, and hopefully wont happen again [15:05:05] robh: fyi, there are still hosts in icinga with a message saying that PS are not redundant, is that expected? [15:05:35] some of them are clearing up now [15:05:55] marostegui: they all show green here [15:05:56] let me force an icinga check [15:07:05] robh: so the only outstanding things on icinga for now: elastic1031 with no PS redundancy and db1127 mgmt down [15:07:50] db1127 just cleared, elastic1031 remains [15:09:49] marostegui: elastic1031 is green on powersupplies but organge on error led [15:09:54] chris is checking [15:10:03] cool [15:10:04] thanks [15:10:47] ha [15:10:52] the psu error for that system is a psu fan [15:11:00] unrelated to our work, but is a psu failure in icinga [15:11:24] so maybe a ticket for our elastic sres? [15:11:29] chris is making one now [15:11:40] lovely thanks [15:15:49] ok, once chris is done making that task [15:15:58] we're ready to move onto a5 https://phabricator.wikimedia.org/T227141 [15:17:27] akosiaris: are you about and can you return your nodes in a3 to use [15:17:32] and drain nodes in https://phabricator.wikimedia.org/T227141 ? [15:18:00] robh: yeah, will do so now [15:18:35] thank you! [15:18:56] I won't be around for the next one though [15:19:03] probably that is [15:24:51] well, you have the directions on how to depool [15:24:58] just seemed easier to ping you when you were about =] [15:34:52] robh: done [15:46:19] ok, both new pdus mounted in the rack, old pdu still live but not mounted [15:46:25] we're going to kill side b of a5-eqiad now [15:56:47] ok side a done doing side b in a5-eqiad [16:07:54] a5 all done [16:08:18] correction, chris is cleaning up the cables but the pdu swap is done [16:19:11] akosiaris: ok you can return a5 items to service [16:19:14] we are all done there [16:19:45] a7 will be after lunch (in about an hour or so) [16:22:47] not sure how to migrate services back to a ganeti node [16:23:17] sudo gnt-node add it seems [16:23:27] wait, no thats if removed for a long period [16:23:36] we just migrated never removed, so not sure [16:23:54] Anyone other than akosiaris know how to migrate back to a node? [16:23:57] moritzm: ^? [16:26:00] robh: maybe you want 'gnt-instance migrate' ? [16:26:03] never done it though [16:27:05] thats to migrate away afaict [16:27:13] sudo gnt-node migrate -f ganeti1008 pushes ganeti1008 out of service [16:27:16] which is what alex did [16:27:20] but now i want ot bring it bakc [16:27:36] gnt-instance migrate is different from gnt-node migrate [16:28:00] oh, yes, sorry [16:28:01] you'd pick some of the instances -- the VM names -- that are supposed to be on ganeti1008 [16:28:05] oh, ok [16:28:14] i did not output what was there beforehand [16:28:16] so bleh. 
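A sketch of the Ganeti commands being puzzled out here, which the next few messages spell out; run on the cluster master, the instance name is a placeholder, and the hbal flags (Luxi backend, execute moves) should be checked against the wikitech Ganeti page before use:

  # instances that still list ganeti1008 as a node, i.e. candidates to move back onto it
  sudo gnt-instance list -o name,pnode,snodes | grep ganeti1008
  # live-migrate a single instance back between its primary and secondary node
  sudo gnt-instance migrate <instance-fqdn>
  # or skip the per-instance moves and rebalance the whole cluster later, as suggested below
  sudo hbal -L -X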
[16:28:27] * robh didnt migrate away and assumes alex has that list [16:28:32] so it can wait for him i suppose [16:28:45] a7 doesnt have a ganeti node to drain so we're just down 1 [16:29:16] bblack: heyas, you note that traffic has to be around for a7 [16:29:21] https://phabricator.wikimedia.org/T227143 [16:29:38] it is 12:30 here, can we do this in 90 minutes? [16:30:22] robh: yes [16:30:33] robh: this will tell you what could be running on ganeti1008 but presently isn't: sudo gnt-instance list -o name,snodes | grep ganeti1008 [16:30:43] anyway I would not worry too much about it [16:31:27] ok, we are going to go snag some lunch, back in less than an hour, then we'll start a7 [16:40:10] cdanis: makes sense. ack. thx [16:40:27] I actually think it's probably not worth doing any ad-hoc rebalancing [17:18:44] robh: just proceed, we can rebalance the Ganeti cluster when the PDU maintenance is over [17:18:57] it's a fairly long-running operation anyway [17:19:02] https://wikitech.wikimedia.org/wiki/Ganeti#Cluster_rebalancing [17:40:47] robh: ^ that [17:41:35] and cdanis is correct on that oneliner ofc. pro tip: list -o +snodes is ever better ;-) [17:41:47] ooh ty akosiaris [17:48:58] cool [17:49:03] ok, chris and i are back from lunch [17:49:13] we are goig to start work on a7-eqiad [17:49:27] bblack: let me know when we are ok to continue on a7 [17:49:31] since it specifically had lvs system [17:49:45] ref: https://phabricator.wikimedia.org/T227143 [17:52:34] also ping vgutierrez or ema i suppose as traffic folks =] [17:54:14] hopefully one of you is around or else we have to stop swapping pdus for today [17:54:21] non-ideal. [17:57:20] robh: Valentin is on holiday and I think that Ema is already out, probably Brandon is your best bet :) [17:57:39] yeah, i thought he would be around but it may be lunch time for him [17:58:05] yep [18:02:40] robh: I'm here now [18:02:57] cool, we are ready to work on a7 when you are =] [18:03:06] just prestaging other items while i was waiting [18:04:11] so if/when you can kill the lvs system that is live there, let us know! [18:04:20] we will shutdown the ms-be systems there after that and do the pdu work [18:07:11] robh: lvs/cp are ready to go (1013 traffic is failed over to 1016) [18:07:24] awesome, i'm killin gms-be systems now [18:07:26] is this a definite downtime or just a possible? [18:07:37] ah definite [18:07:39] ok [18:07:56] so should I soft-poweroff these cp/lvs? [18:09:10] uhhh [18:09:11] its possible [18:09:13] not definte. [18:09:16] definite even [18:09:24] we just shtudown ms-be because godog prefers it [18:09:29] oh ok [18:09:38] we'll just wait and see then [18:09:39] the ms-fe we depool and leave up [19:09:12] ok, returning a7 servers to service [19:09:54] ok to repool lvs/cp? [19:10:19] yep, we are done in there now [19:10:22] thank you [19:16:04] ok relocating out of dc floor to hotel a block away [19:16:10] back online shortly to schedule row b and stuff [19:20:26] robh: <_joe_> can we please suspend all maintenance for the day? [19:21:34] ^ [19:21:54] row B should hold for now, at least until we understand all the badness that correlated with the A7 timeframe [19:22:07] (which may or may not be related, but either way, the noise makes it extra confusing) [19:25:15] <_joe_> 10.64.0.83 can we confirm this is in A6? 
[19:25:41] so recapping some salient points that are hard to dig from IRC with all the noise in the past ~1.5h: [19:25:43] for the record A6 pdu wasn't touched today or the task isn't updated [19:25:46] _joe_: that is mc1022 which is indeed A6 [19:25:58] <_joe_> 10.64.0.80 - 84 are all giving timeouts to some appservers [19:26:19] re: A7 power work - work started ~18:00 with some depools and prep work, and the actual power cuts were ~18:33 and ~18:42 for the two separate legs [19:26:32] <_joe_> and that's when the problem started [19:26:46] the early depools were ms-fe and ms-be though, which seem kinda out there to be related [19:26:56] was there mw depools as well? [19:27:06] should we phone robh or Chris to make sure everything on site looks good? [19:27:18] <_joe_> cdanis: can you depool all appservers that are in A7 please? [19:27:29] the other early depool work was traffic: we depooled 1x node each in text-eqiad and upload-eqiad for A7, and also failed over LVS public text traffic from lvs1013 to lvs1016 (which seemed fine) [19:27:30] FWIW the overall rate of 50x does not look too bad [19:27:33] _joe_: rgr [19:27:54] on mw1270 - /var/log/php7.2-fpm/error.log - script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' .. executing too slow .. child .. stopped for tracing [19:27:58] I checked bandwidth usage on mc1022 with ifstat, it seems normal [19:28:03] <_joe_> mutante: that's ok [19:28:26] * akosiaris checked for network errors on mc1022 and net util and looks fine [19:28:31] ah snap [19:28:31] Jul 23 19:28:11 mw1271 mcrouter[919]: I0723 19:28:11.842501 1051 ProxyDestination.cpp:453] 10.64.0.81:11211 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error [19:28:38] _joe_: done [19:28:43] <_joe_> elukey: I am operating in the hypothesis there is some network problem in A7 at this point [19:28:44] about 5MB/s thatn a couple of hours ago, but otherwise ok [19:28:56] less than* [19:29:12] ok [19:29:15] but those 5MB/s less traffic on mc1022 might very well be related [19:29:28] XioNoX: ^ see above: 19:28 < _joe_> elukey: I am operating in the hypothesis there is some network problem in A7 at this point [19:29:35] drop seems to start at 18:51 [19:29:47] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=8&fullscreen&orgId=1&var-server=mc1022&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached [19:29:54] yeah the big badness is mostly ~18:[45]x and onwards [19:29:55] fwiw, regarding the scap upgrade. i upgraded it all on codfw mw hosts and some other machines, but not mw eqiad, and they did not involve any restarts [19:29:59] but there are bad signs earlier in graphs [19:30:06] <_joe_> cdanis: can you please !log your actions? [19:30:14] _joe_ the hard tko is something that I haven't seen in mcrouter before (only soft), I am not sure if the shard have been depooled for good [19:30:21] by some mw appservers [19:30:28] i am not ar home but is there sth i can help with for logstash? [19:30:29] _joe_: I did [19:30:35] <_joe_> elukey: it's possible. 
that appserver is in A7 though [19:30:49] godog: pretty sure logstash is tertiary fallout, it's being overwhelmed by other things going wrong and logging about it [19:30:53] <_joe_> mc1021 has a puppet failure too [19:31:09] bblack: kk, thanks for the context [19:31:11] godog: logstash appears to be a victim of whatever is going on [19:31:11] <_joe_> next mitigation possible is we remove the mc servers in A6 from the pool [19:31:12] (mostly the logstash impact is coming from memcache errors being logged, I think) [19:31:45] all PHP7 rendering alerts are green again except: mw1312 and mwdebug1002 [19:31:54] <_joe_> I see no change by depooling those servers though [19:31:58] <_joe_> in the mcrouter metrics [19:32:05] <_joe_> maybe I need to wait the moving average [19:32:07] the earliest hard graph evidence that seems definitely-related, that I've seen is the ~18:03 massive ramp-up in MC requests [19:32:21] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&panelId=41&fullscreen [19:32:26] em [19:32:27] https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=1563909907302&to=1563910330003 [19:32:30] what is this ^ ? [19:32:38] <_joe_> this https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&panelId=33&fullscreen&from=now-15m&to=now gives a pretty compelling story [19:32:58] analytics? [19:32:59] yep [19:33:15] there are two an-worker hosts on that rack [19:33:18] (A7) [19:33:30] <_joe_> shdubsh: can you downtime logstash? [19:33:34] <_joe_> so that it stops paging us [19:33:39] ack [19:34:00] <_joe_> ok, anyone is looking at connectivity between appservers and those mc hosts somehow? [19:34:08] network traffic from an-worker10{81,82} does not look alarming [19:34:09] <_joe_> please say what you're doing when you're doing it [19:34:18] looking if those 2 hosts justify those TCP retransmits [19:34:20] <_joe_> cdanis: let's focus on the appservers/mc issue [19:34:21] but the retrans rate is interesting right [19:34:34] cp1077 and cp1078 are two others in the same rack to peek at, if it's a rack-wide network issue [19:34:37] the retrans rate made me suspect rack switch saturation [19:35:10] )the big dropoutt of traffic on those two cp hosts is because they were depooled) [19:35:10] <_joe_> do A6 and A7 share some network apparatus that could saturate? [19:35:11] I am chasing down that lane (the analytics TCP retransmits), will report back here, please everyone, don't get sidetracked [19:35:24] _joe_: all An racks are part of a shared switch stack [19:35:50] each An has a separate top-of-rack switch, but they're all interconnected into one larger "virtual switch", they can affect each other for sure [19:36:16] <_joe_> Jul 23 19:23:52 mw1280 puppet-agent[38807]: Could not retrieve catalog from remote server: request https://puppet:8140/puppet/v3/catalog/mw1280.eqiad.wmnet interrupted after 0.864 seconds [19:36:27] <_joe_> mw1280 having issues connecting to the puppetmasters too [19:36:46] <_joe_> elukey: I think we should declare that mc rack lost for now and deploy a mcrouter config change [19:37:01] <_joe_> that will ease the pressure on logstash, at the cost of reduced caching [19:37:24] XioNoX: what does "Poller Time" mean? https://librenms.wikimedia.org/graphs/type=device_poller_perf/device=160/from=1563824100/ [19:37:37] is the switch taking literally 3x as long to respond to SNMP scraping? 
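The "depool all appservers in A7" step above is conftool work; a minimal sketch, assuming the usual confctl select syntax and using mw1280 purely as an illustrative host:

  # depool one appserver from every service it is pooled in
  sudo -i confctl select 'name=mw1280.eqiad.wmnet' set/pooled=no
  # and repool it once the rack is healthy again
  sudo -i confctl select 'name=mw1280.eqiad.wmnet' set/pooled=yes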
[19:38:00] <_joe_> if XioNoX is off, let's call faidon I guess [19:38:09] he was here 15 minutes ago... [19:38:09] https://librenms.wikimedia.org/graphs/to=1563910500/device=160/type=device_bits/from=1563888900/legend=yes/ [19:38:14] this graph can't possibly be real, right? [19:39:51] <_joe_> we keep getting [19:39:53] <_joe_> Jul 23 19:39:33 mw1280 mcrouter[9835]: I0723 19:39:33.730286 9841 ProxyDestination.cpp:453] 10.64.0.82:11211 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error [19:39:55] <_joe_> Jul 23 19:39:37 mw1280 mcrouter[9835]: I0723 19:39:37.800527 9841 ProxyDestination.cpp:453] 10.64.0.82:11211 unmarked TKO. Total hard TKOs: 0; soft TKOs: 0. Reply: mc_res_ok [19:39:59] rescheduled icinga service checks on the 2 remaining hosts showing rendering alerts. did not go away. but also no new ones showing up. mw1312, mw1270, mwdebug. should i try to restart hhvm on 1312 ? - Fatal error: entire web request took longer than 60 seconds [19:40:09] <_joe_> yes [19:40:12] <_joe_> please do [19:40:30] !log restarting hhvm on mw1312 [19:40:30] mutante: Not expecting to hear !log here [19:41:08] <_joe_> ok so [19:41:15] <_joe_> it's clear from logs from different servers [19:41:17] got your ping _joe_, I was checking metrics [19:41:21] <_joe_> that servers in any rack/row [19:41:39] <_joe_> are having the same issues connecting to the same servers [19:41:41] <_joe_> Jul 23 19:40:30 mw1340 mcrouter[48804]: I0723 19:40:30.489393 48810 ProxyDestination.cpp:453] 10.64.0.82:11211 marked soft TKO. Total hard TKOs: 0; soft TKOs: 1. Reply: mc_res_timeout [19:42:01] it is weird since telnet work from mw to mc [19:42:03] <_joe_> this server is in row c [19:42:09] RECOVERY - OSPF status on cr2-eqdfw is OK [19:42:19] <_joe_> elukey: but maybe you can't retrieve a big blob of data [19:42:22] <_joe_> before the timeout [19:43:06] this is mc1022 https://librenms.wikimedia.org/device/device=160/tab=port/port=14186/ [19:43:30] there are big holes [19:43:32] I suspect there are some monitoring artifacts in librenms because of network issues [19:43:44] should we contemplate a DC failover? [19:43:45] but there are big holes and there is a completely unreasonable spike of packets and traffic [19:43:48] for asw2-a-eqiad [19:44:01] how confident are we currently on codfw-switchover stuff? [19:44:12] <_joe_> bblack: not sure about the databases [19:44:22] <_joe_> but for everything else, we shall be ok-ish? 
[19:44:32] I can move the traffic edge stuff out of eqiad, it may help in some indirect sense in reducing general network load there [19:44:47] <_joe_> frankly I'd first cordon off the mc servers that are showing issues [19:44:48] (and getting rid of some reqs to active/active services in eqiad) [19:44:50] cdanis: yeah but I don't see it for mc1030 https://librenms.wikimedia.org/device/device=162/tab=port/port=14758/ [19:44:55] ok [19:45:04] elukey: it's on asw2-c-eqiad [19:45:10] so something happened with asw2-a I think [19:45:19] (or connectivity *from* librenms *to* asw2-a) [19:45:27] mw1280 - could run puppet just fine manually - we have been seeing some of those intermittent puppet issues before this issue started [19:45:37] mc1021 as well https://librenms.wikimedia.org/device/device=160/tab=port/port=14185/ [19:45:41] that is A5 [19:45:54] <_joe_> mutante: it depends if we have a network issue ongoing during the puppet run [19:46:01] <_joe_> network issues are intermittent [19:46:18] elukey: assuming that the letters in the access switch hostname also means row, I suspect you'll see the same for the port of any mc host in row A [19:46:29] _joe_: ack, just saying we have seen some of them before the main issue started [19:46:40] _joe_: TKOs are not as bad as they were 20 minutes ago, at least [19:46:48] <_joe_> elukey: what do you think of my proposal? [19:47:03] cdanis: yes (the letters) [19:47:26] <_joe_> timeouts haven't changed significantly [19:47:27] current appservers showing mcrouter issues, in descending order by current TKOs: mw1312 mw1327 mw1329 mw1264 mw1326 [19:47:29] for what is worth, the retransmits on the analytics cluster seems to be a byproduct of the entire cluster running at fully network capacity right now. Most an-workers (e.g. an-worker1078 up to an-worker1095 and from analytics1042 up to analytics1077 have network traffic upwards of 60MB/s and usually in the 120MB/s [19:47:34] there are more but those are the large ones [19:47:38] elukey: does this make any kind of sense? [19:47:43] _joe_ I agree with proceeding but if we establish first what is the impact to mediawiki/dbs/etc.. My point is that if we currently see timeouts but things can allow us for some debugging, better to proceed, otherwise let's remove the shards [19:47:46] why would these hosts move SO MUCH DATA around ? [19:48:04] akosiaris: distributed filesystem and lots of shuffling, that's the nature of mapreduce [19:48:14] akosiaris: I can check, there is probably a big job running, but usually it is not an issue [19:48:15] <_joe_> cdanis: can you extract the tkos by remote host IP too? [19:48:17] it seems to me that almost all are saturating their network interfaces [19:48:25] <_joe_> I'm pretty sure they're all the same 5 servers [19:48:42] <_joe_> elukey: I'd kill such a job in this moment maybe [19:48:50] actually, elukey scratch that. You are on the MC front [19:48:56] is otto around? [19:49:11] <_joe_> akosiaris: not in this room, go fetch him :) [19:49:26] not on any room currently it seems [19:49:29] _joe_: 10.64.0.174 10.64.32.48 10.64.32.66 10.64.0.59 10.64.32.47 [19:49:45] hey [19:49:48] <_joe_> cdanis: those are the memcached? [19:49:51] akosiaris: I can check now, it will be quick for me. You are saying that all hosts are saturing their ports? or the row a ones? [19:49:53] _joe_: these are appservers [19:49:54] <_joe_> uhm. 
not what I see [19:50:06] <_joe_> cdanis: no I mean the IPs of the memcached servers going tko [19:50:09] so [19:50:10] paravoid: Memcache request rates spiked up early on (~18:00), various things go worse over the 18:xx hour, there were lots of MW servers failing rendering checks from icinga, all in A7, around the worst of it coming on. [19:50:15] elukey: all I think [19:50:16] can anybody check what is the current overall impact? [19:50:18] I'm looking at asw2-a logs [19:50:25] various A7 hosts have obvious signs of network problems: TCP retrans graphs, puppet failing to connect, etc [19:50:26] I mean to mediawiki requests etc.. [19:50:35] <_joe_> elukey: it's small right now [19:50:37] lots of stuff happened (which I can go in detail in a bit), but ended at 19:27 [19:50:40] <_joe_> it was worse at some point [19:50:44] <_joe_> we have some added latency [19:50:47] i.e. there is nothing useful in the logs past 19:27 [19:50:54] elukey: look at https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&var-instance=All&panelId=84&fullscreen [19:50:54] ok [19:51:02] <_joe_> paravoid: but we still see this for instance [19:51:09] <_joe_> https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&panelId=33&fullscreen&from=now-30m&to=now [19:51:10] paravoid: ok, good to know, we're still seeing mcrouter issues fwiw [19:51:18] and the network row that gets expanded to all hosts [19:51:20] memcached general request rate is still elevated: [19:51:22] https://grafana.wikimedia.org/d/000000316/memcache?panelId=41&fullscreen&orgId=1 [19:51:30] <_joe_> the 5 servers in a specific rack have higher timeouts [19:51:40] A6 or A7? [19:51:41] which rack? 
[19:51:48] <_joe_> let me check [19:52:15] <_joe_> A6 [19:52:26] so a cross-switch (= cross-rack) switch was going up and down [19:52:29] <_joe_> mc1019 to mc1023 [19:52:36] FPC6 indeed [19:52:40] <_joe_> they're still showing problems [19:52:43] _joe_: I'll have a list for you of most-TKO'd memcacheds in a minute [19:52:49] <_joe_> cdanis: <3 [19:52:50] A6 task doesn't have any update about pdu work there [19:52:55] we did not work on A6 [19:53:01] dbprov1001 (A7) shows PSU critical alert since about 20 minutes [19:53:03] <_joe_> marostegui: it's a fallout from other work, clearly [19:53:13] akosiaris: I don't see any big job running but I'll keep checking [19:53:20] mutante: we fixed dbprov1001 like 30 minutes ago, lemme check the host directly [19:53:24] Jul 23 18:50:16 asw2-a-eqiad vccpd[1867]: interface vcp-255/0/50 went down [19:53:27] Jul 23 18:50:16 asw2-a-eqiad vccpd[1867]: Member 7, interface vcp-255/0/50.32768 went down [19:53:30] Jul 23 18:50:17 asw2-a-eqiad vccpd[1867]: interface vcp-255/0/50 came up [19:53:33] Jul 23 18:50:17 asw2-a-eqiad vccpd[1867]: Member 7, interface vcp-255/0/50.32768 came up [19:53:36] Jul 23 18:54:51 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:53:39] Jul 23 18:54:51 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:53:42] Jul 23 18:56:57 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:53:43] mutante: its old error, its clear on the host [19:53:45] Jul 23 18:56:58 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:53:48] Jul 23 19:09:09 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:53:51] Jul 23 19:09:09 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:53:52] mutante: maybe reforce the icinga check [19:53:53] "member 6" etc above, is the rack number [19:53:54] Jul 23 19:13:52 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:53:57] Jul 23 19:13:53 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:54:00] Jul 23 19:25:00 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:54:03] Jul 23 19:25:00 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:54:06] Jul 23 19:27:28 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:54:09] Jul 23 19:27:29 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:54:27] correct [19:54:33] ah snap that is A6 where the mc hosts are [19:54:34] <_joe_> elukey: to respond to your question [19:54:35] all VC ports are up right now [19:54:46] correct [19:54:51] <_joe_> it's mostly https://grafana.wikimedia.org/d/000000066/resourceloader?refresh=5m&panelId=45&fullscreen&orgId=1 resourceloader being slow [19:54:54] it is possible this VC link is having errors [19:54:57] broken fiber or something [19:55:07] trying to figure out how to check that from the switch, these are kind of special interfaces [19:55:12] we still have elevated latency in A6 apparently, is it possible there's still some network-level fallout happening? (some multicast storm or packet loss or elevated latency due to crappy VC routing, etc?) 
[19:55:19] thanks _joe_ [19:55:29] this is the FPC 6 <-> FPC 7 link, FPC 7 is one of our two spines [19:55:42] and FPC7 is in the rack with the PDU work [19:55:42] elukey: ah, it's all 10G hosts, they can't be saturating their interfaces at 150MB/s [19:55:48] paravoid: I can disable that link if needed [19:56:04] 150MB/s > 10Gb/s [19:56:09] (they can be saturating) [19:56:25] oh wait my math was off by a zero lol [19:56:34] 150MB == 1.2Gb [19:56:38] akosiaris: not all have 10G, but a lot of them for sure [19:56:39] for a moment I thought my sleepiness got me [19:56:41] still, could be pps rate [19:56:55] elukey: sudo cumin 'an-worker*' 'ethtool enp130s0f0' [19:56:58] those for sure do [19:57:03] checking the other set [19:57:11] ah yes the newer ones, analytics* should not [19:57:12] XioNoX: can you check librenms to see if i) this specific VCP ii) other VCP are a) saturated b) showing errors? [19:57:29] lot of thoses in the logs: [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 54 UP [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 [EX-BCM PIC] phy_40g_cr4_an_status : Port 54 mii_status = 0x88, ll_adv = 0x1, lp_adv = 0x0 [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 [EX-BCM PIC] phy_40g_cr4_an_status : Port 54 pause resolution = 0, ll_p = 0x0 lp_p = 0x0 [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 [EX-BCM PIC] ex_bcm_cr4_get_remote_pause: GET REMOTE PAUSE = 0x0, port 54 [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 BCM Error: API bcm_port_advert_remote_get(device, port, &ablity) at ex_bcm_get_remote_ability:716 -> Operation disabled [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 [EX-BCM PIC] ex_bcm_pic_get_an_info: Failed to get the remote ability for Rear QSFP+ PIC port 0 [19:57:30] Jul 23 19:27:29 asw2-a-eqiad fpc6 [EX-BCM PIC] ex_bcm_pic_ifd_config: vcp-255/1/0, enable - 1 [19:57:43] paravoid: VCP don't show up in librenms [19:57:51] _joe_: https://phabricator.wikimedia.org/P8787 [19:57:53] <_joe_> elukey: can you prepare a patch removing those hosts from memcached at least? [19:57:58] sure [19:58:06] those are the IPs for which mcrouter reports the most errors over the past ~20 minutes [19:58:16] robh/marostegui: ack. done. rescheduled icinga check - icinga is just overloaded i think. everything takes longer [19:58:16] the most TKOs that is [19:58:26] <_joe_> cdanis: thanks [19:58:37] what's the evidence we have of network-related issues right now, rather than this being a secondary/cascading effect? [19:59:03] we see high-level (= mcrouter) errors, has anyone confirmed network-level errors? packet loss, latency etc.? [19:59:26] <_joe_> paravoid: not right now, no, we had multiple signals before [19:59:37] when is before? and what kind of signals? [19:59:42] <_joe_> including puppet failing with broken pipes or connection resets [19:59:46] mutante: thanks [19:59:57] from mw1312 (the remaining host with rendering alerts besides mwdebug) i ping the puppetmaster1001 and i have about 7% packet loss [20:00:10] mutante: thank you! [20:00:13] paravoid: also some monitoring artifacts in librenms like https://librenms.wikimedia.org/graphs/to=1563910500/device=160/type=device_bits/from=1563888900/legend=yes/ [20:00:16] <_joe_> before == 19:25 - ish [20:00:20] XioNoX: thoughts? 
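One way to produce a "most TKO'd shards" list like the paste above, shown as a sketch rather than how P8787 was actually generated; it assumes mcrouter logs reach the journal under a unit named mcrouter:

  # count TKO events per memcached shard on this appserver over the last 30 minutes, worst first
  sudo journalctl -u mcrouter --since '30 min ago' \
    | grep -E 'marked (hard|soft) TKO' \
    | grep -oE '10\.64\.[0-9]+\.[0-9]+:11211' \
    | sort | uniq -c | sort -rn | head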
[20:00:24] https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?orgId=1&from=now-3h&to=now [20:00:24] this was mentioned before as well https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=now-3h&to=now [20:00:30] ^ if you exclude labs on there, you can see better [20:00:39] <_joe_> herron: thanks that too [20:00:43] shall we disable fpc6: vcp-255/1/0? [20:00:55] parsoid has UDP inerrors from 19:45 - 19:49 [20:01:14] and analytics TCP retrans is still ongoing, started looking bad at 19:27 [20:01:15] chaomodus: is there an easy way to translate a set of IPs or a set of hostnames into racks (e.g. A6) from the command line? [20:01:31] dunno what hosts they are but I'm seeing spikes of traffic https://librenms.wikimedia.org/device/device=160/tab=port/port=14063/ [20:01:42] but that could be fallout (after 19:27 switch recovery, analytics is sending way more data and facing natural retrans) [20:01:47] elukey: are the analytics and an-worker hosts part of the same cluster? That would explain many of the TCP retransmits. if analytics1* hosts are saturating their interfaces and an-worker1* hosts try to talk to them it makes sense [20:02:11] akosiaris: yes they are, the analytics* ones are going to be refreshed this year, but part of the Hadoop cluster [20:02:27] cdanis: not command line but i could whip something up [20:02:35] tell me ds&rack [20:02:47] the psike of traffic above is analytics1058 [20:02:58] so yeah, I don't know if there's hard evidence of any network-level issue past 19:27 (as in, caused by faulty network in the present) [20:03:09] bblack: there is, see mutante's info above [20:03:31] ah there [20:03:39] there are alerts for " timed out before a response was received" for 2 restbase hosts, 1021 and 2010. thats A6 and B8 [20:03:43] 6-7% packet loss between mw1312 (A6) <-> puppetmaster1001 (B6) [20:03:54] robh: here? [20:03:56] elukey: ok that explains it then. grafana says also this has happened in the past (2019-07-20, 2019-07-19, 2019-07-11, so recently) in similar scales, so it's probably just an unlucky coincidence that it happened now [20:04:02] * akosiaris closing that avenue [20:04:21] paravoid: yes [20:04:29] <_joe_> paravoid: I don't see packet loss right now, but it might be a coincidence [20:04:30] robh: did you or chris at any point in time unplug/plug a cross-switch fiber? [20:04:42] _joe_: I reproduced it just seconds ago [20:04:49] Not that I am aware of, I did not unplug any fibers at all. [20:04:49] robh: and/or is it possible you damaged a fiber? [20:04:54] <_joe_> heh me too now [20:05:09] okay, so the one thing we do know is still happening is the mcrouter TKOs [20:05:13] ok, my hypothesis is that a fiber was broken in the process of this and it's causing packet loss [20:05:16] can we locate the erroring fiber by looking at VCP ports for interface error stats? [20:05:18] <_joe_> cdanis: and the packet loss [20:05:18] the memcacheds with the most TKOs across all appservers are on A6 [20:05:28] unfortunately I can't seem to extract information out of the switch for VCP ports [20:05:31] diagnostics say "Not-Avail" [20:05:32] ok [20:05:36] so I'm going to assume it's the fiber that was flapping [20:05:41] mc1023 mc1021 mc1020 mc1022 mc1019 are 4 of the top 5 TKOd memcacheds and they are all on A6 [20:05:42] and let's disable it [20:05:51] XioNoX: thoughts? agree/disagree? [20:06:02] paravoid: do you have an idea about the *start* time for this switch/fiber-level issue? 
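On the "translate hostnames into racks from the command line" question: the rack lives in Netbox, so something like the following works against its REST API. This is a sketch; the Netbox hostname, the need for an API token, and the use of jq are all assumptions:

  for h in mc1019 mc1020 mc1021 mc1022 mc1023; do
    curl -s -H "Authorization: Token $NETBOX_TOKEN" \
      "https://netbox.wikimedia.org/api/dcim/devices/?name=$h" \
      | jq -r '.results[] | "\(.name) \(.rack.name)"'
  done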
[20:06:11] 18:50:16 [20:06:15] ok [20:06:20] does that match? [20:06:22] <_joe_> that's when the bigger issues started [20:06:24] paravoid: sure, let's try it [20:06:26] <_joe_> but we had some issues before [20:06:30] yeah it matches the big increase in issues [20:06:33] <_joe_> starting from 18:00 [20:06:33] cdanis: there is also https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&panelId=33&fullscreen&from=now-3h&to=now to see it visually [20:06:37] but MC had elevated stats back to ~18:00 [20:06:47] cdanis: i have the list if you want it for a6 [20:06:59] paravoid: Do you want me to go back onsite? Ive been standing by for that. [20:07:06] elukey: ahaha now I feel a fool [20:07:07] robh: in a bit maybe, triaging [20:07:10] XioNoX: faidon@asw2-a-eqiad> request virtual-chassis vc-port set interface member 6 vcp-255/1/0 [20:07:14] confirm? [20:07:19] er [20:07:21] XioNoX: faidon@asw2-a-eqiad> request virtual-chassis vc-port set interface member 6 vcp-255/1/0 disable [20:07:25] set disable I guess ? [20:07:26] and reminder, there were software-level things going on circa 18:00 too (some deploy traffic, and a specific issue with an MW hosts having bad scap and depooling, then later a scap software upgrade of some kind, I don't know what else) [20:07:26] bad paste, sorry [20:07:28] ok [20:07:36] we could have multiple concurrent issues stepping on each other [20:07:54] XioNoX: ? [20:08:11] paravoid: lgtm [20:08:14] <_joe_> bblack: right, so let's first try to exclude further network problems [20:08:18] _joe_: I'm looking at shunting memcached logs to logstash to /dev/null in an attempt to bring back the logging cluster. Sound good? [20:08:28] done [20:08:45] <_joe_> shdubsh: let's wait a couple minutes, then you have my +1 [20:08:51] <_joe_> it's mediawiki errors though [20:08:53] paravoid: how is that packet loss? from puppetmaster? [20:08:56] <_joe_> not coming from memcached [20:08:58] re-running pings from mw1312, no packet loss [20:09:03] that I can see of [20:09:17] _joe_: right, I want to install a selector in the mediawiki filter that dumps the memcached channel [20:09:17] <_joe_> I see timeouts going down, but it's to early to call [20:09:19] things should be getting better [20:09:19] I am tailing mw1274's mcrouter logs, don't see more errors [20:09:28] mcrouter timeouts have nosedived [20:09:31] <_joe_> shdubsh: it might not be needed [20:09:31] as have tcp retransmits [20:09:33] <_joe_> yeah [20:09:35] alright [20:09:37] ok [20:09:40] wow we are recovering [20:09:43] :) [20:09:48] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&panelId=33&fullscreen&from=now-3h&to=now [20:09:52] \o/ [20:10:09] rendering on mw1312 and mwdebug recovered [20:10:12] right after you did that [20:10:14] damn, called it 16 minutes ago, should had gone with my instinct :) [20:10:16] it'd be neat to programmatically generate rack-level grafana dashboards [20:10:51] _joe_: no more rendering alerts [20:10:58] paravoid: so one of the fibers that makes up the virtual switch chassis got damaged? [20:11:05] paravoid: good catch [20:11:05] that is my current theory, yes [20:11:28] does the Juniper switch not.. like.. notice high BER or something and disable automatically? [20:11:34] no [20:11:43] <_joe_> ok, I'm going off, it's late and I'm tired. Can someone try to write down a timeline? 
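To recap the switch-side commands used or discussed in this incident, all run from the Junos CLI on asw2-a-eqiad: show virtual-chassis status maps member numbers to racks, show virtual-chassis vc-port all-members shows the state of the inter-switch (VCP) links, the request/show pair for diagnostics optics populates and then displays optics data, and the final request takes the damaged A6<->A7 link out of service. Exact option ordering may vary by platform; treat this as a sketch rather than a verified runbook:

  show virtual-chassis status
  show virtual-chassis vc-port all-members
  request virtual-chassis vc-port diagnostics optics
  show virtual-chassis vc-port diagnostics optics all-members
  request virtual-chassis vc-port set interface member 6 vcp-255/1/0 disable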
[20:11:49] and disabling it does not affect the virtual chassis? [20:11:57] elukey: we have redundancies [20:11:58] elukey: presumably it has less capacity now [20:12:04] <_joe_> elukey: your new mcrouter metrics are amaizing btw [20:12:11] _joe_ <3 [20:12:12] <_joe_> kudos on that work, it paid off tonight [20:12:21] https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=now-1h&to=now [20:12:27] ^ the analytics TCp retrans dropped off [20:12:32] ok question now is [20:12:35] do we replace that fiber now? :) [20:12:53] I'm off too [20:13:11] (question being whether robh goes back to the DC) [20:13:12] interestingly, the elevation in memcached request rates that started closer to 18:00 has now dropped off on switch fix as well [20:13:15] <_joe_> shdubsh: did logstash recover too? [20:13:24] paravoid: we're running without redundancy, and we're going to be doing rack work so I'd say asap [20:13:29] which makes me think the fiber issue goes all the way back there and we just don't know it yet [20:13:30] <_joe_> bblack: it's a bit of a thundering-herd effect [20:13:36] <_joe_> yep [20:13:39] well we're not doing any more rack work until this gets fixed for sure [20:13:45] robh: ^, please ack :) [20:13:47] bblack: that's retransmits for any reason, right? do we have graphs of NACK rate? [20:13:49] _joe_: events appear to no longer be building up in kafka, but we're a long way from consuming that backlog [20:13:55] (wondering if these were timeouts or if they were checksum errors) [20:14:03] cdanis: yes any [20:14:05] <_joe_> shdubsh: if you want to drop the memcached errors, just do so [20:14:13] ack [20:14:23] paravoid: issue is how to test the fiber? replace it and hope the issue doesn't show up again?? [20:14:37] <_joe_> ok I'm really off, I count on bblack or cdanis for writing the incident report! [20:14:40] robh: we have (what I think is) a damaged fiber between the switches of A6 and A7; we've disabled that link, but until that fibers gets replaced and the link gets enabled again, no more PDU swaps or any work that gets even remotely in the vicinity of fibers :) [20:15:03] XioBoX: there should surely be a way to see basic interface stats on the VCP links? like error rates and byte rates, etc? I don't get why we couldn't earlier. [20:15:10] err XioNoX ^ :) [20:15:12] +1 to bblack's question [20:15:12] paravoid: so no 10g racks can have pdu work? [20:15:15] do we have replacement for it on site ? [20:15:21] robh: no row A racks at all [20:15:24] bblack: actually it's traffic that dropped off https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=84&fullscreen&orgId=1 [20:15:37] oh, well, we are done with a until we can do the racks with db masters and the two network racks anyhow [20:15:40] but understood [20:15:43] and started dropping off about 3-4 mins before paravoid submitted the change on the switch [20:15:49] until we fix this [20:15:53] I am still thinking it's just coincidence [20:15:57] akosiaris: ah yeah, probably because kafka is in the error chain too [20:15:58] akosiaris: I will now conjecture that the retransmits were because of the bad fiber causing checksum errors (causing a lot of retransmits) [20:16:29] robh: also for the next racks, please plan for staying in the DC for at least half an hour after, just in case [20:16:33] bblack: yeah that data is exposed via snmp, so we could add that feature to librenms [20:16:50] XioNoX: we should for sure -- have you figured out a way to get them from the CLI? 
[20:16:57] I tried faidon@asw2-a-eqiad> show virtual-chassis vc-port diagnostics optics all-members [20:17:07] oh I see [20:17:10] neither kafka nor the bad fiber explains that amount of data (peaked at 3.7GB/s) [20:17:12] wilco [20:17:12] it seid [20:17:14] said* [20:17:15] Virtual chassis port: vcp-255/1/0 [20:17:15] Optical diagnostics : Not-Avail (run request) [20:17:20] which at the time didn't realize what it meant [20:17:20] paravoid: so optics errors no [20:17:24] but it's late and I am quite possibly wrong [20:17:40] paravoid: but interface level stats (usage, errors, etc) yes [20:17:40] but I think now it meant "request virtual-chassis ..." instead of "show virtual-chassis ..." [20:17:53] so is the answer just live with it tonight and first thing tomorrow onsite work with paravoid or arzhel to fix? [20:17:59] akosiaris: with a high enough bit error rate, a retransmit in the case of checksum failure is a positive feedback loop [20:18:07] or go there tonight? (either is fine i just want to know) [20:18:09] ;] [20:18:17] lol and now it is: [20:18:17] Optical diagnostics : N/A [20:18:20] gee thanks [20:18:47] cdanis: sure, but history says it's happening pretty often. e.g. https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=84&fullscreen&orgId=1&from=now-30d&to=now and https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=now-30d&to=now [20:18:50] yeah I've never been able to get the stats [20:19:00] if you run with request: `vc-port Diagnostics Optics Done (run show)` [20:19:01] XioNoX: interface level stats from the CLI, how? [20:19:09] errors etc. [20:19:09] that's my main reason for saying this is probably just coincidence [20:19:14] could be wrong though [20:19:29] s/e.g./i.e./ [20:19:34] XioNoX: also thoughts on whether to replace now or tomorrow? [20:19:40] (see robh's question above) [20:19:55] i can confirm we have seen some of the intermittent puppet errors yesterday [20:20:04] paravoid: which racks was that fiber connecting? [20:20:09] akosiaris: that past occurrence of 800k/sec retransmits on cache_text scares me [20:20:09] akosiaris: A6 to A7 [20:20:22] other than the possibility that the fiber issue started much earlier than we think, I still don't have a good explanation for the MC request rate pattern here (now recovered): [20:20:24] paravoid: no preference, myabe tomorrow so we can look more at which stats we have [20:20:26] https://grafana.wikimedia.org/d/000000316/memcache?panelId=41&fullscreen&orgId=1 [20:20:34] akosiaris: oh lol nvm I know what that one was 🤦 [20:20:35] but today we can too [20:20:36] i had started a scap upgrade when this issue happened. which is now rolled out partially. i would like to finish it so it's not mixed versions but i also don't want to add noise and maintenance until we are in the clear. i will wait for now. [20:20:43] cdanis: I was about to point out ;-) [20:20:55] can stil replace the fiber today and enable it tomorrow [20:21:28] XioNoX: I think nobody's at the DC at present [20:21:36] ah, then tomorrow yeah [20:22:10] robh said he's at a hotel a block away and offered to go, but yeah, let's not do it if we are not to use it :) [20:22:27] ah, I see a7 had pdu work today so it makes sense. [20:22:41] shdubsh: basically everything recovered except logstash is refusing connections. is that you getting rid of mc logs? 
[20:22:50] akosiaris: yeah, basically: [20:22:54] Jul 23 18:34:43 asw2-a-eqiad alarmd[2039]: Alarm set: PS SFXPC color=YELLOW, class=CHASSIS, reason=FPC 7 PEM 1 is not powered [20:23:03] [...] [20:23:03] Jul 23 18:35:59 asw2-a-eqiad alarmd[2039]: Alarm cleared: PS SFXPC color=YELLOW, class=CHASSIS, reason=FPC 7 PEM 1 is not powered [20:23:06] bblack: the part that I know is that when mcrouter reports a TKO for a shard, all the commands for that shard get "blackholed" for 4s, then a health probe is sent to the shard and if positive it is re-inserted in the pool. In the meantime mediawiki thinks that something is weird, and for example starts to react (trying to set a value that it thinks is not there, etc..) [20:23:16] Jul 23 18:43:34 asw2-a-eqiad alarmd[2039]: Alarm set: PS SFXPC color=YELLOW, class=CHASSIS, reason=FPC 7 PEM 0 is not powered [20:23:17] mutante: logstash is choking on the logs built up over the course of the incident. we're looking at unclogging it now [20:23:22] mutante: logstash is still mega backlogged because the messages are queued in kafka [20:23:22] [...] [20:23:22] Jul 23 18:44:14 asw2-a-eqiad alarmd[2039]: Alarm cleared: PS SFXPC color=YELLOW, class=CHASSIS, reason=FPC 7 PEM 0 is not powered [20:23:25] [..] [20:23:33] shdubsh/cdanis: ack! [20:23:39] okay [20:23:46] so _joe_ said he was coordinator but did not actually start a document [20:23:46] Jul 23 18:50:16 asw2-a-eqiad vccpd[1867]: interface vcp-255/0/50 went down [20:23:48] has anyone else? [20:23:53] we should be assembling a timeline [20:24:11] cdanis: I think that nothing was created, we should start a new one [20:24:24] nope. And _joe_ should not have been the coordinator per our notes [20:24:27] TZ and all [20:24:30] cdanis: how do we call it? "memcached" ? [20:24:56] the above means: rack 7 switch's first PSU went down at 18:34, then up at 18:35, then second PSU at 18:43, and back at 18:44 [20:25:00] and should not have actively dug into it either ofc [20:25:01] no, memcached was the victim [20:25:05] virtual chassis link partial failure [20:25:10] https://etherpad.wikimedia.org/p/2019-07-23-eqiad-asw-a [20:25:12] ahhaha poor memcached [20:25:14] the VC link went down at 18:50, but not that [20:25:19] note that* [20:25:26] when I started investigating it was reporting as up [20:25:45] (but we had packet loss, signaling a damaged-but-working fiber) [20:25:48] cdanis: except the - in the date [20:26:04] so it is possible that it was damaged before as well, and had packet loss before 18:50Z [20:26:19] mutante: it will live on the wiki eventually, this is a transient title :) [20:26:38] so if it existed before, i have a theory. the top of the racks have a large fiber duct, that may get banged when moving cables, but the fibers are protected inside [20:26:46] cdanis: nevermind, i was already on wikitech [20:26:51] if the fiber was already bad though, just shifting it from that could cause this issue. [20:26:52] ;D [20:27:00] ie: this is the not really onsites fault theory ;D [20:27:12] fairly unlikely :) [20:27:16] but yeah, we didnt really unplug fibers im aware of and chris was announcing what he unplugged. [20:27:17] we weren't having errors before [20:27:28] oh, then yeah something must have crushed a fiber. 
[20:27:34] I don't think it was unplugged, I think it's damaged [20:27:39] it's the fiber between A6 and A7 [20:28:01] hrmm, maybe when the pdus wre moved in or out it pinched it against the side of the rack rails [20:28:37] so [20:29:04] it sounds like we've decided above to postpone the fiber replacement tomorrow [20:29:18] robh: please plan to do that first thing (and coordinate with XioNoX) [20:29:23] no other work should happen before that [20:29:46] grrrrr I chanserv won't let me op myself here or set the topic [20:29:51] and please communicate that with cmjohnson too -- is there a chance he would go earlier than you and start working on it unaware of today's fallout? [20:30:53] paravoid: no other b row work either? [20:30:57] also, given we've had issues around these replacements for two days in a row, perhaps we tuned a bit too much towards "move fast and break things" and need to pace ourselves :) [20:30:58] or just a? [20:31:17] I'd say no other work, let's restore redundancy in the DC network before anything else [20:31:24] we lost a couple of servers and that was with being very careful [20:31:24] +1 [20:31:31] i dont think we could have prevented it better for those [20:31:39] but the fiber is something else entirely and i cannot argue that one [20:32:00] well, i rather not sit on my hands all day [20:32:04] well regardless of the prevention and related discussions, in the state we're in now, we should definitely back off until we've resolved current issues. [20:32:06] XioNoX: what is the earliest your time you wanna do this? [20:32:47] since you are 3 hours earlier in your day [20:33:09] robh: also please plan to stick around for a while (say, half hour) after that kind of major work is done [20:33:23] paravoid: i acknowledged that request earlier but you must have missed in backlog [20:33:28] wilco [20:33:28] ah yes I did [20:33:29] thanks :) [20:33:35] sorry, lots going on :) [20:33:43] cmjohnson1: oh hey [20:34:01] mutante: LOL at your etherpad name [20:34:27] cdanis: lol, i forgot about that [20:34:32] logging off if not needed o/ [20:34:38] paravoid hi [20:35:37] robh: I can do 11am eastern time maybe 1h early [20:36:03] cmjohnson1: not sure if robh filled you in, should I repeat? [20:36:15] he has filled me in [20:36:18] no need to repeat [20:36:21] just means i kinda waste a morning in eqiad not doing any eqiad work [20:36:24] but ok [20:36:33] there's plenty to be done isn't there? [20:36:46] we can do decoms [20:36:47] that doesn't involve recabling etc. [20:36:49] sorry, its unclear [20:36:53] i thought you meant no work [20:36:56] you mean just no pdu work? [20:37:00] its been a long day [20:37:02] I meant anything that risks fibers [20:37:11] PDU work, or recabling work, stuff of that nature [20:37:17] ok [20:37:26] or work that risks entire rows, etc. [20:37:56] with row A missing a link, we're on shaky ground, basically. [20:38:07] if that slows you down, I think we can do it now too - we avoided it in order to avoid asking you to go back and work some more :) [20:38:28] nah if we can work on normal hardware backlog that is also acceptable use onsite [20:38:39] there is more than 3 hours of that [20:38:45] mutante: can you please set a fixed time range on the grafana links you're adding [20:38:46] heh :) [20:38:52] way more than 3. [20:38:55] this doesn't make sense to me, shouldn't we see some interface drops or something on the hosts that were affected? 
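On the "shouldn't we see interface drops on the hosts" question: the usual host-side checks are below, but as discussed they come up clean here because the corruption happened on a switch-to-switch link, so only the switch would increment an error counter; TCP retransmits are the part that is visible host-side. The interface name is illustrative:

  # NIC-level drops/errors as seen by the kernel and the driver
  ip -s link show dev eno1
  sudo ethtool -S eno1 | grep -Ei 'drop|err|crc'
  # TCP-level symptoms (retransmits) do show up on the hosts
  netstat -s | grep -i retrans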
[20:39:05] just waiting for POSTs in the current backlog is 3 hours ;D [20:39:11] I feel like we still don't understand *how* what happened, happened [20:39:18] anytime anything is done in that data center...there is that risk..no matter how much you work around things and check and double check and are careful things can happen. When we started talking about replacing PDU's I recall stating that there is a risk because of the really tight spaces I have to work. [20:39:46] cdanis: all of the switches are stacked together with fibers [20:39:49] cdanis: fixed the last one [20:40:00] in the past it was a crazy thick copper cable, but now is a nice delicate fiber. [20:40:11] in moving the new pdus into the rack [20:40:13] or the old pdus out [20:40:15] yes I understand that much robh [20:40:22] cdanis: unsure how much detail we want. adding which mw servers were affected is probably not worth much. kind of random? [20:40:25] we likely bashed the edge of that stacking fiber [20:40:29] what I'm saying is, you'd expect to see packet drops on the hosts involved, or similar [20:40:32] cdanis: there are very thick power cables in each rack and all the cables are routed in tiny spaces that also contain the same network cables [20:40:34] oh, sorry [20:40:37] =] [20:40:42] mutante: I think it's enough to mention A6 and A7 and leave it be [20:40:47] did anyone already have a google doc? [20:40:53] if not, I can start one and fill out the basics [20:40:59] chances are pulling those cables through knocked into one of the network cables [20:41:15] bblack: https://etherpad.wikimedia.org/p/2019-07-23-eqiad-asw-a [20:41:25] ah thanks [20:41:38] it's late and my brain is fried - anything immediate I can help with? [20:41:40] XioNoX: sorry, to be clear! 11am my time works [20:41:46] I tried to get ops in here and change the topic, but chanserv said no [20:41:57] until then cmjohnson1 and i can restrict ourselves to pre-staging work for pdus (no actual swaps) [20:42:02] and hardware repair on servers [20:42:04] cdanis: packet drops only show up if the device dropping them increments a counter [20:42:10] and decom of old crap if we get bored. [20:42:11] paravoid: nothing immediate from me, although I'm curious on some things later [20:42:17] here it's a middleware silently dropping then [20:42:19] XioNoX: so do we think the switch itself dropped them? [20:42:30] that lines up with the data I can see, of course [20:42:32] cdanis: yeah a link between two switches [20:42:34] cdanis: shoot, before I context switch and forget :) [20:42:44] I mean shouldn't we be able to see interface errors somewhere? [20:42:54] so symptoms are retransmits [20:43:31] cdanis: on the switch yeah, but it's only exposed by a juniper mib, not pulled by LibreNMS [20:43:43] and it's not well exposed on the cli [20:43:47] I assume a 'virtual switch chassis' is a way to federate a bunch of switches to form a big L2 network with some SDN driving the flows [20:44:06] cdanis: yeah pretty much [20:44:07] you could call it that, it's a very retroactive definition :) [20:44:10] ahaha [20:44:15] okay [20:44:18] there is no SDN and the concept predates the term [20:44:26] got it [20:44:43] virtual chassis (aka stacking) is that you merge the control plane of a bunch of switches together [20:44:47] would it be a correct guess that they are in a separate MIB so that the virtual switch can 'look' like one giant switch in the normal MIB? 
[20:44:48] ao the data is exposed, but poorly and we don't pull it [20:45:04] okay that makes sense [20:45:07] it looks like one giant switch for all intents and purposes, including SNMP, SSH etc. [20:45:21] thanks, that's helpful [20:45:26] you SSH to asw2-a-eqiad.mgmt.eqiad.wmnet, not asw2-a1/2/3 etc. [20:45:38] cdanis: the server facing ports are in a regular mib, but the cross links links are separate [20:45:51] cross switch [20:45:52] part of me is still itching to see a graph of error rate or packet drops or something for this fiber -- an actual smoking gun -- but I think it's enough for now [20:46:15] a definite actionable for this is to get better visibility + alerting for VCP ports (e.g. in LibreNMS) [20:46:27] yeah, we could implement it, it's not hard but needs time [20:46:56] we added alerting if a VCP goes down, but not if it fails like that [20:47:11] (VCP = VC port, so me saying "VCP port" is just wrong) [20:47:22] (and VC = Virtual Chassis) [20:47:59] XioNoX: so do you know how to get errors etc. on the CLI? [20:48:04] at least my LCD display is rendering your wrongness correctly ;) [20:48:19] cdanis: haha [20:48:26] XioNoX: if yes, add it to https://wikitech.wikimedia.org/wiki/Network_cheat_sheet, if not, find out :) [20:48:47] not on the CLI, I remember looking for it but not finding it, only for the master node, which doesn't help [20:48:51] ok I have one more question [20:48:54] I can look more [20:48:59] https://librenms.wikimedia.org/graphs/lazy_w=838/to=1563914700/device=160/type=device_bits/from=1563893100/legend=no/ [20:49:10] do we think that is real or a glitch caused by scrapes timing out [20:49:41] I think it's glitch [20:49:46] it has happened in the past [20:49:54] yeah probably a glitch [20:50:01] that is my guess as well [20:50:09] I saw some spikes of multicast/broadcast around that time but nothing that big [20:50:19] 1.1Tbps is not out of question for this stack, but unlikely [20:50:52] looking at ports individually I can't find such huge spike [20:51:30] I am wondering if the conjecture about high bit error rate --> positive feedback loop of retransmits is correct [20:51:45] what is the conjecture? I missed it [20:51:52] but I'd have to do some brushing up on TCP and also think about how that interacts with congestion control [20:52:15] paravoid: so let's say that the fiber is damaged in such a way where it still sees a bunch of traffic, but corrupts a large portion of it [20:52:28] the corruption would appear as packet loss [20:53:08] i.e. yes, bits on the wire would be corrupted but it's very unlikely you would see that in L2, the observable effect would be dropped packets [20:53:43] and on TCP flows that would result in missed ACKs and retransmits, yes [20:53:46] yes, you're right [20:54:10] so traffic levels would increase indeed [20:54:17] but not as drastically as I feared [20:54:20] well [20:55:02] packet loss was 5-7%, so retransmits of that level of traffic would be expected [20:55:02] with well-behaved applications written by socket veterans, you'd see a slight increase and TCP wouldn't behave too awfully in most such scenarios [20:55:22] yeah [20:55:32] but it's common for various libraries and applications now to short-circuit everything about TCP and just reconnect when there's the slightest whiff of loss or delay in the air, massively exacerbating problems. 
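To put rough numbers on the effect being described: with 5-7% loss the pain is not just the retransmitted bytes, it is that every loss also shrinks the congestion window, so per-flow throughput collapses. A back-of-the-envelope sketch using the Mathis et al. approximation (throughput ≈ (MSS/RTT) · C/√p, with C around 1.2); the RTT and loss figures below are assumed and illustrative, not measurements from this incident:

import math

MSS = 1448          # bytes, typical for a 1500-byte MTU with TCP timestamps
C = math.sqrt(1.5)  # Mathis constant, ~1.22 (depends on ACK behaviour)


def mathis_throughput_mbps(loss, rtt_s):
    """Approximate per-flow TCP throughput under random loss (fraction)."""
    return (MSS * 8 / 1e6) * C / (rtt_s * math.sqrt(loss))


for loss in (0.0001, 0.01, 0.05, 0.07):
    # Intra-row RTTs are sub-millisecond; 0.5 ms is an assumed figure.
    print(f"loss {loss:>6.2%}: ~{mathis_throughput_mbps(loss, 0.0005):8.1f} Mbit/s per flow")

(In reality the low-loss number is capped by the link speed, and sustained loss also triggers retransmission timeouts, which this approximation ignores; the point is just that 5-7% loss costs far more than 5-7% of throughput.)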
[20:55:39] bblack: I was worried that bad-checksum IP packets being delivered would lead to an immediate NAK leading to an immediate retransmit, which would be bad indeed [20:56:05] but I was missing that Ethernet itself has a check sequence, and it will discard packets that don't checksum (probably the interface on the switch was doing this) [20:56:15] which means the normal retransmit timers get involved and it isn't as bad [20:56:18] anyway. [20:56:49] I've seen so much of it over the years, it's never shocking to me anymore to see TCP traffic spike off through the roof when anything's a little wrong on a network, basically. [20:56:59] yeah, kinda disappointing :) [20:57:45] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-worker1082&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&from=1563909421091&to=1563912841310 [20:58:16] the disk being saturated by so many retries of remote reads is kinda funny [20:58:30] that seems unrelated I think? [20:58:46] looks like a hadoop job or something, or am I missing it? [20:59:07] https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=1563909421091&to=1563912841310 [21:00:06] I think a Hadoop job started at 19:30, but (because of the network issues) was mostly thrashing instead of making real progress [21:00:19] the end of disk thrash correlates with you disabling the fiber [21:00:24] looking at the graph farther out it doesn't seem to be rare [21:00:38] lots of events like that just in the past month [21:01:12] mm, maybe it is just coincidence [21:05:51] ok, I'm off, take care all [21:27:19] chaomodus: godog: volans: thought I just had -- we should be using prometheus's relabel functionality to add row/rack annotations to all our metrics. that would make it trivial to do stuff like show TCP retransmits / interface errors aggregated by rack, which would likely have been really helpful in diagnosis today [21:28:40] yes [21:29:15] we could have even seen mcrouter issues by rack [21:29:20] how would we do that [21:29:22] cdanis: netmon1002: host2rack mw1257 -> D5 (quick hack we just did) [21:29:23] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [21:29:34] mutante: nice :D [21:29:58] chaomodus: it's a tricky one, I suspect we'd need some sort of pipeline from netbox data into puppet hiera [21:30:05] anyway I have to go now but I will file a task later [21:30:19] kay :) [22:05:31] I have implemented a slightly less hackish script also https://gerrit.wikimedia.org/r/c/operations/software/netbox-deploy/+/525165 [22:06:50] i will assume it's ok to keep working on the scap upgrade now. i just dont want it to be mixed versions and we seem in the clear enough now [22:07:07] but if you disagree please stop me [22:15:45] i wonder why 2 random db hosts have scap unlike the rest of them [22:16:06] they were down or something when it was last upgraded? [22:17:47] chaomodus: it's not about the version, it is 2 hosts have the packages installed and all others do not have it at all [22:18:01] Ooh that is weird ys [22:18:05] Historical reasons? [22:18:27] hmm. yea. somehow. they are also not the oldest by number [22:18:58] ah, they are special in a way. they are the m4 shared. eventlogging [22:19:01] shard [22:19:15] maybe that gets deployed with scap [22:31:08] hmm.. restbase cumin aliases outdated. 
need to fix them to cover all hosts [23:23:08] we've talked before about either a) a puppet fact, or b) a puppet parser function, to poll from netbox [23:23:27] this would be to replace stuff like profile::cassandra::rack [23:23:32] and profile::elasticsearch::rack etc. [23:25:36] and also expose it to puppetdb and thus cumin [23:26:13] the netbox api isn't that hard, so the task itself isn't super complicated [23:26:58] I expect the complexity to be more so around "if this is a parser function how do we expose it to cumin" (for option b) or "do we really want to distribute a netbox API key to every host and have 1200 hosts poll the netbox API" (for option a) [23:27:24] and "do we really want puppet runs to fail if netbox is down", which is why I've been thinking about this as coming after the whole netbox HA project that chaomodus is working on [23:29:43] puppet has introduced "trusted facts" and "server facts" too, but I'm not sure if these are extendable or are only the built-in ones [23:34:48] can it cache facts? eg. if netbox is down still run with whatever previous data it returned
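On the caching question: one way this could look is a Facter external fact that queries the Netbox DCIM API and falls back to a locally cached copy when Netbox is unreachable. A sketch only -- the URL, token file, cache path and fact names below are assumptions, and deriving the row from the first character of the rack name is a simplification.

#!/usr/bin/env python3
# Hypothetical external fact: drop an executable like this into
# /etc/facter/facts.d/ (or the puppetlabs equivalent) and Facter picks up
# the key=value lines it prints as facts.
import json
import socket
import sys
import urllib.request
from pathlib import Path

NETBOX = "https://netbox.example.org"        # placeholder URL
TOKEN_FILE = Path("/etc/netbox-ro.token")    # assumed read-only API token
CACHE = Path("/var/cache/netbox_location.json")


def fetch_from_netbox(host):
    req = urllib.request.Request(
        f"{NETBOX}/api/dcim/devices/?name={host}",
        headers={"Authorization": f"Token {TOKEN_FILE.read_text().strip()}"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        device = json.load(resp)["results"][0]
    return {
        "rack": device["rack"]["name"],    # e.g. "D5"
        "row": device["rack"]["name"][0],  # crude: first character of the rack
        "site": device["site"]["slug"],
    }


def main():
    host = socket.gethostname().split(".")[0]
    try:
        location = fetch_from_netbox(host)
        CACHE.write_text(json.dumps(location))     # refresh the cache
    except Exception:
        if not CACHE.exists():
            sys.exit(0)                            # no data at all: emit no facts
        location = json.loads(CACHE.read_text())   # Netbox down: reuse last data
    for key, value in location.items():
        print(f"location_{key}={value}")


if __name__ == "__main__":
    main()

This sidesteps "puppet runs fail if netbox is down" at the cost of serving stale location data until the next successful refresh; the concern about 1200 hosts hitting the Netbox API on every run would still need rate limiting or a cache/proxy in front of the API.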
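The same exported data could also feed the rack/row label idea from 21:27. Prometheus relabel_configs can only rearrange labels a target already carries, so in practice rack/row would be attached as target labels, for example by generating file_sd target files from Netbox. A hedged sketch -- the output path and label names are made up, and only the mw1257 -> D5 mapping comes from the log:

import json

# Hypothetical hostname -> (rack, row) mapping exported from Netbox;
# mw1257 -> D5 is the example from the log, the rest would come from the API.
LOCATIONS = {
    "mw1257": ("D5", "D"),
}

# One target group per host, carrying rack/row as target labels.
# Prometheus file_sd re-reads the file automatically when it changes.
groups = [
    {
        "targets": [f"{host}:9100"],             # node_exporter's default port
        "labels": {"rack": rack, "row": row},
    }
    for host, (rack, row) in LOCATIONS.items()
]

with open("/etc/prometheus/targets/node_rack.json", "w") as f:  # assumed path
    json.dump(groups, f, indent=2)

# With the labels in place, per-rack aggregation is a one-liner, e.g.:
#   sum by (rack) (rate(node_netstat_Tcp_RetransSegs[5m]))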