[11:59:16] akosiaris & godog: https://phabricator.wikimedia.org/T227139 is goign to happen today if you can do the items you list as needing drain? [11:59:26] we wont be doing it until 10am [11:59:34] but figured if i tell you now you can do what you gotta do =] [11:59:48] (dont have to this second, but if it can happen for 10am pdu swap it would rock) [12:00:07] robh: ok will do [12:00:15] thank you! [12:00:31] after we do A3, A5 or A7 is next for today if you wanted to look ahead [12:00:42] they are all linked off the parent task https://phabricator.wikimedia.org/T226778 [12:08:26] do you have a notion of the schedule (which ones which days) for these ones coming up? it might help folks to plan their own schedules a bit [12:25:08] apergos: i just have today for now [12:25:13] A3, then A5, A7 [12:25:20] after that i have to coordinate a lot with cloudteam [12:25:33] I have a list of row b that clears DBA master review [12:25:54] I have no clue who is on cloud team and awake now [12:26:04] i tend to talk to the cloud team folks in pacific time ;D [12:26:40] so yeah, i'll track someone down from there later today, and that will allow us to schedule row b racks for wednesday and thursday [12:26:59] robh: arturo might be able to help for the cloud related things on this timezone [12:27:09] marostegui: thank you for clearing up a3! [12:27:11] o/ [12:27:30] arturo: Heyas, I haven't documented the contents of each rack in row B, but it is in netbox. https://phabricator.wikimedia.org/T226778 [12:27:36] the short story is this [12:27:49] we are swapping PDUs in rows A and B, as the existing ones are 6+ years old, and not great [12:28:12] (they use fuses, cannot be hot changed, have a single tower chassis for both infeeds rather than two independent towers) [12:28:36] Wednesday and Thursday, we would like to swap out as many racks in row B as we can (b5 is already done) [12:29:16] according to dba team, they have no issues with b1, b2, b4, b5, b6, b7 all going, none have db masters [12:29:22] * arturo context-switching to this [12:29:39] so, if you can review cloud team items sometime today to give us the cloud team recommended order of rack swaps, I'd appreciate it [12:30:02] ok, reviewing things [12:30:31] ie: wednesday maybe drain all of b1, b2 into b3 and b5 [12:30:35] robh: what is the expectation for power loss? [12:30:40] robh: b2 as long as it is done after thursday 05:30AM UTC [12:30:48] marostegui: oh, yes, sorry! [12:31:06] arturo: please read marostegui's comment on https://phabricator.wikimedia.org/T226778 in relation to what racks we can do first [12:31:25] the expectation of power loss is 4 out of 10. [12:31:29] maybe 3 [12:31:31] 3.5? [12:31:39] servers? [12:31:43] We didn't lose anything unexpectedly in a4 yesterday [12:32:00] but, it does involve moving a live PDU tower around in the rack and hoping you dont unseat any power cables [12:32:15] with two of us, we move very deliberately and we had no power loss yesterday [12:32:19] but, it can happen. [12:32:25] ok [12:32:41] my initial thinking is that completely draining the servers is not realistic for us [12:32:42] robh: when do you expect to do b5? [12:32:52] marostegui: b5 is done [12:32:56] Ah cool [12:32:57] thanks [12:32:57] it was the test rack [12:32:59] =] [12:33:12] marostegui: so put anything you want into it from row b and you wont have interference from us =] [12:33:15] so our approach would be more like select specific "important" VMs in each affected rack and reallocate them previous to the operations. 
[12:33:24] robh: yep, thanks [12:33:26] arturo: that sounds reasonable to me [12:33:57] arturo: so if you want, we can plan to do half (higher half) on wednesday [12:34:12] b7, b6, b4 [12:34:23] or atleast b7, and b6 [12:34:38] robh: I will comment this plan with the cloud team today, I need to get andrew involved in the conversation, just in case this interfere with other work he is currently doing [12:34:41] but b4 would also be nice. then the only thing in b3 is wikitech master [12:34:52] arturo: ok, please also note i am only in eqiad to assist chris this week [12:35:01] and after this week, the new guy, who seems good, will be helping [12:35:06] but of course i think im better ;D [12:35:21] ie: I recommend you guys reshuffle whatever you can for this week [12:35:23] robh: for b3, as I said, it is up to cloud :) [12:35:37] arturo: and then we'de like to move b3, and b1 on thursday [12:35:42] it is "their" database, as in: it is only used by them and wikitech [12:35:45] and then friday b2 [12:36:07] marostegui: we would also take special care triple checking that systems power connections [12:36:09] I only see cloudvirt1027 in B3, the other are decom [12:36:41] arturo: cool, do i need to copy notes to https://phabricator.wikimedia.org/T226778 or did you want to summarize after you investigate? [12:36:55] im about to head out to home depot for a drill bit and then onto eqiad, so going afk for about 45 minutes [12:36:57] I will investigate and write a proposal from our side [12:39:02] thank you! [12:42:55] robh: ok will do, I'll start in ~1h [13:38:29] ? [13:38:36] oh buffer playback issue [13:38:37] nm [13:41:20] robh: are hosts going to be downtimed in icinga for the duration of the work? [13:41:36] only if you tell me to like you did the ms-be system [13:41:38] otherwise no [13:41:46] if something loses power we want it to immediately alert imo [13:41:54] the goal is zero power loss [13:42:21] agreed, ok I'm not going to downtime unless I'm powering off things [13:42:48] yeah yesterday I think netmon and kubestage powered off but that wasn't unplanned and we got notifications [13:47:11] ok, i have to go afk and stage the pdus for the isntallation, ill check back in here and will also admin log before i do anything in a3 [13:47:40] but yeah, should be good to start the swap at 10am (so in 15min) [14:05:44] Ok please note we are working on A3 right now https://phabricator.wikimedia.org/T227139 [14:14:06] !log a3-eqiad pdu swap taking place now via T227139 [14:14:06] robh: Not expecting to hear !log here [14:14:08] T227139: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 [14:14:14] bah [14:14:17] whatever, works [14:16:20] heya godog yt? [14:16:42] been trying X-Container-Read but not having much luck [14:16:49] ottomata: yep I'm here [14:16:58] am looking at https://docs.openstack.org/swift/latest/overview_acl.html [14:17:07] have tried doing swfit post ottotest0 to set --read-acl [14:17:13] have tried uploading with X-Container-Read [14:17:25] have also tried creating new container and uplaoding with X-Container-Read [14:17:43] all still seem give Unauthorized for unauthenticated GET [14:19:49] ottomata: mhh ok, under the analytics swift account ? [14:19:56] yes [14:20:07] oh [14:20:12] maybe i'm requesting the url wrong? [14:20:22] possibly, what's the url? [14:20:34] e.g. 
[14:20:41] curl https://ms-fe.svc.eqiad.wmnet/v1/AUTH_analytics/ottotest0 [14:20:51] but, maybe i shouldn't have th AUTH_analytics part in there [14:21:53] the prefix should be fine, though there you are requesting a listing [14:22:04] does it work if you try downloading a single object ? [14:22:19] hmmmm it does! [14:22:40] ok now am trying eric's prefix then [14:22:42] with e.g. ?prefi= [14:22:44] like [14:23:01] curl -v 'https://ms-fe.svc.eqiad.wmnet/v1/AUTH_analytics/ottotest0?prefix=location_test1' [14:23:13] then yeah you are missing the listings acl I believe, it is mentioned in the acl overview [14:23:22] ah mh [14:23:30] AHHHH [14:23:31] got it [14:23:32] ok [14:23:50] although I think going with single objects instead of listings might be desirable, see my comments on the cr [14:24:08] not feeling very strongly, I can see arguments for both approaches [14:24:10] i agree i think too, but i think having listing ability would be good [14:24:15] yeah. there might be a lot of files... [14:24:19] and [14:24:31] it might be hard to get the list of files from the upload job [14:24:36] hadoop usually abstracts the actual files away from you [14:24:41] since it splits them up [14:24:52] depends a bit on how they are being written [14:25:34] heh for that option I guess what we could do is list the container instead and generate kafka messages with the files found inside [14:26:17] to me having a simpler client seems preferrable, but I don't have a clear picture of all uses cases either [14:26:35] aye, either way i think allowing listing will be ok, at least for debugging purposes [14:27:04] *nod* yeah [14:27:20] hm ok, still having trouble, even with .rlistings [14:27:29] i tried modifying with swift post [14:27:35] but also now created a new container [14:27:36] ottotest2 [14:28:06] with an upload [14:28:06] e.g. [14:28:08] /usr/bin/swift upload --header 'X-Container-Read:.r:*,.rlistings' ... [14:29:07] AH ha ok [14:29:13] just got it with ottotest0 with swift post [14:29:50] ok that's good, but i need it to work with swift upload and X- Container-Read [14:29:52] hm [14:30:37] the acl need to be set only when creating a new container though, could you detect that and swift post accordingly? [14:30:52] it might be that swift upload doesn't touch the container directly like post does [14:30:59] hm [14:31:04] could [14:31:47] ya ok ca ndo [14:32:46] godog: in that case, should I only pass X-Storage-Policy when creating the container with swift post [14:32:51] and not in the swift upload? [14:32:59] ok both new pdu towers installed in a3 without issue. we are about to reroute the infeeds and kill infeed b [14:33:01] that can't be changed per object anyway, riht? 
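For reference, the container setup this exchange converges on is roughly the following sketch. Names are the test ones from the conversation, the storage policy value is a placeholder, object paths are illustrative, and swift credentials (ST_AUTH/ST_USER/ST_KEY or -A/-U/-K) are assumed to already be set:

  # set the public-read ACL (plus listings) and the storage policy once, at container creation;
  # swift post creates the container if it does not exist yet
  swift post --read-acl '.r:*,.rlistings' --header 'X-Storage-Policy: <policy-name>' ottotest0
  # later uploads then do not need to touch container headers at all
  swift upload ottotest0 location_test1/part-00000
  # unauthenticated GET of a single object only needs .r:*
  curl 'https://ms-fe.svc.eqiad.wmnet/v1/AUTH_analytics/ottotest0/location_test1/part-00000'
  # an unauthenticated listing (e.g. with ?prefix=) additionally needs .rlistings
  curl 'https://ms-fe.svc.eqiad.wmnet/v1/AUTH_analytics/ottotest0?prefix=location_test1'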
[14:33:11] but it could end up being a, and then we have to replug it in and wait a minute before moving to b [14:33:18] also one of these will cause all mgmt to flap [14:33:22] ottomata: that's right, yeah set the policy and acl at container creation time and you should be done [14:33:31] (i admin logged this, just echoing in here) [14:45:01] side b in a3-eqiad migrated [14:46:45] ok, swaping the side a towers, mgmt didnt blink on side b [14:46:48] but it WILL on this [14:55:22] robh: dbproxy1003 went down apparently [15:00:18] robh: and it is now back [15:04:19] yep, no bueno [15:04:27] atleast it was a non-in-use server [15:04:37] but we double checked everything, that flap was after we had the new pdus in [15:04:42] and re-shuffling power [15:04:49] so, noted, and hopefully wont happen again [15:05:05] robh: fyi, there are still hosts in icinga with a message saying that PS are not redundant, is that expected? [15:05:35] some of them are clearing up now [15:05:55] marostegui: they all show green here [15:05:56] let me force an icinga check [15:07:05] robh: so the only outstanding things on icinga for now: elastic1031 with no PS redundancy and db1127 mgmt down [15:07:50] db1127 just cleared, elastic1031 remains [15:09:49] marostegui: elastic1031 is green on powersupplies but organge on error led [15:09:54] chris is checking [15:10:03] cool [15:10:04] thanks [15:10:47] ha [15:10:52] the psu error for that system is a psu fan [15:11:00] unrelated to our work, but is a psu failure in icinga [15:11:24] so maybe a ticket for our elastic sres? [15:11:29] chris is making one now [15:11:40] lovely thanks [15:15:49] ok, once chris is done making that task [15:15:58] we're ready to move onto a5 https://phabricator.wikimedia.org/T227141 [15:17:27] akosiaris: are you about and can you return your nodes in a3 to use [15:17:32] and drain nodes in https://phabricator.wikimedia.org/T227141 ? [15:18:00] robh: yeah, will do so now [15:18:35] thank you! [15:18:56] I won't be around for the next one though [15:19:03] probably that is [15:24:51] well, you have the directions on how to depool [15:24:58] just seemed easier to ping you when you were about =] [15:34:52] robh: done [15:46:19] ok, both new pdus mounted in the rack, old pdu still live but not mounted [15:46:25] we're going to kill side b of a5-eqiad now [15:56:47] ok side a done doing side b in a5-eqiad [16:07:54] a5 all done [16:08:18] correction, chris is cleaning up the cables but the pdu swap is done [16:19:11] akosiaris: ok you can return a5 items to service [16:19:14] we are all done there [16:19:45] a7 will be after lunch (in about an hour or so) [16:22:47] not sure how to migrate services back to a ganeti node [16:23:17] sudo gnt-node add it seems [16:23:27] wait, no thats if removed for a long period [16:23:36] we just migrated never removed, so not sure [16:23:54] Anyone other than akosiaris know how to migrate back to a node? [16:23:57] moritzm: ^? [16:26:00] robh: maybe you want 'gnt-instance migrate' ? [16:26:03] never done it though [16:27:05] thats to migrate away afaict [16:27:13] sudo gnt-node migrate -f ganeti1008 pushes ganeti1008 out of service [16:27:16] which is what alex did [16:27:20] but now i want ot bring it bakc [16:27:36] gnt-instance migrate is different from gnt-node migrate [16:28:00] oh, yes, sorry [16:28:01] you'd pick some of the instances -- the VM names -- that are supposed to be on ganeti1008 [16:28:05] oh, ok [16:28:14] i did not output what was there beforehand [16:28:16] so bleh. 
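A sketch of the Ganeti commands being puzzled out here, which the next few messages spell out; run on the cluster master, the instance name is a placeholder, and the hbal flags (Luxi backend, execute moves) should be checked against the wikitech Ganeti page before use:

  # instances that still list ganeti1008 as a node, i.e. candidates to move back onto it
  sudo gnt-instance list -o name,pnode,snodes | grep ganeti1008
  # live-migrate a single instance back between its primary and secondary node
  sudo gnt-instance migrate <instance-fqdn>
  # or skip the per-instance moves and rebalance the whole cluster later, as suggested below
  sudo hbal -L -X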
[16:28:27] * robh didnt migrate away and assumes alex has that list [16:28:32] so it can wait for him i suppose [16:28:45] a7 doesnt have a ganeti node to drain so we're just down 1 [16:29:16] bblack: heyas, you note that traffic has to be around for a7 [16:29:21] https://phabricator.wikimedia.org/T227143 [16:29:38] it is 12:30 here, can we do this in 90 minutes? [16:30:22] robh: yes [16:30:33] robh: this will tell you what could be running on ganeti1008 but presently isn't: sudo gnt-instance list -o name,snodes | grep ganeti1008 [16:30:43] anyway I would not worry too much about it [16:31:27] ok, we are going to go snag some lunch, back in less than an hour, then we'll start a7 [16:40:10] cdanis: makes sense. ack. thx [16:40:27] I actually think it's probably not worth doing any ad-hoc rebalancing [17:18:44] robh: just proceed, we can rebalance the Ganeti cluster when the PDU maintenance is over [17:18:57] it's a fairly long-running operation anyway [17:19:02] https://wikitech.wikimedia.org/wiki/Ganeti#Cluster_rebalancing [17:40:47] robh: ^ that [17:41:35] and cdanis is correct on that oneliner ofc. pro tip: list -o +snodes is ever better ;-) [17:41:47] ooh ty akosiaris [17:48:58] cool [17:49:03] ok, chris and i are back from lunch [17:49:13] we are goig to start work on a7-eqiad [17:49:27] bblack: let me know when we are ok to continue on a7 [17:49:31] since it specifically had lvs system [17:49:45] ref: https://phabricator.wikimedia.org/T227143 [17:52:34] also ping vgutierrez or ema i suppose as traffic folks =] [17:54:14] hopefully one of you is around or else we have to stop swapping pdus for today [17:54:21] non-ideal. [17:57:20] robh: Valentin is on holiday and I think that Ema is already out, probably Brandon is your best bet :) [17:57:39] yeah, i thought he would be around but it may be lunch time for him [17:58:05] yep [18:02:40] robh: I'm here now [18:02:57] cool, we are ready to work on a7 when you are =] [18:03:06] just prestaging other items while i was waiting [18:04:11] so if/when you can kill the lvs system that is live there, let us know! [18:04:20] we will shutdown the ms-be systems there after that and do the pdu work [18:07:11] robh: lvs/cp are ready to go (1013 traffic is failed over to 1016) [18:07:24] awesome, i'm killin gms-be systems now [18:07:26] is this a definite downtime or just a possible? [18:07:37] ah definite [18:07:39] ok [18:07:56] so should I soft-poweroff these cp/lvs? [18:09:10] uhhh [18:09:11] its possible [18:09:13] not definte. [18:09:16] definite even [18:09:24] we just shtudown ms-be because godog prefers it [18:09:29] oh ok [18:09:38] we'll just wait and see then [18:09:39] the ms-fe we depool and leave up [19:09:12] ok, returning a7 servers to service [19:09:54] ok to repool lvs/cp? [19:10:19] yep, we are done in there now [19:10:22] thank you [19:16:04] ok relocating out of dc floor to hotel a block away [19:16:10] back online shortly to schedule row b and stuff [19:20:26] robh: <_joe_> can we please suspend all maintenance for the day? [19:21:34] ^ [19:21:54] row B should hold for now, at least until we understand all the badness that correlated with the A7 timeframe [19:22:07] (which may or may not be related, but either way, the noise makes it extra confusing) [19:25:15] <_joe_> 10.64.0.83 can we confirm this is in A6? 
[19:25:41] so recapping some salient points that are hard to dig from IRC with all the noise in the past ~1.5h: [19:25:43] for the record A6 pdu wasn't touched today or the task isn't updated [19:25:46] _joe_: that is mc1022 which is indeed A6 [19:25:58] <_joe_> 10.64.0.80 - 84 are all giving timeouts to some appservers [19:26:19] re: A7 power work - work started ~18:00 with some depools and prep work, and the actual power cuts were ~18:33 and ~18:42 for the two separate legs [19:26:32] <_joe_> and that's when the problem started [19:26:46] the early depools were ms-fe and ms-be though, which seem kinda out there to be related [19:26:56] was there mw depools as well? [19:27:06] should we phone robh or Chris to make sure everything on site looks good? [19:27:18] <_joe_> cdanis: can you depool all appservers that are in A7 please? [19:27:29] the other early depool work was traffic: we depooled 1x node each in text-eqiad and upload-eqiad for A7, and also failed over LVS public text traffic from lvs1013 to lvs1016 (which seemed fine) [19:27:30] FWIW the overall rate of 50x does not look too bad [19:27:33] _joe_: rgr [19:27:54] on mw1270 - /var/log/php7.2-fpm/error.log - script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' .. executing too slow .. child .. stopped for tracing [19:27:58] I checked bandwidth usage on mc1022 with ifstat, it seems normal [19:28:03] <_joe_> mutante: that's ok [19:28:26] * akosiaris checked for network errors on mc1022 and net util and looks fine [19:28:31] ah snap [19:28:31] Jul 23 19:28:11 mw1271 mcrouter[919]: I0723 19:28:11.842501 1051 ProxyDestination.cpp:453] 10.64.0.81:11211 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error [19:28:38] _joe_: done [19:28:43] <_joe_> elukey: I am operating in the hypothesis there is some network problem in A7 at this point [19:28:44] about 5MB/s thatn a couple of hours ago, but otherwise ok [19:28:56] less than* [19:29:12] ok [19:29:15] but those 5MB/s less traffic on mc1022 might very well be related [19:29:28] XioNoX: ^ see above: 19:28 < _joe_> elukey: I am operating in the hypothesis there is some network problem in A7 at this point [19:29:35] drop seems to start at 18:51 [19:29:47] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=8&fullscreen&orgId=1&var-server=mc1022&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached [19:29:54] yeah the big badness is mostly ~18:[45]x and onwards [19:29:55] fwiw, regarding the scap upgrade. i upgraded it all on codfw mw hosts and some other machines, but not mw eqiad, and they did not involve any restarts [19:29:59] but there are bad signs earlier in graphs [19:30:06] <_joe_> cdanis: can you please !log your actions? [19:30:14] _joe_ the hard tko is something that I haven't seen in mcrouter before (only soft), I am not sure if the shard have been depooled for good [19:30:21] by some mw appservers [19:30:28] i am not ar home but is there sth i can help with for logstash? [19:30:29] _joe_: I did [19:30:35] <_joe_> elukey: it's possible. 
that appserver is in A7 though [19:30:49] godog: pretty sure logstash is tertiary fallout, it's being overwhelmed by other things going wrong and logging about it [19:30:53] <_joe_> mc1021 has a puppet failure too [19:31:09] bblack: kk, thanks for the context [19:31:11] godog: logstash appears to be a victim of whatever is going on [19:31:11] <_joe_> next mitigation possible is we remove the mc servers in A6 from the pool [19:31:12] (mostly the logstash impact is coming from memcache errors being logged, I think) [19:31:45] all PHP7 rendering alerts are green again except: mw1312 and mwdebug1002 [19:31:54] <_joe_> I see no change by depooling those servers though [19:31:58] <_joe_> in the mcrouter metrics [19:32:05] <_joe_> maybe I need to wait the moving average [19:32:07] the earliest hard graph evidence that seems definitely-related, that I've seen is the ~18:03 massive ramp-up in MC requests [19:32:21] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&panelId=41&fullscreen [19:32:26] em [19:32:27] https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=1563909907302&to=1563910330003 [19:32:30] what is this ^ ? [19:32:38] <_joe_> this https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&panelId=33&fullscreen&from=now-15m&to=now gives a pretty compelling story [19:32:58] analytics? [19:32:59] yep [19:33:15] there are two an-worker hosts on that rack [19:33:18] (A7) [19:33:30] <_joe_> shdubsh: can you downtime logstash? [19:33:34] <_joe_> so that it stops paging us [19:33:39] ack [19:34:00] <_joe_> ok, anyone is looking at connectivity between appservers and those mc hosts somehow? [19:34:08] network traffic from an-worker10{81,82} does not look alarming [19:34:09] <_joe_> please say what you're doing when you're doing it [19:34:18] looking if those 2 hosts justify those TCP retransmits [19:34:20] <_joe_> cdanis: let's focus on the appservers/mc issue [19:34:21] but the retrans rate is interesting right [19:34:34] cp1077 and cp1078 are two others in the same rack to peek at, if it's a rack-wide network issue [19:34:37] the retrans rate made me suspect rack switch saturation [19:35:10] )the big dropoutt of traffic on those two cp hosts is because they were depooled) [19:35:10] <_joe_> do A6 and A7 share some network apparatus that could saturate? [19:35:11] I am chasing down that lane (the analytics TCP retransmits), will report back here, please everyone, don't get sidetracked [19:35:24] _joe_: all An racks are part of a shared switch stack [19:35:50] each An has a separate top-of-rack switch, but they're all interconnected into one larger "virtual switch", they can affect each other for sure [19:36:16] <_joe_> Jul 23 19:23:52 mw1280 puppet-agent[38807]: Could not retrieve catalog from remote server: request https://puppet:8140/puppet/v3/catalog/mw1280.eqiad.wmnet interrupted after 0.864 seconds [19:36:27] <_joe_> mw1280 having issues connecting to the puppetmasters too [19:36:46] <_joe_> elukey: I think we should declare that mc rack lost for now and deploy a mcrouter config change [19:37:01] <_joe_> that will ease the pressure on logstash, at the cost of reduced caching [19:37:24] XioNoX: what does "Poller Time" mean? https://librenms.wikimedia.org/graphs/type=device_poller_perf/device=160/from=1563824100/ [19:37:37] is the switch taking literally 3x as long to respond to SNMP scraping? 
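The "depool all appservers in A7" step above is conftool work; a minimal sketch, assuming the usual confctl select syntax and using mw1280 purely as an illustrative host:

  # depool one appserver from every service it is pooled in
  sudo -i confctl select 'name=mw1280.eqiad.wmnet' set/pooled=no
  # and repool it once the rack is healthy again
  sudo -i confctl select 'name=mw1280.eqiad.wmnet' set/pooled=yes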
[19:38:00] <_joe_> if XioNoX is off, let's call faidon I guess [19:38:09] he was here 15 minutes ago... [19:38:09] https://librenms.wikimedia.org/graphs/to=1563910500/device=160/type=device_bits/from=1563888900/legend=yes/ [19:38:14] this graph can't possibly be real, right? [19:39:51] <_joe_> we keep getting [19:39:53] <_joe_> Jul 23 19:39:33 mw1280 mcrouter[9835]: I0723 19:39:33.730286 9841 ProxyDestination.cpp:453] 10.64.0.82:11211 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error [19:39:55] <_joe_> Jul 23 19:39:37 mw1280 mcrouter[9835]: I0723 19:39:37.800527 9841 ProxyDestination.cpp:453] 10.64.0.82:11211 unmarked TKO. Total hard TKOs: 0; soft TKOs: 0. Reply: mc_res_ok [19:39:59] rescheduled icinga service checks on the 2 remaining hosts showing rendering alerts. did not go away. but also no new ones showing up. mw1312, mw1270, mwdebug. should i try to restart hhvm on 1312 ? - Fatal error: entire web request took longer than 60 seconds [19:40:09] <_joe_> yes [19:40:12] <_joe_> please do [19:40:30] !log restarting hhvm on mw1312 [19:40:30] mutante: Not expecting to hear !log here [19:41:08] <_joe_> ok so [19:41:15] <_joe_> it's clear from logs from different servers [19:41:17] got your ping _joe_, I was checking metrics [19:41:21] <_joe_> that servers in any rack/row [19:41:39] <_joe_> are having the same issues connecting to the same servers [19:41:41] <_joe_> Jul 23 19:40:30 mw1340 mcrouter[48804]: I0723 19:40:30.489393 48810 ProxyDestination.cpp:453] 10.64.0.82:11211 marked soft TKO. Total hard TKOs: 0; soft TKOs: 1. Reply: mc_res_timeout [19:42:01] it is weird since telnet work from mw to mc [19:42:03] <_joe_> this server is in row c [19:42:09] RECOVERY - OSPF status on cr2-eqdfw is OK [19:42:19] <_joe_> elukey: but maybe you can't retrieve a big blob of data [19:42:22] <_joe_> before the timeout [19:43:06] this is mc1022 https://librenms.wikimedia.org/device/device=160/tab=port/port=14186/ [19:43:30] there are big holes [19:43:32] I suspect there are some monitoring artifacts in librenms because of network issues [19:43:44] should we contemplate a DC failover? [19:43:45] but there are big holes and there is a completely unreasonable spike of packets and traffic [19:43:48] for asw2-a-eqiad [19:44:01] how confident are we currently on codfw-switchover stuff? [19:44:12] <_joe_> bblack: not sure about the databases [19:44:22] <_joe_> but for everything else, we shall be ok-ish? 
[19:44:32] I can move the traffic edge stuff out of eqiad, it may help in some indirect sense in reducing general network load there [19:44:47] <_joe_> frankly I'd first cordon off the mc servers that are showing issues [19:44:48] (and getting rid of some reqs to active/active services in eqiad) [19:44:50] cdanis: yeah but I don't see it for mc1030 https://librenms.wikimedia.org/device/device=162/tab=port/port=14758/ [19:44:55] ok [19:45:04] elukey: it's on asw2-c-eqiad [19:45:10] so something happened with asw2-a I think [19:45:19] (or connectivity *from* librenms *to* asw2-a) [19:45:27] mw1280 - could run puppet just fine manually - we have been seeing some of those intermittent puppet issues before this issue started [19:45:37] mc1021 as well https://librenms.wikimedia.org/device/device=160/tab=port/port=14185/ [19:45:41] that is A5 [19:45:54] <_joe_> mutante: it depends if we have a network issue ongoing during the puppet run [19:46:01] <_joe_> network issues are intermittent [19:46:18] elukey: assuming that the letters in the access switch hostname also means row, I suspect you'll see the same for the port of any mc host in row A [19:46:29] _joe_: ack, just saying we have seen some of them before the main issue started [19:46:40] _joe_: TKOs are not as bad as they were 20 minutes ago, at least [19:46:48] <_joe_> elukey: what do you think of my proposal? [19:47:03] cdanis: yes (the letters) [19:47:26] <_joe_> timeouts haven't changed significantly [19:47:27] current appservers showing mcrouter issues, in descending order by current TKOs: mw1312 mw1327 mw1329 mw1264 mw1326 [19:47:29] for what is worth, the retransmits on the analytics cluster seems to be a byproduct of the entire cluster running at fully network capacity right now. Most an-workers (e.g. an-worker1078 up to an-worker1095 and from analytics1042 up to analytics1077 have network traffic upwards of 60MB/s and usually in the 120MB/s [19:47:34] there are more but those are the large ones [19:47:38] elukey: does this make any kind of sense? [19:47:43] _joe_ I agree with proceeding but if we establish first what is the impact to mediawiki/dbs/etc.. My point is that if we currently see timeouts but things can allow us for some debugging, better to proceed, otherwise let's remove the shards [19:47:46] why would these hosts move SO MUCH DATA around ? [19:48:04] akosiaris: distributed filesystem and lots of shuffling, that's the nature of mapreduce [19:48:14] akosiaris: I can check, there is probably a big job running, but usually it is not an issue [19:48:15] <_joe_> cdanis: can you extract the tkos by remote host IP too? [19:48:17] it seems to me that almost all are saturating their network interfaces [19:48:25] <_joe_> I'm pretty sure they're all the same 5 servers [19:48:42] <_joe_> elukey: I'd kill such a job in this moment maybe [19:48:50] actually, elukey scratch that. You are on the MC front [19:48:56] is otto around? [19:49:11] <_joe_> akosiaris: not in this room, go fetch him :) [19:49:26] not on any room currently it seems [19:49:29] _joe_: 10.64.0.174 10.64.32.48 10.64.32.66 10.64.0.59 10.64.32.47 [19:49:45] hey [19:49:48] <_joe_> cdanis: those are the memcached? [19:49:51] akosiaris: I can check now, it will be quick for me. You are saying that all hosts are saturing their ports? or the row a ones? [19:49:53] _joe_: these are appservers [19:49:54] <_joe_> uhm. 
not what I see [19:50:06] <_joe_> cdanis: no I mean the IPs of the memcached servers going tko [19:50:09] so [19:50:10] paravoid: Memcache request rates spiked up early on (~18:00), various things go worse over the 18:xx hour, there were lots of MW servers failing rendering checks from icinga, all in A7, around the worst of it coming on. [19:50:15] elukey: all I think [19:50:16] can anybody check what is the current overall impact? [19:50:18] I'm looking at asw2-a logs [19:50:25] various A7 hosts have obvious signs of network problems: TCP retrans graphs, puppet failing to connect, etc [19:50:26] I mean to mediawiki requests etc.. [19:50:35] <_joe_> elukey: it's small right now [19:50:37] lots of stuff happened (which I can go in detail in a bit), but ended at 19:27 [19:50:40] <_joe_> it was worse at some point [19:50:44] <_joe_> we have some added latency [19:50:47] i.e. there is nothing useful in the logs past 19:27 [19:50:54] elukey: look at https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&var-instance=All&panelId=84&fullscreen [19:50:54] ok [19:51:02] <_joe_> paravoid: but we still see this for instance [19:51:09] <_joe_> https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&panelId=33&fullscreen&from=now-30m&to=now [19:51:10] paravoid: ok, good to know, we're still seeing mcrouter issues fwiw [19:51:18] and the network row that gets expanded to all hosts [19:51:20] memcached general request rate is still elevated: [19:51:22] https://grafana.wikimedia.org/d/000000316/memcache?panelId=41&fullscreen&orgId=1 [19:51:30] <_joe_> the 5 servers in a specific rack have higher timeouts [19:51:40] A6 or A7? [19:51:41] which rack? 
[19:51:48] <_joe_> let me check [19:52:15] <_joe_> A6 [19:52:26] so a cross-switch (= cross-rack) switch was going up and down [19:52:29] <_joe_> mc1019 to mc1023 [19:52:36] FPC6 indeed [19:52:40] <_joe_> they're still showing problems [19:52:43] _joe_: I'll have a list for you of most-TKO'd memcacheds in a minute [19:52:49] <_joe_> cdanis: <3 [19:52:50] A6 task doesn't have any update about pdu work there [19:52:55] we did not work on A6 [19:53:01] dbprov1001 (A7) shows PSU critical alert since about 20 minutes [19:53:03] <_joe_> marostegui: it's a fallout from other work, clearly [19:53:13] akosiaris: I don't see any big job running but I'll keep checking [19:53:20] mutante: we fixed dbprov1001 like 30 minutes ago, lemme check the host directly [19:53:24] Jul 23 18:50:16 asw2-a-eqiad vccpd[1867]: interface vcp-255/0/50 went down [19:53:27] Jul 23 18:50:16 asw2-a-eqiad vccpd[1867]: Member 7, interface vcp-255/0/50.32768 went down [19:53:30] Jul 23 18:50:17 asw2-a-eqiad vccpd[1867]: interface vcp-255/0/50 came up [19:53:33] Jul 23 18:50:17 asw2-a-eqiad vccpd[1867]: Member 7, interface vcp-255/0/50.32768 came up [19:53:36] Jul 23 18:54:51 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:53:39] Jul 23 18:54:51 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:53:42] Jul 23 18:56:57 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:53:43] mutante: its old error, its clear on the host [19:53:45] Jul 23 18:56:58 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:53:48] Jul 23 19:09:09 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:53:51] Jul 23 19:09:09 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:53:52] mutante: maybe reforce the icinga check [19:53:53] "member 6" etc above, is the rack number [19:53:54] Jul 23 19:13:52 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:53:57] Jul 23 19:13:53 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:54:00] Jul 23 19:25:00 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:54:03] Jul 23 19:25:00 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:54:06] Jul 23 19:27:28 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 went down [19:54:09] Jul 23 19:27:29 asw2-a-eqiad vccpd[1867]: Member 6, interface vcp-255/1/0.32768 came up [19:54:27] correct [19:54:33] ah snap that is A6 where the mc hosts are [19:54:34] <_joe_> elukey: to respond to your question [19:54:35] all VC ports are up right now [19:54:46] correct [19:54:51] <_joe_> it's mostly https://grafana.wikimedia.org/d/000000066/resourceloader?refresh=5m&panelId=45&fullscreen&orgId=1 resourceloader being slow [19:54:54] it is possible this VC link is having errors [19:54:57] broken fiber or something [19:55:07] trying to figure out how to check that from the switch, these are kind of special interfaces [19:55:12] we still have elevated latency in A6 apparently, is it possible there's still some network-level fallout happening? (some multicast storm or packet loss or elevated latency due to crappy VC routing, etc?) 
[19:55:19] thanks _joe_ [19:55:29] this is the FPC 6 <-> FPC 7 link, FPC 7 is one of our two spines [19:55:42] and FPC7 is in the rack with the PDU work [19:55:42] elukey: ah, it's all 10G hosts, they can't be saturating their interfaces at 150MB/s [19:55:48] paravoid: I can disable that link if needed [19:56:04] 150MB/s > 10Gb/s [19:56:09] (they can be saturating) [19:56:25] oh wait my math was off by a zero lol [19:56:34] 150MB == 1.2Gb [19:56:38] akosiaris: not all have 10G, but a lot of them for sure [19:56:39] for a moment I thought my sleepiness got me [19:56:41] still, could be pps rate [19:56:55] elukey: sudo cumin 'an-worker*' 'ethtool enp130s0f0' [19:56:58] those for sure do [19:57:03] checking the other set [19:57:11] ah yes the newer ones, analytics* should not [19:57:12] XioNoX: can you check librenms to see if i) this specific VCP ii) other VCP are a) saturated b) showing errors? [19:57:29] lot of thoses in the logs: [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 54 UP [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 [EX-BCM PIC] phy_40g_cr4_an_status : Port 54 mii_status = 0x88, ll_adv = 0x1, lp_adv = 0x0 [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 [EX-BCM PIC] phy_40g_cr4_an_status : Port 54 pause resolution = 0, ll_p = 0x0 lp_p = 0x0 [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 [EX-BCM PIC] ex_bcm_cr4_get_remote_pause: GET REMOTE PAUSE = 0x0, port 54 [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 BCM Error: API bcm_port_advert_remote_get(device, port, &ablity) at ex_bcm_get_remote_ability:716 -> Operation disabled [19:57:29] Jul 23 19:27:28 asw2-a-eqiad fpc6 [EX-BCM PIC] ex_bcm_pic_get_an_info: Failed to get the remote ability for Rear QSFP+ PIC port 0 [19:57:30] Jul 23 19:27:29 asw2-a-eqiad fpc6 [EX-BCM PIC] ex_bcm_pic_ifd_config: vcp-255/1/0, enable - 1 [19:57:43] paravoid: VCP don't show up in librenms [19:57:51] _joe_: https://phabricator.wikimedia.org/P8787 [19:57:53] <_joe_> elukey: can you prepare a patch removing those hosts from memcached at least? [19:57:58] sure [19:58:06] those are the IPs for which mcrouter reports the most errors over the past ~20 minutes [19:58:16] robh/marostegui: ack. done. rescheduled icinga check - icinga is just overloaded i think. everything takes longer [19:58:16] the most TKOs that is [19:58:26] <_joe_> cdanis: thanks [19:58:37] what's the evidence we have of network-related issues right now, rather than this being a secondary/cascading effect? [19:59:03] we see high-level (= mcrouter) errors, has anyone confirmed network-level errors? packet loss, latency etc.? [19:59:26] <_joe_> paravoid: not right now, no, we had multiple signals before [19:59:37] when is before? and what kind of signals? [19:59:42] <_joe_> including puppet failing with broken pipes or connection resets [19:59:46] mutante: thanks [19:59:57] from mw1312 (the remaining host with rendering alerts besides mwdebug) i ping the puppetmaster1001 and i have about 7% packet loss [20:00:10] mutante: thank you! [20:00:13] paravoid: also some monitoring artifacts in librenms like https://librenms.wikimedia.org/graphs/to=1563910500/device=160/type=device_bits/from=1563888900/legend=yes/ [20:00:16] <_joe_> before == 19:25 - ish [20:00:20] XioNoX: thoughts? 
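One way to produce a "most TKO'd shards" list like the paste above, shown as a sketch rather than how P8787 was actually generated; it assumes mcrouter logs reach the journal under a unit named mcrouter:

  # count TKO events per memcached shard on this appserver over the last 30 minutes, worst first
  sudo journalctl -u mcrouter --since '30 min ago' \
    | grep -E 'marked (hard|soft) TKO' \
    | grep -oE '10\.64\.[0-9]+\.[0-9]+:11211' \
    | sort | uniq -c | sort -rn | head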
[20:00:24] https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?orgId=1&from=now-3h&to=now [20:00:24] this was mentioned before as well https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=now-3h&to=now [20:00:30] ^ if you exclude labs on there, you can see better [20:00:39] <_joe_> herron: thanks that too [20:00:43] shall we disable fpc6: vcp-255/1/0? [20:00:55] parsoid has UDP inerrors from 19:45 - 19:49 [20:01:14] and analytics TCP retrans is still ongoing, started looking bad at 19:27 [20:01:15] chaomodus: is there an easy way to translate a set of IPs or a set of hostnames into racks (e.g. A6) from the command line? [20:01:31] dunno what hosts they are but I'm seeing spikes of traffic https://librenms.wikimedia.org/device/device=160/tab=port/port=14063/ [20:01:42] but that could be fallout (after 19:27 switch recovery, analytics is sending way more data and facing natural retrans) [20:01:47] elukey: are the analytics and an-worker hosts part of the same cluster? That would explain many of the TCP retransmits. if analytics1* hosts are saturating their interfaces and an-worker1* hosts try to talk to them it makes sense [20:02:11] akosiaris: yes they are, the analytics* ones are going to be refreshed this year, but part of the Hadoop cluster [20:02:27] cdanis: not command line but i could whip something up [20:02:35] tell me ds&rack [20:02:47] the psike of traffic above is analytics1058 [20:02:58] so yeah, I don't know if there's hard evidence of any network-level issue past 19:27 (as in, caused by faulty network in the present) [20:03:09] bblack: there is, see mutante's info above [20:03:31] ah there [20:03:39] there are alerts for " timed out before a response was received" for 2 restbase hosts, 1021 and 2010. thats A6 and B8 [20:03:43] 6-7% packet loss between mw1312 (A6) <-> puppetmaster1001 (B6) [20:03:54] robh: here? [20:03:56] elukey: ok that explains it then. grafana says also this has happened in the past (2019-07-20, 2019-07-19, 2019-07-11, so recently) in similar scales, so it's probably just an unlucky coincidence that it happened now [20:04:02] * akosiaris closing that avenue [20:04:21] paravoid: yes [20:04:29] <_joe_> paravoid: I don't see packet loss right now, but it might be a coincidence [20:04:30] robh: did you or chris at any point in time unplug/plug a cross-switch fiber? [20:04:42] _joe_: I reproduced it just seconds ago [20:04:49] Not that I am aware of, I did not unplug any fibers at all. [20:04:49] robh: and/or is it possible you damaged a fiber? [20:04:54] <_joe_> heh me too now [20:05:09] okay, so the one thing we do know is still happening is the mcrouter TKOs [20:05:13] ok, my hypothesis is that a fiber was broken in the process of this and it's causing packet loss [20:05:16] can we locate the erroring fiber by looking at VCP ports for interface error stats? [20:05:18] <_joe_> cdanis: and the packet loss [20:05:18] the memcacheds with the most TKOs across all appservers are on A6 [20:05:28] unfortunately I can't seem to extract information out of the switch for VCP ports [20:05:31] diagnostics say "Not-Avail" [20:05:32] ok [20:05:36] so I'm going to assume it's the fiber that was flapping [20:05:41] mc1023 mc1021 mc1020 mc1022 mc1019 are 4 of the top 5 TKOd memcacheds and they are all on A6 [20:05:42] and let's disable it [20:05:51] XioNoX: thoughts? agree/disagree? [20:06:02] paravoid: do you have an idea about the *start* time for this switch/fiber-level issue? 
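On the "translate hostnames into racks from the command line" question: the rack lives in Netbox, so something like the following works against its REST API. This is a sketch; the Netbox hostname, the need for an API token, and the use of jq are all assumptions:

  for h in mc1019 mc1020 mc1021 mc1022 mc1023; do
    curl -s -H "Authorization: Token $NETBOX_TOKEN" \
      "https://netbox.wikimedia.org/api/dcim/devices/?name=$h" \
      | jq -r '.results[] | "\(.name) \(.rack.name)"'
  done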
[20:06:11] 18:50:16 [20:06:15] ok [20:06:20] does that match? [20:06:22] <_joe_> that's when the bigger issues started [20:06:24] paravoid: sure, let's try it [20:06:26] <_joe_> but we had some issues before [20:06:30] yeah it matches the big increase in issues [20:06:33] <_joe_> starting from 18:00 [20:06:33] cdanis: there is also https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&panelId=33&fullscreen&from=now-3h&to=now to see it visually [20:06:37] but MC had elevated stats back to ~18:00 [20:06:47] cdanis: i have the list if you want it for a6 [20:06:59] paravoid: Do you want me to go back onsite? Ive been standing by for that. [20:07:06] elukey: ahaha now I feel a fool [20:07:07] robh: in a bit maybe, triaging [20:07:10] XioNoX: faidon@asw2-a-eqiad> request virtual-chassis vc-port set interface member 6 vcp-255/1/0 [20:07:14] confirm? [20:07:19] er [20:07:21] XioNoX: faidon@asw2-a-eqiad> request virtual-chassis vc-port set interface member 6 vcp-255/1/0 disable [20:07:25] set disable I guess ? [20:07:26] and reminder, there were software-level things going on circa 18:00 too (some deploy traffic, and a specific issue with an MW hosts having bad scap and depooling, then later a scap software upgrade of some kind, I don't know what else) [20:07:26] bad paste, sorry [20:07:28] ok [20:07:36] we could have multiple concurrent issues stepping on each other [20:07:54] XioNoX: ? [20:08:11] paravoid: lgtm [20:08:14] <_joe_> bblack: right, so let's first try to exclude further network problems [20:08:18] _joe_: I'm looking at shunting memcached logs to logstash to /dev/null in an attempt to bring back the logging cluster. Sound good? [20:08:28] done [20:08:45] <_joe_> shdubsh: let's wait a couple minutes, then you have my +1 [20:08:51] <_joe_> it's mediawiki errors though [20:08:53] paravoid: how is that packet loss? from puppetmaster? [20:08:56] <_joe_> not coming from memcached [20:08:58] re-running pings from mw1312, no packet loss [20:09:03] that I can see of [20:09:17] _joe_: right, I want to install a selector in the mediawiki filter that dumps the memcached channel [20:09:17] <_joe_> I see timeouts going down, but it's to early to call [20:09:19] things should be getting better [20:09:19] I am tailing mw1274's mcrouter logs, don't see more errors [20:09:28] mcrouter timeouts have nosedived [20:09:31] <_joe_> shdubsh: it might not be needed [20:09:31] as have tcp retransmits [20:09:33] <_joe_> yeah [20:09:35] alright [20:09:37] ok [20:09:40] wow we are recovering [20:09:43] :) [20:09:48] https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All&panelId=33&fullscreen&from=now-3h&to=now [20:09:52] \o/ [20:10:09] rendering on mw1312 and mwdebug recovered [20:10:12] right after you did that [20:10:14] damn, called it 16 minutes ago, should had gone with my instinct :) [20:10:16] it'd be neat to programmatically generate rack-level grafana dashboards [20:10:51] _joe_: no more rendering alerts [20:10:58] paravoid: so one of the fibers that makes up the virtual switch chassis got damaged? [20:11:05] paravoid: good catch [20:11:05] that is my current theory, yes [20:11:28] does the Juniper switch not.. like.. notice high BER or something and disable automatically? [20:11:34] no [20:11:43] <_joe_> ok, I'm going off, it's late and I'm tired. Can someone try to write down a timeline? 
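To recap the switch-side commands used or discussed in this incident, all run from the Junos CLI on asw2-a-eqiad: show virtual-chassis status maps member numbers to racks, show virtual-chassis vc-port all-members shows the state of the inter-switch (VCP) links, the request/show pair for diagnostics optics populates and then displays optics data, and the final request takes the damaged A6<->A7 link out of service. Exact option ordering may vary by platform; treat this as a sketch rather than a verified runbook:

  show virtual-chassis status
  show virtual-chassis vc-port all-members
  request virtual-chassis vc-port diagnostics optics
  show virtual-chassis vc-port diagnostics optics all-members
  request virtual-chassis vc-port set interface member 6 vcp-255/1/0 disable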
[20:11:49] and disabling it does not affect the virtual chassis? [20:11:57] elukey: we have redundancies [20:11:58] elukey: presumably it has less capacity now [20:12:04] <_joe_> elukey: your new mcrouter metrics are amaizing btw [20:12:11] _joe_ <3 [20:12:12] <_joe_> kudos on that work, it paid off tonight [20:12:21] https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=now-1h&to=now [20:12:27] ^ the analytics TCp retrans dropped off [20:12:32] ok question now is [20:12:35] do we replace that fiber now? :) [20:12:53] I'm off too [20:13:11] (question being whether robh goes back to the DC) [20:13:12] interestingly, the elevation in memcached request rates that started closer to 18:00 has now dropped off on switch fix as well [20:13:15] <_joe_> shdubsh: did logstash recover too? [20:13:24] paravoid: we're running without redundancy, and we're going to be doing rack work so I'd say asap [20:13:29] which makes me think the fiber issue goes all the way back there and we just don't know it yet [20:13:30] <_joe_> bblack: it's a bit of a thundering-herd effect [20:13:36] <_joe_> yep [20:13:39] well we're not doing any more rack work until this gets fixed for sure [20:13:45] robh: ^, please ack :) [20:13:47] bblack: that's retransmits for any reason, right? do we have graphs of NACK rate? [20:13:49] _joe_: events appear to no longer be building up in kafka, but we're a long way from consuming that backlog [20:13:55] (wondering if these were timeouts or if they were checksum errors) [20:14:03] cdanis: yes any [20:14:05] <_joe_> shdubsh: if you want to drop the memcached errors, just do so [20:14:13] ack [20:14:23] paravoid: issue is how to test the fiber? replace it and hope the issue doesn't show up again?? [20:14:37] <_joe_> ok I'm really off, I count on bblack or cdanis for writing the incident report! [20:14:40] robh: we have (what I think is) a damaged fiber between the switches of A6 and A7; we've disabled that link, but until that fibers gets replaced and the link gets enabled again, no more PDU swaps or any work that gets even remotely in the vicinity of fibers :) [20:15:03] XioBoX: there should surely be a way to see basic interface stats on the VCP links? like error rates and byte rates, etc? I don't get why we couldn't earlier. [20:15:10] err XioNoX ^ :) [20:15:12] +1 to bblack's question [20:15:12] paravoid: so no 10g racks can have pdu work? [20:15:15] do we have replacement for it on site ? [20:15:21] robh: no row A racks at all [20:15:24] bblack: actually it's traffic that dropped off https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=84&fullscreen&orgId=1 [20:15:37] oh, well, we are done with a until we can do the racks with db masters and the two network racks anyhow [20:15:40] but understood [20:15:43] and started dropping off about 3-4 mins before paravoid submitted the change on the switch [20:15:49] until we fix this [20:15:53] I am still thinking it's just coincidence [20:15:57] akosiaris: ah yeah, probably because kafka is in the error chain too [20:15:58] akosiaris: I will now conjecture that the retransmits were because of the bad fiber causing checksum errors (causing a lot of retransmits) [20:16:29] robh: also for the next racks, please plan for staying in the DC for at least half an hour after, just in case [20:16:33] bblack: yeah that data is exposed via snmp, so we could add that feature to librenms [20:16:50] XioNoX: we should for sure -- have you figured out a way to get them from the CLI? 
[20:16:57] I tried faidon@asw2-a-eqiad> show virtual-chassis vc-port diagnostics optics all-members [20:17:07] oh I see [20:17:10] neither kafka nor the bad fiber explains that amount of data (peaked at 3.7GB/s) [20:17:12] wilco [20:17:12] it seid [20:17:14] said* [20:17:15] Virtual chassis port: vcp-255/1/0 [20:17:15] Optical diagnostics : Not-Avail (run request) [20:17:20] which at the time didn't realize what it meant [20:17:20] paravoid: so optics errors no [20:17:24] but it's late and I am quite possibly wrong [20:17:40] paravoid: but interface level stats (usage, errors, etc) yes [20:17:40] but I think now it meant "request virtual-chassis ..." instead of "show virtual-chassis ..." [20:17:53] so is the answer just live with it tonight and first thing tomorrow onsite work with paravoid or arzhel to fix? [20:17:59] akosiaris: with a high enough bit error rate, a retransmit in the case of checksum failure is a positive feedback loop [20:18:07] or go there tonight? (either is fine i just want to know) [20:18:09] ;] [20:18:17] lol and now it is: [20:18:17] Optical diagnostics : N/A [20:18:20] gee thanks [20:18:47] cdanis: sure, but history says it's happening pretty often. e.g. https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=84&fullscreen&orgId=1&from=now-30d&to=now and https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=now-30d&to=now [20:18:50] yeah I've never been able to get the stats [20:19:00] if you run with request: `vc-port Diagnostics Optics Done (run show)` [20:19:01] XioNoX: interface level stats from the CLI, how? [20:19:09] errors etc. [20:19:09] that's my main reason for saying this is probably just coincidence [20:19:14] could be wrong though [20:19:29] s/e.g./i.e./ [20:19:34] XioNoX: also thoughts on whether to replace now or tomorrow? [20:19:40] (see robh's question above) [20:19:55] i can confirm we have seen some of the intermittent puppet errors yesterday [20:20:04] paravoid: which racks was that fiber connecting? [20:20:09] akosiaris: that past occurrence of 800k/sec retransmits on cache_text scares me [20:20:09] akosiaris: A6 to A7 [20:20:22] other than the possibility that the fiber issue started much earlier than we think, I still don't have a good explanation for the MC request rate pattern here (now recovered): [20:20:24] paravoid: no preference, myabe tomorrow so we can look more at which stats we have [20:20:26] https://grafana.wikimedia.org/d/000000316/memcache?panelId=41&fullscreen&orgId=1 [20:20:34] akosiaris: oh lol nvm I know what that one was 🤦 [20:20:35] but today we can too [20:20:36] i had started a scap upgrade when this issue happened. which is now rolled out partially. i would like to finish it so it's not mixed versions but i also don't want to add noise and maintenance until we are in the clear. i will wait for now. [20:20:43] cdanis: I was about to point out ;-) [20:20:55] can stil replace the fiber today and enable it tomorrow [20:21:28] XioNoX: I think nobody's at the DC at present [20:21:36] ah, then tomorrow yeah [20:22:10] robh said he's at a hotel a block away and offered to go, but yeah, let's not do it if we are not to use it :) [20:22:27] ah, I see a7 had pdu work today so it makes sense. [20:22:41] shdubsh: basically everything recovered except logstash is refusing connections. is that you getting rid of mc logs? 
[20:22:50] akosiaris: yeah, basically: [20:22:54] Jul 23 18:34:43 asw2-a-eqiad alarmd[2039]: Alarm set: PS SFXPC color=YELLOW, class=CHASSIS, reason=FPC 7 PEM 1 is not powered [20:23:03] [...] [20:23:03] Jul 23 18:35:59 asw2-a-eqiad alarmd[2039]: Alarm cleared: PS SFXPC color=YELLOW, class=CHASSIS, reason=FPC 7 PEM 1 is not powered [20:23:06] bblack: the part that I know is that when mcrouter reports a TKO for a shard, all the commands for that shard get "blackholed" for 4s, then a health probe is sent to the shard and if positive it is re-inserted in the pool. In the meantime mediawiki thinks that something is weird, and for example starts to react (trying to set a value that it thinks is not there, etc..) [20:23:16] Jul 23 18:43:34 asw2-a-eqiad alarmd[2039]: Alarm set: PS SFXPC color=YELLOW, class=CHASSIS, reason=FPC 7 PEM 0 is not powered [20:23:17] mutante: logstash is choking on the logs built up over the course of the incident. we're looking at unclogging it now [20:23:22] mutante: logstash is still mega backlogged because the messages are queued in kafka [20:23:22] [...] [20:23:22] Jul 23 18:44:14 asw2-a-eqiad alarmd[2039]: Alarm cleared: PS SFXPC color=YELLOW, class=CHASSIS, reason=FPC 7 PEM 0 is not powered [20:23:25] [..] [20:23:33] shdubsh/cdanis: ack! [20:23:39] okay [20:23:46] so _joe_ said he was coordinator but did not actually start a document [20:23:46] Jul 23 18:50:16 asw2-a-eqiad vccpd[1867]: interface vcp-255/0/50 went down [20:23:48] has anyone else? [20:23:53] we should be assembling a timeline [20:24:11] cdanis: I think that nothing was created, we should start a new one [20:24:24] nope. And _joe_ should not have been the coordinator per our notes [20:24:27] TZ and all [20:24:30] cdanis: how do we call it? "memcached" ? [20:24:56] the above means: rack 7 switch's first PSU went down at 18:34, then up at 18:35, then second PSU at 18:43, and back at 18:44 [20:25:00] and should not have actively dug into it either ofc [20:25:01] no, memcached was the victim [20:25:05] virtual chassis link partial failure [20:25:10] https://etherpad.wikimedia.org/p/2019-07-23-eqiad-asw-a [20:25:12] ahhaha poor memcached [20:25:14] the VC link went down at 18:50, but not that [20:25:19] note that* [20:25:26] when I started investigating it was reporting as up [20:25:45] (but we had packet loss, signaling a damaged-but-working fiber) [20:25:48] cdanis: except the - in the date [20:26:04] so it is possible that it was damaged before as well, and had packet loss before 18:50Z [20:26:19] mutante: it will live on the wiki eventually, this is a transient title :) [20:26:38] so if it existed before, i have a theory. the top of the racks have a large fiber duct, that may get banged when moving cables, but the fibers are protected inside [20:26:46] cdanis: nevermind, i was already on wikitech [20:26:51] if the fiber was already bad though, just shifting it from that could cause this issue. [20:26:52] ;D [20:27:00] ie: this is the not really onsites fault theory ;D [20:27:12] fairly unlikely :) [20:27:16] but yeah, we didnt really unplug fibers im aware of and chris was announcing what he unplugged. [20:27:17] we weren't having errors before [20:27:28] oh, then yeah something must have crushed a fiber. 
[20:27:34] I don't think it was unplugged, I think it's damaged [20:27:39] it's the fiber between A6 and A7 [20:28:01] hrmm, maybe when the pdus wre moved in or out it pinched it against the side of the rack rails [20:28:37] so [20:29:04] it sounds like we've decided above to postpone the fiber replacement tomorrow [20:29:18] robh: please plan to do that first thing (and coordinate with XioNoX) [20:29:23] no other work should happen before that [20:29:46] grrrrr I chanserv won't let me op myself here or set the topic [20:29:51] and please communicate that with cmjohnson too -- is there a chance he would go earlier than you and start working on it unaware of today's fallout? [20:30:53] paravoid: no other b row work either? [20:30:57] also, given we've had issues around these replacements for two days in a row, perhaps we tuned a bit too much towards "move fast and break things" and need to pace ourselves :) [20:30:58] or just a? [20:31:17] I'd say no other work, let's restore redundancy in the DC network before anything else [20:31:24] we lost a couple of servers and that was with being very careful [20:31:24] +1 [20:31:31] i dont think we could have prevented it better for those [20:31:39] but the fiber is something else entirely and i cannot argue that one [20:32:00] well, i rather not sit on my hands all day [20:32:04] well regardless of the prevention and related discussions, in the state we're in now, we should definitely back off until we've resolved current issues. [20:32:06] XioNoX: what is the earliest your time you wanna do this? [20:32:47] since you are 3 hours earlier in your day [20:33:09] robh: also please plan to stick around for a while (say, half hour) after that kind of major work is done [20:33:23] paravoid: i acknowledged that request earlier but you must have missed in backlog [20:33:28] wilco [20:33:28] ah yes I did [20:33:29] thanks :) [20:33:35] sorry, lots going on :) [20:33:43] cmjohnson1: oh hey [20:34:01] mutante: LOL at your etherpad name [20:34:27] cdanis: lol, i forgot about that [20:34:32] logging off if not needed o/ [20:34:38] paravoid hi [20:35:37] robh: I can do 11am eastern time maybe 1h early [20:36:03] cmjohnson1: not sure if robh filled you in, should I repeat? [20:36:15] he has filled me in [20:36:18] no need to repeat [20:36:21] just means i kinda waste a morning in eqiad not doing any eqiad work [20:36:24] but ok [20:36:33] there's plenty to be done isn't there? [20:36:46] we can do decoms [20:36:47] that doesn't involve recabling etc. [20:36:49] sorry, its unclear [20:36:53] i thought you meant no work [20:36:56] you mean just no pdu work? [20:37:00] its been a long day [20:37:02] I meant anything that risks fibers [20:37:11] PDU work, or recabling work, stuff of that nature [20:37:17] ok [20:37:26] or work that risks entire rows, etc. [20:37:56] with row A missing a link, we're on shaky ground, basically. [20:38:07] if that slows you down, I think we can do it now too - we avoided it in order to avoid asking you to go back and work some more :) [20:38:28] nah if we can work on normal hardware backlog that is also acceptable use onsite [20:38:39] there is more than 3 hours of that [20:38:45] mutante: can you please set a fixed time range on the grafana links you're adding [20:38:46] heh :) [20:38:52] way more than 3. [20:38:55] this doesn't make sense to me, shouldn't we see some interface drops or something on the hosts that were affected? 
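On the "shouldn't we see interface drops on the hosts" question: the usual host-side checks are below, but as discussed they come up clean here because the corruption happened on a switch-to-switch link, so only the switch would increment an error counter; TCP retransmits are the part that is visible host-side. The interface name is illustrative:

  # NIC-level drops/errors as seen by the kernel and the driver
  ip -s link show dev eno1
  sudo ethtool -S eno1 | grep -Ei 'drop|err|crc'
  # TCP-level symptoms (retransmits) do show up on the hosts
  netstat -s | grep -i retrans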
[20:39:05] just waiting for POSTs in the current backlog is 3 hours ;D [20:39:11] I feel like we still don't understand *how* what happened, happened [20:39:18] anytime anything is done in that data center...there is that risk..no matter how much you work around things and check and double check and are careful things can happen. When we started talking about replacing PDU's I recall stating that there is a risk because of the really tight spaces I have to work. [20:39:46] cdanis: all of the switches are stacked together with fibers [20:39:49] cdanis: fixed the last one [20:40:00] in the past it was a crazy thick copper cable, but now is a nice delicate fiber. [20:40:11] in moving the new pdus into the rack [20:40:13] or the old pdus out [20:40:15] yes I understand that much robh [20:40:22] cdanis: unsure how much detail we want. adding which mw servers were affected is probably not worth much. kind of random? [20:40:25] we likely bashed the edge of that stacking fiber [20:40:29] what I'm saying is, you'd expect to see packet drops on the hosts involved, or similar [20:40:32] cdanis: there are very thick power cables in each rack and all the cables are routed in tiny spaces that also contain the same network cables [20:40:34] oh, sorry [20:40:37] =] [20:40:42] mutante: I think it's enough to mention A6 and A7 and leave it be [20:40:47] did anyone already have a google doc? [20:40:53] if not, I can start one and fill out the basics [20:40:59] chances are pulling those cables through knocked into one of the network cables [20:41:15] bblack: https://etherpad.wikimedia.org/p/2019-07-23-eqiad-asw-a [20:41:25] ah thanks [20:41:38] it's late and my brain is fried - anything immediate I can help with? [20:41:40] XioNoX: sorry, to be clear! 11am my time works [20:41:46] I tried to get ops in here and change the topic, but chanserv said no [20:41:57] until then cmjohnson1 and i can restrict ourselves to pre-staging work for pdus (no actual swaps) [20:42:02] and hardware repair on servers [20:42:04] cdanis: packet drops only show up if the device dropping them increments a counter [20:42:10] and decom of old crap if we get bored. [20:42:11] paravoid: nothing immediate from me, although I'm curious on some things later [20:42:17] here it's a middleware silently dropping then [20:42:19] XioNoX: so do we think the switch itself dropped them? [20:42:30] that lines up with the data I can see, of course [20:42:32] cdanis: yeah a link between two switches [20:42:34] cdanis: shoot, before I context switch and forget :) [20:42:44] I mean shouldn't we be able to see interface errors somewhere? [20:42:54] so symptoms are retransmits [20:43:31] cdanis: on the switch yeah, but it's only exposed by a juniper mib, not pulled by LibreNMS [20:43:43] and it's not well exposed on the cli [20:43:47] I assume a 'virtual switch chassis' is a way to federate a bunch of switches to form a big L2 network with some SDN driving the flows [20:44:06] cdanis: yeah pretty much [20:44:07] you could call it that, it's a very retroactive definition :) [20:44:10] ahaha [20:44:15] okay [20:44:18] there is no SDN and the concept predates the term [20:44:26] got it [20:44:43] virtual chassis (aka stacking) is that you merge the control plane of a bunch of switches together [20:44:47] would it be a correct guess that they are in a separate MIB so that the virtual switch can 'look' like one giant switch in the normal MIB? 
[20:44:48] ao the data is exposed, but poorly and we don't pull it [20:45:04] okay that makes sense [20:45:07] it looks like one giant switch for all intents and purposes, including SNMP, SSH etc. [20:45:21] thanks, that's helpful [20:45:26] you SSH to asw2-a-eqiad.mgmt.eqiad.wmnet, not asw2-a1/2/3 etc. [20:45:38] cdanis: the server facing ports are in a regular mib, but the cross links links are separate [20:45:51] cross switch [20:45:52] part of me is still itching to see a graph of error rate or packet drops or something for this fiber -- an actual smoking gun -- but I think it's enough for now [20:46:15] a definite actionable for this is to get better visibility + alerting for VCP ports (e.g. in LibreNMS) [20:46:27] yeah, we could implement it, it's not hard but needs time [20:46:56] we added alerting if a VCP goes down, but not if it fails like that [20:47:11] (VCP = VC port, so me saying "VCP port" is just wrong) [20:47:22] (and VC = Virtual Chassis) [20:47:59] XioNoX: so do you know how to get errors etc. on the CLI? [20:48:04] at least my LCD display is rendering your wrongness correctly ;) [20:48:19] cdanis: haha [20:48:26] XioNoX: if yes, add it to https://wikitech.wikimedia.org/wiki/Network_cheat_sheet, if not, find out :) [20:48:47] not on the CLI, I remember looking for it but not finding it, only for the master node, which doesn't help [20:48:51] ok I have one more question [20:48:54] I can look more [20:48:59] https://librenms.wikimedia.org/graphs/lazy_w=838/to=1563914700/device=160/type=device_bits/from=1563893100/legend=no/ [20:49:10] do we think that is real or a glitch caused by scrapes timing out [20:49:41] I think it's glitch [20:49:46] it has happened in the past [20:49:54] yeah probably a glitch [20:50:01] that is my guess as well [20:50:09] I saw some spikes of multicast/broadcast around that time but nothing that big [20:50:19] 1.1Tbps is not out of question for this stack, but unlikely [20:50:52] looking at ports individually I can't find such huge spike [20:51:30] I am wondering if the conjecture about high bit error rate --> positive feedback loop of retransmits is correct [20:51:45] what is the conjecture? I missed it [20:51:52] but I'd have to do some brushing up on TCP and also think about how that interacts with congestion control [20:52:15] paravoid: so let's say that the fiber is damaged in such a way where it still sees a bunch of traffic, but corrupts a large portion of it [20:52:28] the corruption would appear as packet loss [20:53:08] i.e. yes, bits on the wire would be corrupted but it's very unlikely you would see that in L2, the observable effect would be dropped packets [20:53:43] and on TCP flows that would result in missed ACKs and retransmits, yes [20:53:46] yes, you're right [20:54:10] so traffic levels would increase indeed [20:54:17] but not as drastically as I feared [20:54:20] well [20:55:02] packet loss was 5-7%, so retransmits of that level of traffic would be expected [20:55:02] with well-behaved applications written by socket veterans, you'd see a slight increase and TCP wouldn't behave too awfully in most such scenarios [20:55:22] yeah [20:55:32] but it's common for various libraries and applications now to short-circuit everything about TCP and just reconnect when there's the slightest whiff of loss or delay in the air, massively exacerbating problems. 
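To put rough numbers on the effect being described: with 5-7% loss the pain is not just the retransmitted bytes, it is that every loss also shrinks the congestion window, so per-flow throughput collapses. A back-of-the-envelope sketch using the Mathis et al. approximation (throughput ≈ (MSS/RTT) · C/√p, with C around 1.2); the RTT and loss figures below are assumed and illustrative, not measurements from this incident:

import math

MSS = 1448          # bytes, typical for a 1500-byte MTU with TCP timestamps
C = math.sqrt(1.5)  # Mathis constant, ~1.22 (depends on ACK behaviour)


def mathis_throughput_mbps(loss, rtt_s):
    """Approximate per-flow TCP throughput under random loss (fraction)."""
    return (MSS * 8 / 1e6) * C / (rtt_s * math.sqrt(loss))


for loss in (0.0001, 0.01, 0.05, 0.07):
    # Intra-row RTTs are sub-millisecond; 0.5 ms is an assumed figure.
    print(f"loss {loss:>6.2%}: ~{mathis_throughput_mbps(loss, 0.0005):8.1f} Mbit/s per flow")

(In reality the low-loss number is capped by the link speed, and sustained loss also triggers retransmission timeouts, which this approximation ignores; the point is just that 5-7% loss costs far more than 5-7% of throughput.)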
[20:55:39] bblack: I was worried that bad-checksum IP packets being delivered would lead to an immediate NAK leading to an immediate retransmit, which would be bad indeed [20:56:05] but I was missing that Ethernet itself has a check sequence, and it will discard packets that don't checksum (probably the interface on the switch was doing this) [20:56:15] which means the normal retransmit timers get involved and it isn't as bad [20:56:18] anyway. [20:56:49] I've seen so much of it over the years, it's never shocking to me anymore to see TCP traffic spike off through the roof when anything's a little wrong on a network, basically. [20:56:59] yeah, kinda disappointing :) [20:57:45] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-worker1082&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&from=1563909421091&to=1563912841310 [20:58:16] the disk being saturated by so many retries of remote reads is kinda funny [20:58:30] that seems unrelated I think? [20:58:46] looks like a hadoop job or something, or am I missing it? [20:59:07] https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=1563909421091&to=1563912841310 [21:00:06] I think a Hadoop job started at 19:30, but (because of the network issues) was mostly thrashing instead of making real progress [21:00:19] the end of disk thrash correlates with you disabling the fiber [21:00:24] looking at the graph farther out it doesn't seem to be rare [21:00:38] lots of events like that just in the past month [21:01:12] mm, maybe it is just coincidence [21:05:51] ok, I'm off, take care all [21:27:19] chaomodus: godog: volans: thought I just had -- we should be using prometheus's relabel functionality to add row/rack annotations to all our metrics. that would make it trivial to do stuff like show TCP retransmits / interface errors aggregated by rack, which would likely have been really helpful in diagnosis today [21:28:40] yes [21:29:15] we could have even seen mcrouter issues by rack [21:29:20] how would we do that [21:29:22] cdanis: netmon1002: host2rack mw1257 -> D5 (quick hack we just did) [21:29:23] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [21:29:34] mutante: nice :D [21:29:58] chaomodus: it's a tricky one, I suspect we'd need some sort of pipeline from netbox data into puppet hiera [21:30:05] anyway I have to go now but I will file a task later [21:30:19] kay :) [22:05:31] I have implemented a slightly less hackish script also https://gerrit.wikimedia.org/r/c/operations/software/netbox-deploy/+/525165 [22:06:50] i will assume it's ok to keep working on the scap upgrade now. i just dont want it to be mixed versions and we seem in the clear enough now [22:07:07] but if you disagree please stop me [22:15:45] i wonder why 2 random db hosts have scap unlike the rest of them [22:16:06] they were down or something when it was last upgraded? [22:17:47] chaomodus: it's not about the version, it is 2 hosts have the packages installed and all others do not have it at all [22:18:01] Ooh that is weird ys [22:18:05] Historical reasons? [22:18:27] hmm. yea. somehow. they are also not the oldest by number [22:18:58] ah, they are special in a way. they are the m4 shared. eventlogging [22:19:01] shard [22:19:15] maybe that gets deployed with scap [22:31:08] hmm.. restbase cumin aliases outdated. 
need to fix them to cover all hosts [23:23:08] we've talked before about either a) a puppet fact, or b) a puppet parser function, to poll from netbox [23:23:27] this would be to replace stuff like profile::cassandra::rack [23:23:32] and profile::elasticsearch::rack etc. [23:25:36] and also expose it to puppetdb and thus cumin [23:26:13] the netbox api isn't that hard, so the task itself isn't super complicated [23:26:58] I expect the complexity to be more so around "if this is a parser function how do we expose it to cumin" (for option b) or "do we really want to distribute a netbox API key to every host and have 1200 hosts poll the netbox API" (for option a) [23:27:24] and "do we really want puppet runs to fail if netbox is down", which is why I've been thinking about this as coming after the whole netbox HA project that chaomodus is working on [23:29:43] puppet has introduced "trusted facts" and "server facts" too, but I'm not sure if these are extendable or are only the built-in ones [23:34:48] can it cache facts? eg. if netbox is down still run with whatever previous data it returned
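On the caching question: one way this could look is a Facter external fact that queries the Netbox DCIM API and falls back to a locally cached copy when Netbox is unreachable. A sketch only -- the URL, token file, cache path and fact names below are assumptions, and deriving the row from the first character of the rack name is a simplification.

#!/usr/bin/env python3
# Hypothetical external fact: drop an executable like this into
# /etc/facter/facts.d/ (or the puppetlabs equivalent) and Facter picks up
# the key=value lines it prints as facts.
import json
import socket
import sys
import urllib.request
from pathlib import Path

NETBOX = "https://netbox.example.org"        # placeholder URL
TOKEN_FILE = Path("/etc/netbox-ro.token")    # assumed read-only API token
CACHE = Path("/var/cache/netbox_location.json")


def fetch_from_netbox(host):
    req = urllib.request.Request(
        f"{NETBOX}/api/dcim/devices/?name={host}",
        headers={"Authorization": f"Token {TOKEN_FILE.read_text().strip()}"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        device = json.load(resp)["results"][0]
    return {
        "rack": device["rack"]["name"],    # e.g. "D5"
        "row": device["rack"]["name"][0],  # crude: first character of the rack
        "site": device["site"]["slug"],
    }


def main():
    host = socket.gethostname().split(".")[0]
    try:
        location = fetch_from_netbox(host)
        CACHE.write_text(json.dumps(location))     # refresh the cache
    except Exception:
        if not CACHE.exists():
            sys.exit(0)                            # no data at all: emit no facts
        location = json.loads(CACHE.read_text())   # Netbox down: reuse last data
    for key, value in location.items():
        print(f"location_{key}={value}")


if __name__ == "__main__":
    main()

This sidesteps "puppet runs fail if netbox is down" at the cost of serving stale location data until the next successful refresh; the concern about 1200 hosts hitting the Netbox API on every run would still need rate limiting or a cache/proxy in front of the API.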
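The same exported data could also feed the rack/row label idea from 21:27. Prometheus relabel_configs can only rearrange labels a target already carries, so in practice rack/row would be attached as target labels, for example by generating file_sd target files from Netbox. A hedged sketch -- the output path and label names are made up, and only the mw1257 -> D5 mapping comes from the log:

import json

# Hypothetical hostname -> (rack, row) mapping exported from Netbox;
# mw1257 -> D5 is the example from the log, the rest would come from the API.
LOCATIONS = {
    "mw1257": ("D5", "D"),
}

# One target group per host, carrying rack/row as target labels.
# Prometheus file_sd re-reads the file automatically when it changes.
groups = [
    {
        "targets": [f"{host}:9100"],             # node_exporter's default port
        "labels": {"rack": rack, "row": row},
    }
    for host, (rack, row) in LOCATIONS.items()
]

with open("/etc/prometheus/targets/node_rack.json", "w") as f:  # assumed path
    json.dump(groups, f, indent=2)

# With the labels in place, per-rack aggregation is a one-liner, e.g.:
#   sum by (rack) (rate(node_netstat_Tcp_RetransSegs[5m]))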