[05:33:47] as an FYI, I am going to roll-restart nutcracker on all the mw2* nodes (very slowly) to pick up the new config changes (no more memcached config)
[06:18:30] elukey: :D
[06:44:08] <_joe_> why slowly
[06:44:13] <_joe_> go on and jfdi
[06:44:22] <_joe_> it's codfw, who cares
[06:51:24] elukey: no need, these will be rebooted today or tomorrow anyway
[07:01:47] moritzm: already done, I forgot about that!
[07:03:47] ok :-)
[07:11:30] moritzm: now that you have mentioned it, I'd like to apply the same change to nutcracker in eqiad, but it needs a restart of the daemon so redis sessions might be affected... my original thought was to depool/restart/pool, but maybe we can couple it with the reboots?
[07:12:23] yeah, let's do that
[07:13:42] I have https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/510153/ ready to merge to test the change on 3 mw1* servers
[07:13:55] I'll merge/deploy later on, just to make sure that nothing explodes
[07:14:03] sounds good, that way we ensure that we don't run into any surprises when the reboots happen
[07:14:07] yep!
[07:15:02] moritzm: the -R 200 option we have been adding to memcached has been working alright so far
[07:15:24] ack
[07:15:30] I will push a change at some point to prevent memcached from being restarted when we make changes
[07:16:01] so the change will be there waiting for the next server restart/reboot
[07:17:34] sounds good!
[07:38:36] jijiki: it might be tricky since the -R change goes into the systemd unit definition, maybe we could roll out the change before every reboot? (since we'll have to wait for the shard to be re-populated anyway before rebooting another one..)
[07:39:09] moritzm: there is also another caveat - if we reboot a mc* host, then mcrouter will fail requests until it comes back up
[07:39:24] so say for some reason we get a hw failure upon reboot
[07:39:46] then we'll need to re-shard everything.. (so remove the shard from the mcrouter config)
[07:40:39] we have a plan (still not sure when it will be executed) to add a "gutter" pool to the memcached shards, basically 3/4 hosts that will act as a "backup" cluster
[07:41:03] do we have time to do this now though?
[07:41:03] rebooting with that should be way better
[07:41:06] nope
[07:41:09] haha
[07:41:12] I mean, we still need the hw
[07:41:29] it's not that bad, we've run into broken mc* hosts upon reboots before
[07:41:43] my take was that since we are rebooting them, they could come back up with -R
[07:42:00] moritzm: even with mcrouter?
[07:42:11] (asking out of curiosity)
[07:42:27] if we change the unit file without restarting the service
[07:42:34] it should be ok
[07:42:48] good point! that might have been pre-mcrouter actually
[07:45:00] jijiki: not sure if I am wrong or not, but systemd might play the "restarter" role here, since one of its units changes
[07:46:00] iirc systemd will not restart the service
[07:52:24] if so, all good
[07:52:32] but we need to be super sure
[07:54:32] well, we could disable puppet, push the change and try one
[07:54:42] the usual way :p
[08:03:02] if in codfw, yes :)
[08:15:41] elukey: so looking at the code a bit, the reload is triggered by base::service_unit
[08:15:53] we have Boolean $refresh = true,
[08:17:10] so in theory if we set this to false, it will not notify the service
[08:17:21] anyway, we'll figure it out
[08:47:49] <_joe_> what is still using base::service_unit?
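A minimal sketch of the "change the unit file without restarting the service" idea discussed above (07:42-07:52), assuming the new -R option has already been shipped into memcached's unit file and that only Puppet's notify/refresh, not systemd itself, could trigger a restart; host and unit names are illustrative.

```
# Re-read the changed unit file; daemon-reload alone never restarts a service.
sudo systemctl daemon-reload

# The running process keeps the old arguments until the next restart/reboot;
# compare its command line with the new ExecStart in the unit file.
tr '\0' ' ' < /proc/"$(pgrep -o memcached)"/cmdline; echo
systemctl cat memcached | grep ExecStart
```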
[08:48:03] <_joe_> first step is to convert that to systemd::service
[08:48:29] <_joe_> systemd::service has refresh = false by default, too
[08:48:30] ok, we can do that as well then
[08:48:39] good, I will do it
[08:49:36] <_joe_> moritzm: we're done with trustys, right?
[08:50:08] there's still labstore1003
[08:50:40] Brooke is in the process of migrating it to the cloudstore* hosts currently
[08:51:44] that should be really, really close to completion
[08:58:02] <_joe_> but I expect that not to use memcached
[08:58:18] <_joe_> what's the situation of toollabs? any idea?
[08:58:52] all resolved
[08:58:56] same for cloud vps
[09:01:45] hey, first time using conftool here, I would like to validate the commands beforehand with any of you
[09:02:02] I want to depool labweb1001 in a couple of hours due to an operation in the rack
[09:02:12] these are the commands I generated:
[09:02:15] https://www.irccloud.com/pastebin/90LopRvY/
[09:02:29] * jijiki bbiab
[09:14:01] arturo: looks good, you can drop the "--service labweb", then confctl will depool all services (and there's only labweb on this server)
[09:14:10] or simply run "depool" on labweb1001 itself
[09:14:41] ok
[09:14:44] thanks moritzm !
[11:37:14] are the diamond::remove hiera keys still needed?
[11:37:26] or are they just leftovers?
[11:40:04] jynus: there is one last thing in labs
[11:40:10] so we still need it
[11:40:16] I see, thanks
[11:40:56] however, some extra roles may have been added since, so they may need review
[11:41:55] iirc we are waiting for wmcs to port some metrics to prometheus
[11:42:27] oh, I don't have a problem with keeping those
[11:42:43] just that the current intended state and the actual one may be different
[11:42:51] hmm
[11:43:05] e.g. maybe we are starting to install it in production on some hosts
[11:43:12] with new roles
[11:47:30] godog: can I ack alerts for services for restbase1020-1027?
[12:14:05] jynus: I'll take care of that! no worries
[12:14:24] I silenced the hosts before puppet ran with the final role -> those checks were not silenced
[12:14:25] I can also do it, it annoys me specifically
[12:15:18] but I'm not touching them without your ok
[12:16:24] hehe, no worries -- done
[12:16:29] thanks for the heads up
[12:17:06] no, thank you for doing it!
[12:17:18] jynus: these were used for roles to opt out of Diamond once all Grafana dashboards/exporters related to the role were fixed; strictly speaking these can be added to new roles, but as soon as https://phabricator.wikimedia.org/T210850 is resolved, Diamond will be removed from all production hosts
[12:17:27] so don't bother, not really worth it
[12:18:00] moritzm: my concern is whether it will be installed on new hosts with new roles where that is not added?
[12:18:17] or maybe you mean yes, but it's not a huge issue?
[12:19:54] it's not a huge issue to have Diamond running on a new role, as it will be removed in a few weeks once T219850 is resolved, it just wastes a little disk space/CPU cycles
[12:19:55] T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850
[13:36:42] ottomata: I'm looking at T219544, which host did you run distcp from?
[13:36:43] T219544: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544
[13:41:45] I'll guess deployment-hadoop-test-1
[13:54:42] akosiaris: I plan to start rebooting ores, any issues I should be aware of?
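A rough sketch of the two simpler alternatives moritzm suggests in the 09:02-09:14 conftool exchange above; the pastebin contents are not shown in the log, and the select-based confctl syntax plus the depool/pool wrapper scripts are assumptions about the tooling rather than a transcript of arturo's actual commands.

```
# From a cluster management host: select the node by name only, with no
# service filter, so every service on it gets depooled (here just labweb).
sudo confctl select 'name=labweb1001.eqiad.wmnet' set/pooled=no

# Repool it once the rack operation is done.
sudo confctl select 'name=labweb1001.eqiad.wmnet' set/pooled=yes

# Or run the wrapper scripts directly on labweb1001 itself.
sudo depool    # and later: sudo pool
```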
[14:23:26] so just to rule things out, moritz's restart was in codfw, which shouldn't have an effect
[14:23:48] and jbond42 shouldn't affect it either, and I think he started after the event
[14:23:55] jbond42's ores reboots, I mean
[14:24:30] I have not done the ores ones, I was waiting for akosiaris. I have rebooted aqs1004 and was about to start on 1005
[14:24:49] yeah, so unrelated
[14:24:55] yep
[14:27:06] let's log the reboots in the SAL please
[14:27:12] even at a high level, per cluster
[14:27:13] ps1-b5-eqiad shows as down 34 minutes ago, could that cause some issue?
[14:27:17] so we know what's being worked on
[14:27:20] (not in theory)
[14:27:28] but it is the only thing I can see correlating
[14:27:41] jynus: are you talking about the db connections increase?
[14:27:45] yes
[14:27:56] I am trying to find some explanation
[14:27:58] is the timeframe 13:44-54, more or less?
[14:28:05] oh, I thought sre.hosts.downtime logged something, I will do it
[14:28:06] #ops was really noisy
[14:28:12] jbond42: <3
[14:28:42] jynus: if so, one interesting thing that happened is https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1558013808052&to=1558014892928
[14:28:46] yeah, it logs
[14:28:54] elukey: I'd say 52-55
[14:29:12] ah ok
[14:29:34] but could it be fallout? the granularity of prometheus is 1 minute
[14:29:37] so we had one or more codfw mcrouter proxies (mw2* hosts basically) down due to reboots
[14:29:44] not mw2*
[14:29:48] this was eqiad
[14:29:56] yes, lemme explain :)
[14:30:00] ah
[14:30:02] mcrouter
[14:30:07] exactly
[14:30:11] the errors were in eqiad
[14:30:16] for mediawiki
[14:30:21] maybe a codfw -> eqiad mc -> db dependency
[14:30:31] a loss of mc may mean lots of db connections
[14:30:36] yeah
[14:31:03] although logically it shouldn't, it should only lead to stale values
[14:32:36] some keys for set/delete are replicated to codfw, atm in a sync fashion... so I think that a codfw proxy down (or more) may lead to mcrouters in eqiad being upset, returning errors to mediawiki, which in turn hammers the db
[14:32:59] for example, lots of hammering to check replication status
[14:33:21] which in turn leads the query killer to start killing connections
[14:33:47] maybe this should lead to an investigation to mitigate this kind of fallout
[14:34:07] things can always fail, and it shouldn't escalate
[14:34:36] in this case it "only" added 2 seconds of latency per connection
[14:35:01] but maybe we could encourage the app behaviour to not overload if mc is unavailable
[14:35:16] (either to fail harder or more gracefully)
[14:35:31] any thoughts?
[14:36:55] elukey: this wasn't an issue for all the previous mw* reboots, which change would have made that start failing?
[14:37:34] yeah, the weak point is the cross-dc impact, how come that happens (the other part is more understandable)
[14:37:56] is it because the propagation is sync and not async?
[14:37:56] moritzm: I am only speculating now, so not really sure... maybe this time there was more than one proxy down at the same time?
[14:38:17] which hosts are the mcrouter proxies in codfw?
[14:38:20] lemme check which ones we have
[14:39:05] mw2235 mw2255 mw2163 mw2271
[14:39:13] from mcrouter.yaml
[14:40:36] only mw2163 was rebooted so far
[14:41:31] are we already sure that https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1558013808052&to=1558014892928 correlates with the reboot?
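On the TKO spike being debated just above: a hypothetical way to check, from an eqiad mw* host, which destinations the local mcrouter currently considers TKO'd. The port number and the availability of a "stats suspect_servers" admin command are assumptions, not something confirmed in the log.

```
# Ask the local mcrouter which servers it has marked as suspect/TKO;
# 11213 is assumed to be the port mcrouter listens on.
echo "stats suspect_servers" | timeout 2 nc localhost 11213
```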
[14:42:10] as I assume that is still in progress, and there's no other big spike
[14:44:06] mw2163 came up at 13:45:46 UTC, I don't have the exact time when it went down, but with that hardware it's usually 2-3 minutes
[14:44:51] so yeah, the spike in TKOs certainly correlates, but the question is whether/how it explains the DB issues
[14:45:35] so looking at the dbs, I am not sure it is the same
[14:45:49] several got stalled (low running connections)
[14:46:03] but I don't see a spike of new connections
[14:46:29] we can simply try to repro with the reboot of mw2235, mw2255 or mw2271, but not today with the PDU swaps overlapping
[14:46:37] +1
[14:46:42] to both suggestions
[14:48:04] let's do that tomorrow?
[14:59:18] sorry, I was in a meeting
[14:59:24] +1 from my side
[14:59:41] I hope it is only a coincidence, but it was the only thing that lined up
[15:01:14] jynus: whoops, we've got a double merge in puppet-merge
[15:03:29] deploy
[15:03:42] I was about to click yes
[15:03:56] ahah okay, mine's trivial, so if you have it going it's a yes
[15:04:35] merging
[15:04:36] it is ongoing now
[15:04:40] ah okay
[15:05:01] cool, thanks :)
[15:05:23] we should follow the ethernet way of solving conflicts
[15:05:30] wait a random time, then ask again
[15:05:52] yes
[15:34:40] jbond42: as long as the ores[12]* reboots happen in small groups (1-3 hosts) it should be fine. same goes for orespoolcounter[12]*. oresrdb is a different thing however, that one will cause an outage without a proper process
[15:35:41] akosiaris: cheers, I'll leave oresrdb for now then
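For the ores reboot guidance at the end: a hypothetical way to keep the reboots in small batches with cumin, per akosiaris' advice. The host selection, the -b/-s batching flags, and the sleep value are assumptions, and oresrdb* is deliberately excluded since it needs a proper process of its own.

```
# Reboot the codfw ores workers two at a time, pausing between batches;
# Icinga downtime would still be set first (e.g. via the sre.hosts.downtime
# cookbook mentioned earlier in the log).
sudo cumin -b 2 -s 600 'ores2*' 'reboot'
```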