[04:50:08] we will disable es5 writes in 10 minutes to do a switchover on its primary master, if all goes well, it should be transparent for everyone
[05:37:48] this was done smoothly
[05:41:43] marostegui: congrats :)
[07:03:02] <_joe_> how manual is a master switchover right now?
[07:03:25] it's quite manuel
[07:03:46] ha
[07:05:10] I think the main thing left is dbctly integration
[07:05:16] *dbctl
[07:05:25] I started working on it but I got distracted
[07:05:54] however in this case, there is no dbctl support for es operations, as they work differently
[07:06:09] (es* services can depool a master, unlike metadata ones)
[13:36:56] yeah we could make dbctl be smarter about es instances
[13:37:37] not sure, it is diminishing returns at some point
[13:38:17] for example, es configuration is not as time critical as metadata, as the service doesn't go fully read only
[13:38:39] however, automation is the part that would benefit from controlling more on etc than static files
[13:38:44] *etcd
[13:39:30] e.g. to be able to do automatic failover at some point
[13:39:35] hello folks, I am planning to rollout https://gerrit.wikimedia.org/r/c/operations/puppet/+/607026, that will require roll restarting mcrouter
[13:39:52] if this is not the right time please speak up :)
[13:42:17] how long do you think it will take?
[13:43:11] if it is a long time, do you mind if I merge another mw-infra patch first, so they don't collide?
[13:43:34] nono please go ahead, it will take a while since it needs depool/restart/pool, no hurry :)
[13:43:56] ok, mine should not interfere, but it is mw, so that way we have clear logs
[13:44:46] give me 5 minutes tops
[13:45:48] thank you, that why I unblock releng
[13:45:52] *way
[13:46:38] yes yes even 30 mins, I'll check later :)
[13:46:48] I will ping you when done
[14:16:02] elukey: sorry it is taking more than I expected, but it is finishing now
[14:16:28] thanks! np, I had some analytics-related work to do
[14:21:13] wait, elukey
[14:21:16] please don't deploy yet
[14:21:27] it is about to finish, but it hasn't yet
[14:21:45] I need puppet to run on deployment hosts
[14:22:35] ah I misunderstood sorry, puppet is disabled on a lot of nodes, also deployment ones, feel free to reenable and run in there
[14:22:38] no problem
[14:22:48] I thought it was done, apologies
[14:22:59] ok, I am going to reenable it on deploy2001
[14:23:06] also I noticed Info: /Stage[main]/Mcrouter/File[/etc/default/mcrouter]: Scheduling refresh of Service[mcrouter]
[14:23:10] I can disable it again with the same message
[14:23:15] if you want
[14:23:16] nono no need
[14:23:16] ok?
[14:23:37] ok, running puppet on deploy2001
[14:25:14] I just need to make sure things are clean for you to continue
[14:25:34] sorry for the delay
[14:28:57] cdanis,rzl time for a chat?
[14:29:39] the mcrouter's systemd::service settings list restart => true, of course I only realized it now
[14:30:16] for me that should be false
[14:30:19] thoughts?
[14:30:28] you don't want any auto-restart on a crash?
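The roll restart mentioned above needs a depool/restart/pool cycle per host. A minimal sketch of what that loop could look like, assuming each host has the usual depool/pool conftool wrapper scripts and is reachable over SSH; the host names, sudo invocation and wait times are illustrative guesses, not the actual procedure or the 'restart-mcrouter' helper referenced later in the log.

```python
#!/usr/bin/env python3
"""Sketch of a rolling mcrouter restart: depool, restart, repool each host in turn."""
import subprocess
import time

HOSTS = ["mw2251.codfw.wmnet", "mw2252.codfw.wmnet"]  # illustrative host names only
SETTLE_SECONDS = 30  # assumed drain/warm-up pause between steps


def ssh(host: str, command: str) -> None:
    """Run a command on a remote host, raising if it fails."""
    subprocess.run(["ssh", host, command], check=True)


def roll_restart(hosts):
    for host in hosts:
        ssh(host, "depool")                           # take the host out of rotation
        time.sleep(SETTLE_SECONDS)                    # let in-flight requests finish
        ssh(host, "sudo systemctl restart mcrouter")  # the actual restart
        time.sleep(SETTLE_SECONDS)                    # let mcrouter reconnect to its shards
        ssh(host, "pool")                             # put the host back in rotation


if __name__ == "__main__":
    roll_restart(HOSTS)
```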
[14:31:15] no wait IIRC restart => true was for file changes
[14:31:32] yeah, it's for file changes
[14:31:36] ohh
[14:31:41] ofc
[14:31:51] and not something we want for mcrouter I think
[14:31:57] okay, yeah, I agree
[14:32:07] too much potential that a subtle behind-the-scenes change does harm
[14:33:54] change in https://gerrit.wikimedia.org/r/610071
[14:34:27] hmm that makes sense
[14:34:46] of course bad Luca should have checked in before doing maintenance
[14:34:49] so the result is that when we intend to change the config, we'll have to explicitly restart mcrouter after running puppet, right?
[14:34:54] *it
[14:35:05] rzl: exactly
[14:35:16] like Info: /Stage[main]/Mcrouter/File[/etc/default/mcrouter]: Scheduling refresh of Service[mcrouter]
[14:35:37] that is not a graceful reload like the json config change afaik
[14:35:40] but a brutal restart
[14:35:47] nod
[14:35:57] so we'll want a rolling depool-restart-repool kind of maneuver?
[14:36:42] exactly, I was trying to do it via 'restart-mcrouter' when I realized the auto-restart when testing on one node
[14:37:14] okay cool
[14:37:32] +1ed, thanks for spotting it!
[14:37:54] thanks! merging + continuing the maintenance
[14:38:56] some codfw mgmt issues?
[14:47:00] I have to enable puppet on deploy1001, too, elukey
[14:47:04] ok?
[14:47:15] +1
[14:49:42] ok so I keep seeing mcrouter restarted :D
[14:50:18] notify => Service['mcrouter'],
[14:50:23] * elukey cries in a corner
[14:58:40] elukey: sorry, we finally finished
[14:59:00] puppet is enabled on deploy* hosts
[14:59:07] let me know if you want me to disable it back again
[14:59:18] I am re-enabling it now! all good
[15:23:47] sorry again for the interruption
[15:36:16] so there are some tkos registered https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=9&fullscreen&orgId=1&from=now-1h&to=now
[15:36:29] that is weird
[15:37:21] ah no it is of course legit, the mw2 proxies
[15:38:46] so I made mistakes:
[15:39:09] 1) I should have kept mw2* proxies aside
[15:39:24] 2) I didn't think about the effect of 1 minute probe time on them
[15:40:01] the mw2* proxies are still black holes, so moving from probe time 3s -> 1 minute increases the impact when they are down
[15:40:45] rzl: need your opinion, sorry for the extra pings today
[15:42:01] this might require a rollback until we have a good solution for mw2* mcrouter proxies
[15:42:31] or we just accept this new tradeoff
[15:46:43] (I haven't rolled back yet since I think I already hit all the codfw proxies so the "damage" is already done)
[15:47:00] hmm I don't think I have full context -- why are they black holes?
[15:47:09] ah yes sorry
[15:47:20] also feel free to make a decision and fill me in later, I trust your judgment
[15:47:58] so those mw2* mcrouters are effectively like memcached nodes, so subject to the TKO evaluation process like any other mc10xx shard, BUT we don't have a gutter pool set for them (yet?)
[15:48:38] ohh okay, I didn't realize that
[15:48:50] there's no gutter pool in codfw at all?
[15:49:09] for the codfw proxies there is none
[15:49:19] since they are a separate pool
[15:49:21] oh just the proxies
[15:49:25] okay yeah that makes sense
[15:49:49] my idea was to add another 4 mw2* nodes as a gutter pool
[15:49:49] <_joe_> but the proxies themselves do reach a gutter pool
[15:50:19] <_joe_> so no reason for that, if they still go tko, it means we need a higher timeout maybe?
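For context on the probe-time concern above, a toy model of a destination that has no gutter pool behind it: once enough consecutive failures mark it TKO, regular traffic to it is simply dropped until a health probe succeeds. This is not mcrouter's actual implementation; the failure threshold, probe interval and request timeline are made-up numbers, and the only point is that a 60s probe interval keeps traffic black-holed for up to a minute after the backend recovers, versus roughly 3s before the change.

```python
"""Toy simulation of TKO behaviour for a destination with no gutter pool."""

FAILURES_UNTIL_TKO = 3   # assumed consecutive failures before marking the destination TKO
PROBE_INTERVAL = 60.0    # seconds between health probes once TKO (3.0 before the change)


class Destination:
    def __init__(self):
        self.consecutive_failures = 0
        self.tko = False
        self.next_probe_at = 0.0

    def send(self, now: float, backend_up: bool) -> str:
        if self.tko:
            if now >= self.next_probe_at:
                # Only a probe reaches the backend; regular traffic still fails fast.
                if backend_up:
                    self.tko = False
                    self.consecutive_failures = 0
                    return "recovered"
                self.next_probe_at = now + PROBE_INTERVAL
            # With no gutter pool configured, these requests are simply lost.
            return "black-holed"
        if backend_up:
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURES_UNTIL_TKO:
            self.tko = True
            self.next_probe_at = now + PROBE_INTERVAL
        return "error"


if __name__ == "__main__":
    d = Destination()
    # Backend is down for the first 30 seconds, then comes back; with a 60s probe
    # interval, traffic stays black-holed until roughly t=70s.
    for t in range(0, 120, 5):
        print(f"t={t:3d}s {d.send(float(t), backend_up=t >= 30)}")
```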
[15:51:01] a mcrouter in eqiad, calling a mcrouter proxy in codfw, would not reach any gutter pool IIUC
[15:51:37] they basically default to traffic blackhole
[15:51:54] (when a tko happens)
[15:52:42] I was suggesting 4 new mw2* nodes since they would act as a 1:1 replacement of the current codfw proxies, when any one of them is down
[15:52:58] (we could also have only two mw2* as proxy-gutter)
[15:53:12] (If I am missing something please tell me)
[15:58:43] in theory there shouldn't be any inconsistency in doing so, since if one of the mw2* is marked as TKO (say host down for maintenance) then the mcrouters in eqiad would divert to another mw2* in the "codfw-proxy-gutter" and just follow consistent hashing to reach mc2* shards
[16:08:23] elukey: I think I agree with you about the end state, especially because we want a basically symmetrical arrangement so that we can do a DC switchover and not end up with a different layout
[16:08:40] but for the immediate fix I don't think I have any strong opinions
[16:14:04] rzl: so I completed the rollout, I'd keep it for the moment, we can discuss the proxy-gutter this week if you have time
[16:14:37] sure, effie too
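A rough sketch of the "codfw-proxy-gutter" idea above, to illustrate why diverting to a gutter proxy should not cause inconsistency: the key-to-shard mapping depends only on the key, never on which proxy relayed the request. The host names, pool sizes and hashing here are illustrative assumptions, and plain modulo hashing stands in for mcrouter's real consistent hashing over weighted pools.

```python
"""Illustrative sketch: proxy selection with a gutter fallback vs. key -> shard mapping."""
import hashlib

PROXIES = ["mw2271", "mw2272", "mw2273", "mw2274"]          # stand-ins for the codfw proxies
GUTTER_PROXIES = ["mw2281", "mw2282"]                        # hypothetical proxy-gutter hosts
MEMCACHED_SHARDS = [f"mc20{i}" for i in range(19, 37)]       # stand-ins for the mc2* shards


def _hash(key: str, salt: str = "") -> int:
    return int.from_bytes(hashlib.md5((salt + key).encode()).digest()[:8], "big")


def pick_proxy(key: str, tko: set) -> str:
    """Pick a codfw proxy for this request, diverting to the gutter if it is TKO."""
    proxy = PROXIES[_hash(key, "proxy:") % len(PROXIES)]
    if proxy in tko:
        proxy = GUTTER_PROXIES[_hash(key, "gutter:") % len(GUTTER_PROXIES)]
    return proxy


def pick_shard(key: str) -> str:
    """The shard is a function of the key alone, regardless of which proxy carried it."""
    return MEMCACHED_SHARDS[_hash(key) % len(MEMCACHED_SHARDS)]


if __name__ == "__main__":
    key = "WANCache:v:enwiki:someobject"  # made-up key
    healthy = pick_proxy(key, tko=set())
    diverted = pick_proxy(key, tko={healthy})
    # Same shard in both cases, only the relaying proxy changes.
    print("healthy proxies:", healthy, "->", pick_shard(key))
    print("one proxy TKO:  ", diverted, "->", pick_shard(key))
```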