[15:05:32] volans (or whoever), I have a couple of servers that I'm going to rebuild today and I'd sort of like to rename them as well, but the rename is optional. What is the current state of the docs/script/netbox setup for renaming? Should I postpone the renaming until work there has settled down?
[15:18:34] Riccardo is off today, but Cas might know when he's around later
[15:19:58] Ah, it's Epiphany, isn't it
[15:20:29] not sure, one of those Catholic ones :-)
[15:24:11] but if this is about labweb*, maybe just reimage them with the lab* naming scheme; after all, they will be folded into the main app server pool at some point anyway
[15:38:41] herron: hi, I've been told that you can help me sort out some magic happening on the cloud openstack wmflabs.org domain, is that correct?
[15:41:43] hey dcaro, questions re wmflabs are best directed to the team in the #wikimedia-cloud channel
[15:43:40] herron: that's me :), I've been told you helped set it up though
[15:46:56] ah, I thought it was an openstack question, what sort of magic are we talking about?
[15:47:41] mail magic :-)
[15:50:32] moritzm: that's a good point, although the wikitech move seems stalled
[15:50:50] herron: kinda, yes :), there was a dns record mx-out01.wmflabs.org that stopped resolving internally and externally, though I still see it on openstack as active
[15:52:05] I cannot see it in the pdns mysql database though (in the records table)
[15:52:28] the ptr is still there
[15:52:43] (see T271322)
[15:52:44] T271322: [mx] check what happened to mx-out01.wmflabs.org - https://phabricator.wikimedia.org/T271322
[15:53:22] hmm, I am not sure about the inner workings of dns in that env, but I think some changes were happening yesterday regarding acme-chief in wmflabs and bits to support that. arturo, andrewbogott: does that symptom seem possibly related?
[15:53:43] *DNS bits to support that
[15:54:12] it's related, but I don't think I deleted any entries.
[15:54:23] so maybe this is going to be an artu.ro question instead
[15:56:03] herron: I have a theory, will investigate and poke you later if need be
[15:56:16] andrewbogott: kk sounds good
[16:59:15] epiphany indeed, and it ain't just for catholics, the greeks have it off too :-P
[17:36:39] boy, mcrouter really doesn't work at all if one of its servers is offline :(
[17:38:57] andrewbogott: that does not match our expectations :) if you're seeing problems, can you say more?
[17:39:09] effie: fyi ^
[17:39:22] yeah, I'm making a task, one moment...
[17:40:21] some work is in progress on https://phabricator.wikimedia.org/T213089, so if you're seeing disruption it's probably related, but that was supposed to be relatively unimpactful: just some keys being rerouted (and consequently being flushed and then re-cached)
[17:41:11] rzl: https://phabricator.wikimedia.org/T271349
[17:41:27] ohhhh
[17:41:32] wow okay, you're talking about a totally different thing :) cool
[17:41:40] yeah
[17:41:58] I imagined that I had a redundant cluster running there, but instead I guess it's a two-engine jet :(
[17:42:09] lmk if you see an obvious mistake in my mcrouter config there
[17:42:36] yeah, it's likely that AllSyncRoute is not what you want
[17:42:57] AllSyncRoute
[17:42:59] Immediately sends the same request to all child route handles. Collects all replies and responds with the "worst" reply (i.e., the error reply, if any).
[17:43:05] oh? I need a token created on one host to be visible on the other host...
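[editor's note: for context, a minimal sketch of the kind of two-backend AllSyncRoute setup being described. The pool name and server addresses are invented, not taken from T271349; mcrouter configs are JSON, and its parser accepts C-style comments.]

    {
      "pools": {
        "sessions": {
          // hypothetical two-host pool; the real hostnames are not in the log
          "servers": ["backend1.example.org:11211", "backend2.example.org:11211"]
        }
      },
      "route": {
        "type": "OperationSelectorRoute",
        "default_policy": "PoolRoute|sessions",
        "operation_policies": {
          // Every write goes to both backends, and mcrouter answers with
          // the *worst* reply. With one backend offline, every write is
          // reported as an error even when the other host stored it fine.
          "set": {"type": "AllSyncRoute", "children": "Pool|sessions"},
          "delete": {"type": "AllSyncRoute", "children": "Pool|sessions"}
        }
      }
    }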
[17:43:17] AllSyncRoute returns an error if one of the backing requests was unsuccessful, even if the other succeeded, so your request might have worked fine on the host that was up
[17:43:40] oh! Ok, that's clearly the worst case scenario :) What do I want instead?
[17:43:54] (I remember I changed it to AllSyncRoute after encountering split-brain issues earlier)
[17:43:57] it's not necessarily the worst case! it just means that an error doesn't mean the request failed completely
[17:44:09] so your options are to change to a different route handle or to treat the error differently
[17:44:36] think of that error as being extra-cautious, telling you that something went wrong even though something else might have gone right
[17:45:23] https://github.com/facebook/mcrouter/wiki/List-of-Route-Handles is worth a read, there's a lot of choices but whichever one best matches your use case will likely stand out
[17:46:26] note though that if you're concerned about split-brain issues, you'll have other problems
[17:46:58] no matter what route handle you use, if you take one host down, send some writes, and then bring it back up, the two hosts will be in different states
[17:47:13] the split-brain was when both hosts were up
[17:47:30] basically if I logged in and reloaded, I had a 50/50 chance of still being logged in depending on which host lvs chose
[17:47:39] sure, I'm just saying if you really care about consistency across two different memcache hosts, you'll still need to address that
[17:48:01] and also, if you really care about consistency across two different memcache hosts, you are doomed :) memcache is not good at that
[17:49:18] looks like AllFastestRoute is closer to what I want. AllMajorityRoute won't work with only two backends...
[17:49:31] (for writing at least)
[17:50:41] sure -- that just means that if writes fail on one host, you won't find out about it
[17:50:55] (and later reads from that host will come up empty, even if the other host has an entry for that key)
[17:51:11] yeah. Which is OK since 'get' is using AllLatestRoute, which includes a failover
[17:51:20] at least as I understand this
[17:52:51] note a cache miss is not an error -- I'm not sure, but I doubt AllLatestRoute goes to the failover on a cache miss
[17:53:02] oh :(
[17:53:32] So maybe I want MissFailoverRoute
[17:56:13] sure, as long as you don't care about having the opposite problem with deletes :)
[17:56:21] e.g. if I log out, but the request fails on one host, I'll still be logged in
[17:56:54] that's a lot less likely, but also maybe a lot more severe
[17:57:47] hmmmm
[17:58:46] That would require the delete request to fail but the cache entry to still be there later. Which presumes that the system was up but writes failed...
[17:58:55] I guess you said, "a lot less likely"
[17:59:14] a temporary network interruption is one of the fairly plausible ways you could get that
[17:59:27] yeah
[18:00:09] the changes you suggested are definitely improvements on what you've got, but I think you're trying to make memcache do something it's not good at
[18:00:36] there is probably no way to mcrouter-config your way out of this unfortunately :)
[18:01:32] yeah, I'm just looking for a config that will make things 'better' — clearly 'perfect' isn't going to happen without a redesign
[18:01:47] sure
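[editor's note: a hedged sketch of where the conversation lands, with reads failing over on a miss and writes taking the fastest reply. The tradeoffs rzl describes are noted inline; the pool name and addresses are again invented for illustration.]

    {
      "pools": {
        "sessions": {
          // hypothetical backends, as in the earlier sketch
          "servers": ["backend1.example.org:11211", "backend2.example.org:11211"]
        }
      },
      "route": {
        "type": "OperationSelectorRoute",
        "default_policy": "PoolRoute|sessions",
        "operation_policies": {
          // Reads try the first backend and fall back to the next on a
          // miss, so a key written to only one host can still be found.
          "get": {"type": "MissFailoverRoute", "children": "Pool|sessions"},
          // Writes answer with the first non-error reply; a failure on
          // the other host is swallowed, so it can silently miss a write.
          "set": {"type": "AllFastestRoute", "children": "Pool|sessions"},
          // Deletes have the opposite hazard: if one host misses the
          // delete, MissFailoverRoute reads can resurrect the session.
          "delete": {"type": "AllFastestRoute", "children": "Pool|sessions"}
        }
      }
    }

As the log concludes, no route-handle choice makes two memcached hosts consistent; it only changes which failures are visible and which go unnoticed.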