[07:04:51] greetings
[08:58:20] morning!
[09:38:11] morning
[11:45:53] * dcaro lunch
[15:31:09] I'm going to schedule the eqiad1 openstack upgrade for (my) Monday afternoon. Concerns, anyone?
[15:32:15] do you need any help/support/extra eyes?
[15:33:43] I shouldn't unless something very surprising happens
[15:34:46] I picked the late-ish time to avoid meeting overlaps.
[15:35:24] ack, I have no issues with it (and no blockers I know of)
[15:39:54] SGTM, I'll likely push the switch reboot test to wed
[15:40:23] godog: if you were going to do that on Monday I can delay. I'm entirely flexible!
[15:40:50] andrewbogott: no all good, I was planning on Tues, though Wed works too, I'm in no rush
[15:40:56] 'k
[15:41:37] andrewbogott: speaking of which, when you have time I have a question for you re: cloudvirt rack drain in https://phabricator.wikimedia.org/T417393#11696203
[15:44:33] godog: I responded on task -- that cookbook is pretty much what you need.
[15:44:56] sweet thank you
[15:46:09] the only hangup with that cookbook is that live migration relies on RAM syncing across two hosts. A VM with a ton of RAM or very busy memory access sometimes never syncs and fails to migrate. There's no general solution, although I've been chipping away to make it less likely.
[15:47:32] makes sense, in such cases what's the effect on the cookbook when a live migration times out I guess ?
[15:49:25] The cookbook moves every VM it can, and retries failed VMs, and then errors out if the retry fails.
[15:51:02] ok thank you
[15:52:37] sometimes rerunning the cookbook helps (though not always)
[15:52:52] quick review (unrelated) https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/314 , adding `restore-all` to `toolforge_deploy` in lima-kilo
[15:53:18] oh, I think there's already a script xd
[16:05:45] topranks, taavi, I'm working on refreshing cloudgw2002-dev and it has so many IP addresses!
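The drain behavior described at 15:49:25 (move every VM, retry the failures, error out only if a retry also fails) can be sketched roughly like this. This is a minimal illustration, not the actual wmcs cookbook code; `drain_host` and `live_migrate` are hypothetical names:

```python
def drain_host(vms, live_migrate):
    """Sketch of the drain flow: attempt every VM once, retry the
    failures, and raise only if a VM fails both attempts.

    `live_migrate` is a callable returning True on success; in the
    real cookbook this would be an OpenStack live migration call
    that can time out when a busy VM's RAM never finishes syncing.
    """
    failed = [vm for vm in vms if not live_migrate(vm)]
    # Retry once: a VM with busy memory may sync on a second attempt.
    still_failed = [vm for vm in failed if not live_migrate(vm)]
    if still_failed:
        raise RuntimeError(f"could not migrate: {still_failed}")
    return len(vms)  # number of VMs successfully moved
```

As noted in the log, rerunning the whole thing sometimes helps, so a wrapper that reruns `drain_host` on failure would mirror the manual workflow.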
Are these all things I can manually allocate and assign in netbox for the new host? 172.20.5.18, 2a02:ec80:a100:205::18, 185.15.57.11, 2a02:ec80:a100:fe04::2002:1, 208.80.153.189, 2a02:ec80:a100:fe03::2002:1 (see https://netbox.wikimedia.org/search/?q=cloudgw2002-dev )
[16:05:59] ...and is there any router magic needed to get traffic to the new IPs?
[16:08:35] for the latter: https://netbox.wikimedia.org/extras/changelog/264567/ which I'm just now deploying
[16:10:55] basically the vlan2107/vlan2120.cloudgw2002-dev.codfw1dev.wikimediacloud.org addresses will need to be moved from 2002-dev to 2004-dev
[16:13:23] oh, so you're thinking move the addresses rather than add a third working node and then remove the oldest
[16:13:41] that's certainly easier on me
[16:13:57] those v4 subnets are /29s so you couldn't even fit a third node in there
[16:16:53] ok. Do you care when I start breaking things
[16:16:55] ?
[16:17:40] not in this case
[16:24:53] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1250638
[16:26:09] seems reasonable
[16:26:40] thx
[16:27:00] want me to swap the dns names in netbox?
[16:28:19] I need the practice, let's see if I can.
[16:31:26] hm, 2004-dev is missing a v6 address, I wonder if that was a dcops misstep
[16:32:56] oh no it isn't
[16:34:05] taavi: ok, after making all those netbox dns changes I need to run sre.dns.netbox cookbook, correct?
[16:34:23] indeed
[16:35:18] also you'll want to move the vlan interfaces from https://netbox.wikimedia.org/dcim/devices/3026/interfaces/ to 2004-dev
[16:38:33] is that possible? hostname is greyed out on the edit interface, do I need to delete and recreate?
[16:39:01] hrm, probably :/
[17:00:05] hmph
[17:05:28] need any help?
[17:07:28] no, I think things are working right now. I just had 2002-dev in an in-between state for a minute
[17:07:41] let's try a failover to the new node...
[17:08:20] nope, no dice.
things on 2004-dev look right but traffic doesn't work
[17:08:32] so yes, now I need help :)
[17:09:06] cloudgw2002 still has those addresses assigned, which probably explains at least some weirdness
[17:09:26] I'm just going to remove those VLANs from the cloudgw2002-dev switch port to stop that
[17:09:31] ok
[17:09:43] 208.80.153.190 and 185.15.57.9 you mean?
[17:09:55] vlan2107@eno1 UP 185.15.57.11/29 2a02:ec80:a100:fe04::2002:1/64 fe80::2eea:7fff:fe7b:e104/64
[17:09:55] vlan2120@eno1 UP 208.80.153.189/29 2a02:ec80:a100:fe03::2002:1/64 fe80::2eea:7fff:fe7b:e104/64
[17:10:28] oh yeah, that would do it
[17:10:38] {{done}}
[17:10:41] any better now?
[17:10:53] not so far
[17:11:13] going to flip to 2003-dev and back
[17:12:13] hm, now those addresses aren't attached to 2003-dev or 2004-dev. I thought /that/ part was working
[17:12:41] I see them on 2004-dev
[17:12:44] oh
[17:12:55] did you reboot 2004-dev after the puppet run that applied the role to it?
[17:13:25] I didn't
[17:13:40] shall I?
[17:13:49] yes
[17:14:10] profile::wmcs::cloudgw does various things with interface::post_up_command which needs that
[17:14:15] ok
[17:14:53] meanwhile, traffic is now working properly on cloudgw2003-dev so that's something
[17:15:50] not having duplicate IPs in the network will indeed do that :P
[17:17:15] ok, going to try the failover again
[17:17:30] works
[17:17:40] neat
[17:17:52] yep!
[17:18:00] I'll give this a bit and then decom 2002-dev. Thanks!
[17:22:25] * andrewbogott -> lunch
[18:28:00] * dcaro off
[18:28:02] cya tomorrow!
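The root cause diagnosed above was the same VIPs being configured on both cloudgw2002-dev and cloudgw2004-dev at once. That condition is easy to catch mechanically from `ip -br addr` output in the format pasted at 17:09:55. A minimal sketch, with `parse_brief_addr` and `duplicates` as hypothetical helper names (not an existing tool):

```python
def parse_brief_addr(output):
    """Parse `ip -br addr`-style lines into {interface: [addresses]}."""
    ifaces = {}
    for line in output.strip().splitlines():
        parts = line.split()
        name, _state, addrs = parts[0], parts[1], parts[2:]
        # Drop prefix lengths; skip link-local fe80:: addresses, which
        # legitimately repeat across hosts.
        ifaces[name] = [a.split("/")[0] for a in addrs
                        if not a.startswith("fe80:")]
    return ifaces

def duplicates(host_outputs):
    """Given {hostname: ip-br-addr output}, return addresses that are
    configured on more than one host at the same time."""
    seen = {}
    for host, output in host_outputs.items():
        for addrs in parse_brief_addr(output).values():
            for addr in addrs:
                seen.setdefault(addr, set()).add(host)
    return {addr: hosts for addr, hosts in seen.items() if len(hosts) > 1}
```

Fed with the output from both cloudgw hosts during the in-between state, this would have flagged 185.15.57.11 and 208.80.153.189 (and their v6 counterparts) immediately.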