[09:22:15] hello there, I'm not sure who is online today that can review/approve this:
[09:22:16] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/203
[09:24:27] looking
[09:25:49] arturo: something seems to have happened to the comments with `tofu plan`, they are very hard to read :/
[09:26:22] taavi: I know. The gitlab support for code blocks inside comments, and specifically inside collapsed notes, is very limited
[09:27:02] is there a reason why there's a rule for v4 traffic from the dualstack network but not for v6 traffic from it?
[09:27:29] no, I think that's a good point, I will add it now
[09:33:27] done
[09:37:53] the tofu plan diff is so big because rules are reordered in the tofu state array, so instead of seeing +1 rule, it sees a massive array sorting operation
[09:40:07] does that mean tofu will delete and re-create those rules? :(
[09:40:40] we should maybe give those resources stable names so that does not happen in the future
[09:42:01] yes, instead of an array, use a map
[09:42:05] that's the way to avoid this
[09:43:45] yes, rules will be re-created
[09:43:56] also, the openstack API doesn't help either
[09:44:14] as a simple description change forces a replacement
[09:44:16] https://www.irccloud.com/pastebin/fShXOoJN/
[10:53:57] taavi: I need to merge that tofu-infra patch to test a few things
[10:59:43] arturo: ack. do you plan to fix https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/203 to use a map or leave that for later?
[11:00:30] taavi: We would need to relocate the state objects. There be dragons, we can do later.
[11:01:24] yeah. i'm not super happy about re-creating some of those rules, but guess that it's fine
[11:02:37] that diff is really annoying to parse though
[11:02:42] anyway, approved
[11:02:57] thanks
[11:04:52] tofu apply just completed, no issues
[11:10:49] anything left to do in T380728?
[11:10:49] T380728: openstack: network problems when introducing new networks - https://phabricator.wikimedia.org/T380728
[11:12:02] no, I will close it
[11:14:33] taavi: I would like to merge this one next: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/204
[11:15:03] I'm undecided if we need to schedule an operations window
[11:16:18] maybe not, but i'd also not deploy that on a day when this many people are not around
[11:20:49] fair
[11:33:41] today is a global holiday in the WMF, so I guess we will introduce this change Wednesday
[12:03:53] topranks: does this look good to you? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/205
[12:14:28] i left a comment
[12:25:10] thanks
[12:31:38] I think today isn't a global holiday but tomorrow is?
[12:31:53] But today is probably a holiday in most European countries so it might as well be :)
[12:33:24] andrewbogott: if you are online today, then I will definitely reconsider the IPv6 thing :-)
[12:34:12] I will be a bit distracted but am definitely working today.
[12:39:25] arturo: hey, I'm not working today
[12:39:41] that dns change looks correct though, for the two /64s in question
[12:39:44] we need to merge this one:
[12:39:45] https://gerrit.wikimedia.org/r/c/operations/dns/+/1113527
[12:40:16] which delegates the entire /56, but as I recall from last time designate will only set up zones for the subnets in use. should be fine I think.
[12:42:59] We should probably wait until Cathal is also around before we roll things out.
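An illustration of the array-vs-map point from the 09:37 to 11:00 exchange above, and of the state relocation arturo defers: the sketch below is not the actual tofu-infra code, the resource address and rule keys are invented, and a real migration would need one move per existing rule.

```bash
# Sketch of the list -> map migration deferred above ("we would need to
# relocate the state objects"). Addresses and key names here are invented.
#
# With a list, rules sit at positional addresses (...rule[0], ...rule[1]),
# so inserting one rule re-indexes, and therefore re-creates, everything
# after it. With for_each over a map, each rule keeps a stable, name-keyed
# address (...rule["dualstack-v4"]). Moving the existing state entries onto
# the new addresses is what avoids the delete/re-create churn:
tofu state mv \
  'openstack_networking_secgroup_rule_v2.rule[7]' \
  'openstack_networking_secgroup_rule_v2.rule["dualstack-v4"]'
# ...repeated (or scripted) once per rule, or written as `moved` blocks in
# the config, followed by a `tofu plan` that should come back empty.
```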
[13:29:27] ack
[14:55:39] chuckonwu: I just made a small change to https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/2 in order to make the pipeline green, after that it's all yours
[14:58:48] 👍 arturo, I'm watching the changes
[15:05:02] andrewbogott: I left comments on https://gitlab.wikimedia.org/repos/cloud/cloud-vps/go-cloudvps/-/merge_requests/2. note how that branch was rebased to pick up the new CI pipeline
[15:05:41] not touching the tofu-cloudvps patch yet, that will need to be merged/tagged first
[15:05:50] chuckonwu: I have now merged https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/2 and I will stop doing more changes. All yours now
[15:07:50] Thanks arturo
[15:07:51] * arturo notices the typo in the commit title too late
[18:36:46] andrewbogott: it seems like both quarry workers have now filled their disks :(
[18:39:21] let's see what happens when I tell magnum to provision more nodes
[18:44:00] did rebooting really help before, or were we fooling ourselves?
[18:45:14] it helped temporarily, but did not address the root cause
[18:48:17] andrewbogott: unsurprisingly trying to resize the cluster has done absolutely nothing
[18:48:35] that's interesting. No warning or anything?
[18:48:41] nothing i can see
[18:48:58] i don't really see a way forward except somehow getting shell access to the cluster, or just completely nuking it and creating a fresh one
[18:50:06] I can get console access but that only helps if they have default account + pwd...
[18:50:10] * andrewbogott checks the console just in case
[18:51:10] yeah
[18:51:10] quarry-127a-g4ndvpkr5sro-master-0 login:
[18:52:58] actually magnum thinks it's resizing...
[18:53:12] node count 4, update in progress
[18:53:18] I can't tell if yinz have kubectl access but if you do you can launch a debug pod to get host access with chroot /host
[18:55:18] I think we do -- taavi, do you?
[18:57:21] taavi, the kube config is quarry-bastion.quarry.eqiad1.wikimedia.cloud:/home/rook/quarry/tofu/kube.config
[18:57:24] yeah, let me try that
[18:59:03] aha
[18:59:06] found the issue
[18:59:09] quarry is leaking tmp files
[18:59:18] so that's why rebooting helped
[18:59:26] for reference: $ kubectl debug node/quarry-127a-g4ndvpkr5sro-node-0 -it --image debian:stable
[19:00:16] worker-0 should have more free space now
[19:00:19] * andrewbogott thinks there must be newer quarry docs than https://wikitech.wikimedia.org/wiki/Quarry/Quarry_maintenance_and_administration
[19:01:11] hmm, apparently i can't launch a debug pod on worker-1 because its disk is too full
[19:02:32] want me to reboot it?
[19:02:46] Rook: are there admin docs anywhere?
[19:02:51] sure, we can give it a try
[19:03:44] Probably the readme would be the most up to date
[19:04:33] 'k
[19:05:36] taavi: did the reboot help? Looks like you already tried a few minutes ago
[19:09:31] andrewbogott: no
[19:10:32] I can't tell if it actually rebooted
[19:11:45] looks like it did
[19:16:54] time to redeploy, or do you still have ideas?
[19:19:41] nope
[19:21:28] do you think we can/should try to fix the leak before we deploy?
[19:21:47] nah, we can do it later
[19:25:15] * andrewbogott increases to 4 workers while we're at it
[19:25:58] don't think that's really necessary
[19:26:01] 3 maybe, but 4 seems overkill
[19:26:51] we are not short on compute resources!
[19:27:09] but 3 is fine with me, I just want us to have room to maneuver
[19:30:20] can you tell why it can't push to quay? Expired token maybe?
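For anyone retracing taavi's debugging above: the steps below are a rough reconstruction built around the one kubectl command quoted in the log. The node name and kubeconfig path come from the log, while the inspection commands and the /tmp guess are just one plausible way to confirm a leak of tmp files.

```bash
# On the bastion, use the kubeconfig mentioned above
export KUBECONFIG=/home/rook/quarry/tofu/kube.config

# Launch a debug pod on the full node; this drops you into an interactive
# shell in a pod with the node's filesystem mounted at /host
kubectl debug node/quarry-127a-g4ndvpkr5sro-node-0 -it --image debian:stable

# Inside the debug pod: switch into the node's root filesystem and look around
chroot /host
df -h                                          # confirm which filesystem is full
du -xh --max-depth=1 /tmp | sort -h | tail     # /tmp is a guess at where the leak lands
```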
[19:31:22] what? where?
[19:32:14] https://github.com/toolforge/quarry/pull/77
[19:33:17] no idea, and i don't seem to be a member of https://quay.io/organization/wikimedia-quarry
[19:33:46] me neither I think
[19:33:56] rook, can you add us?
[19:34:35] I have no idea if I still have access to that. It isn't in some larger wiki group?
[19:35:15] I thought it would be but seems not
[19:36:23] Oh I do still have access. Let's see...
[19:39:00] Ok andrewbogott taavi did yinz get an invite link or something?
[19:39:15] yes
[19:40:52] Excellent
[19:42:26] and now I can push!
[19:44:45] Rook: What happens if I deploy.sh? Will it delete and replace the existing deployment? Are we set up to do a proper blue/green in quarry or do you usually just delete/replace?
[19:49:09] * andrewbogott is going to find out
[19:50:22] * andrewbogott predicts that this will do nothing at all
[20:09:54] indeed
[20:10:26] so now I'm stuck on the question: Is this stateless enough that I can just delete the magnum cluster and start over? I'm pretty sure the answer is 'yes' but I don't like deciding that on my own
[20:11:06] Yeah it will just deploy as usual. It has the same blue/green deploy as paws. You have to set up the new cluster first for a blue/green
[20:12:11] how do I tell it to deploy to a new cluster rather than update the existing one?
[20:12:18] I believe you can do a usual blue/green without much more than people needing to log back in. The state lives in NFS
[20:13:26] Like paws. Duplicate the tf file that deploys the cluster and update the name. Be sure to remove the kube config from the current one
[20:14:41] 'k
[20:27:00] hm, this is going very poorly so far
[20:27:54] network name changes
[20:36:05] now new cluster shows as create_in_progress which seems hopeful
[21:14:17] taavi: I've deployed the new three-node cluster and pointed quarry.wmcloud.org at it. It seems... fine? If it stays fine for a day or two
[21:14:30] I'll tear down the old one and get these (minor) changes merged.
[21:14:54] bd808: I'm also interested in your opinion about the current state since you were first to notice last time Quarry broke
[21:19:35] I noticed because I watch the Phabricator feed for cloud things.
[21:22:09] It looks like stuff is happening at https://quarry.wmcloud.org/query/runs/all
[21:32:15] Rook: predictably, all of your deployment code worked like magic once I caught up with the new network name. I'd really appreciate it if you read through my hurriedly-written docs about blue/green deployment in https://github.com/toolforge/quarry/pull/79 and comment if any of what I'm saying sounds wrong. No rush on that though!
[21:33:36] I'm especially interested in whether I have defied convention in my understanding of which is blue and which is green
[21:34:36] thank you for the after-hours work taavi!
[21:34:51] I'm going to go take a walk if things don't crash in the next 3-4 minutes
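A rough outline of the blue/green procedure Rook describes above (duplicate the tf file, update the name, keep the old kubeconfig out of reach). Only deploy.sh, the bastion's tofu directory, and quarry.wmcloud.org come from the log; every other file name is a placeholder, so treat this as a sketch rather than the documented process (that lives in the PR #79 docs mentioned above).

```bash
# Hypothetical blue/green outline for quarry; file names are guesses.
cd /home/rook/quarry/tofu

# 1. Duplicate the file that declares the magnum cluster and give the copy a
#    new resource and cluster name, so tofu builds a second cluster alongside
#    the old one instead of mutating it.
cp cluster.tf cluster-green.tf     # then edit the names inside cluster-green.tf

# 2. Build the new cluster, keeping the old cluster's kube.config out of the
#    way so nothing deploys to it by accident.
tofu apply

# 3. Deploy quarry onto the new cluster (state lives on NFS, so users only
#    need to log back in), then repoint quarry.wmcloud.org at it.
./deploy.sh

# 4. Once the new cluster has been healthy for a while, delete the old
#    cluster's tf file and apply again to tear it down.
```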