[08:54:04] Morning
[08:54:40] hello! how's the jet lag?
[08:56:00] Quite ok for me I'd say, even if I'm still waking up in the middle of the night, I'm able to fall asleep again :)
[08:56:36] it's been worse coming back than going, for some reason
[08:58:28] I think that's common, but I don't know if there's a scientific explanation :D
[08:59:00] how's everything going? I've been out for a bit
[08:59:13] I still haven't slept well since returning, but at this point I've decided to just continue with the normal routine
[09:01:51] dcaro: all pretty quiet I would say, we fine-tuned the "kernel error" alerts, which are now way less spammy
[09:02:13] security reboots of all cloud servers, which took a while
[09:02:36] there was an issue with VM migrations so the cloudvirts took longer
[09:03:01] toolsdb had issues last week, it seems stable now, fingers crossed
[09:04:02] was the ceph/VM rebooting issue resolved?
[09:04:37] yes, I think a.ndrew rebooted all VMs in the end
[09:04:46] there's a follow-up ticket for how to avoid that in the future
[09:05:01] T385288
[09:05:01] T385288: Changing the IPs of cloudcephmons should not require VM reboots - https://phabricator.wikimedia.org/T385288
[09:57:48] dcaro: do you have a ticket for ceph <-> tools k8s?
[09:59:24] T384596
[09:59:25] T384596: [toolforge,storage,infra,k8s] Investigate persistent volume support - https://phabricator.wikimedia.org/T384596
[10:00:37] thanks
[10:17:11] arturo: there's a tofu-infra alert, I think that's because https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/92 was merged but not applied
[10:19:49] dhinus: ok, easy fix then!
[10:25:18] hmm, maybe we should apply and then merge like in toolforge-deploy? (hopefully in an automated way, so no merged patch is left unapplied)
[10:27:14] I added a note here, with a link to a nice blog post with pros/cons: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/tofu-infra#Automated_workflow_via_cookbook
[10:28:06] I slightly prefer "merge before apply"
[10:28:12] but it's not a strong preference
[10:29:40] related, there's a new decision request about tofu-infra that could help with this type of issue: T385604
[10:29:41] T385604: Decision Request - How openstack projects relate to tofu-infra - https://phabricator.wikimedia.org/T385604
[10:30:34] "The main branch always acts as the single source of truth (and thus reflects the current state of your infrastructure)."
[10:30:48] it says that's a benefit of apply after merge, but it's the other way around
[10:31:08] (unless you apply right after merge, and revert if the apply fails)
[10:32:41] I tried both, and I had times where the main branch was not in sync with reality in both cases :D
[10:32:48] and this also doesn't seem accurate: "The main branch often falls behind the actual state of your cloud infrastructure which is against everything that GitOps has taught us."
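(Editor's note: the "apply right after merge, revert if the apply fails" idea from [10:31:08] could look roughly like the sketch below. This is a minimal, hedged example assuming a plain local checkout at an invented path; it is not the actual tofu-infra cookbook or CI job.)

```bash
#!/usr/bin/env bash
# Sketch of "apply right after merge, revert if the apply fails".
# /srv/tofu-infra is an assumed checkout location, not the real setup.
set -u

cd /srv/tofu-infra
git pull --ff-only origin main

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
tofu plan -detailed-exitcode -out=plan.out
case $? in
  0) echo "main already matches the applied state"; exit 0 ;;
  2) echo "changes detected, applying" ;;
  *) echo "plan failed"; exit 1 ;;
esac

if ! tofu apply plan.out; then
  # apply failed: revert the merge commit so main keeps reflecting reality
  git revert --no-edit HEAD
  git push origin main
  exit 1
fi
```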
[10:32:58] for the apply-before-merge case
[10:34:13] hahah, this is actually a good thing disguised as a bad one: "Developers start spending a lot of time rebasing branches to pull in the latest changes and avoid conflicts.", it would be really bad to do that at the end of the flow instead of "pushing it left" xd
[10:35:51] anyhow, yep, both definitely have benefits/drawbacks
[10:43:01] I think it's mostly a communication problem, more than a technical problem
[10:43:56] a.ndrew merged the PR yesterday, but he hasn't worked with tofu-infra much, so my guess is that he forgot the current process requires you to apply after you merge something
[10:44:21] that's where automation helps imo
[10:45:15] yep, I think we should definitely automate it sooner or later, in the meantime the alert is a "fallback"
[10:47:07] if we stop managing all projects in tofu-infra (see the decision request), we could experiment with automation in a single project, where the risk is lower
[10:47:42] well, we could do that in any case, e.g. in toolsbeta
[10:48:05] I would only add automation for critical things like cloudinfra after some testing
[10:48:18] this is similar to what happens with puppet
[10:48:55] instead of the alert, we could have an automatic run every 30 minutes or similar
[10:49:55] I think we can do better than that, that part of the puppet implementation has been a source of pain since the beginning (at least for me), forcing many round-trips of 'code, review, merge, wait, fail, code, review, merge, ...'
[10:50:02] just check the git logs xd
[10:50:25] part of it is unavoidable I think, but I'm sure we can reduce friction a bit
[10:50:41] gitlab CI could run immediately after merge, instead of waiting 30 minutes, for example
[10:50:45] that also means that many of the commits there are not really working nor reflecting the state of the infra
[10:51:45] I think in puppet doing apply before merge would be even worse, because we have so many open patches that it's hard to tell which one is being applied
[10:52:00] https://www.irccloud.com/pastebin/CosefWAX/
[10:52:05] xd
[10:53:06] dhinus: that's what the rebase-before-apply strategy forces, and what puppet-merge forces too
[10:54:02] zuul was written for that too, to streamline that process
[10:54:07] well anyway, let's put puppet aside for a moment... let's try to see which issues we're facing with tofu-infra
[10:54:28] I think the main issue is that we are not all on the same page about what the current process is
[10:54:33] I don't think we are facing any issue :-)
[10:54:44] well, we have alerts :)
[10:55:05] remove the alert
[10:55:22] have a timer
[10:56:12] my proposal: 1. make sure that everybody knows how the process works 2. remove the projects as you suggested in the decision request
[10:57:57] I'm pretty sure I have communicated the process and how it works, multiple times
[11:00:44] let's include a.ndrew in the conversation, maybe during the checkin later
[11:04:43] sure
[11:04:49] 👍 I don't think it's an urgent thing in any case though, just improvements on the current state
[11:29:50] has anybody seen this error before with lima-kilo? "Failed to unlock disk "cache". To use, run `limactl disk unlock cache`"
[11:31:09] nope, do you have more than one instance running?
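(Editor's note: for the disk lock error at [11:29:50], a hedged way to check what is holding the shared disk before unlocking it; the disk name comes from the error message itself, everything else assumes a typical lima-kilo setup.)

```bash
# List all Lima instances and their status; a stopped-but-not-deleted
# instance may still hold the lock on the shared disk.
limactl list

# Show additional disks and which instance, if any, currently locks them.
limactl disk list

# Only if no running instance is using it: release the lock manually,
# as the error message itself suggests.
limactl disk unlock cache
```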
[11:32:40] not that I know of :D
[11:33:24] xd, `limactl list` should show them
[11:34:22] oh wait, I think the error was while deleting the old instance
[11:34:30] I was running start-devenv to recreate it
[11:34:43] now the old instance is gone, I can create a new one with no errors
[11:34:57] I'll check what happens the next time I recreate
[11:36:22] 👍
[11:36:24] * dcaro lunch
[11:36:39] I have a doctor appointment too, so I might be a bit late coming back
[12:01:21] I know the tofu-infra process, I just forgot to apply. Shall I do that now or did someone else do it already?
[12:04:39] andrewbogott: did you use the cookbook?
[12:05:02] hmm... I guess not, as it was deleting a project, not adding it, right?
[12:05:40] I didn't do anything but merge on gitlab
[12:06:31] ack
[12:10:23] I'm running the cookbook now
[12:13:44] seems happy
[13:30:01] arturo: any reason not to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114997 today? 1004 seems to be working well.
[13:30:33] (I don't understand the pcc error but it seems like a red herring)
[13:38:23] FYI everyone, I just merged the patch removing .eqiad.wmflabs from resolv.conf. Please keep your eyes peeled for unexpected DNS failures.
[13:42:57] andrewbogott: mmm, ideally we would fail over the server, to make sure the new one in the cluster can handle the load
[13:43:07] (and not wait for an outage to test)
[13:43:29] arturo: ok. That would be after the reimages though?
[13:43:40] andrewbogott: yes
[13:44:05] so I guess that means the answer is: yeah, let's merge and reimage!
[13:44:09] ok. So how about I reimage today, and you schedule a failover for later in the week?
[13:44:16] sounds good
[13:44:23] dhinus: did we still have a pending cloudnet reboot?
[13:45:14] arturo: no, I did it last week
[13:45:37] ok, great. I'm sorry for the delay :-( I promised and then never delivered
[13:46:19] no problem, it was quick & easy, there was an automatic failover during the weekend so the one I rebooted was not the primary
[13:46:34] I didn't check why there was an automatic failover though
[13:47:01] ok
[13:51:35] * dhinus is slightly annoyed by the [Resolved] emails to cloud-admin-feeds having a red header instead of a green one
[13:51:50] the emails from the wmcloud alertmanager have a green header, the ones from the prod alertmanager a red one
[13:56:38] arturo: I have thought about this more and think we should do the reimage closer to the failover test so we aren't running on one node in the meantime
[13:57:06] andrewbogott: fair, feel free to schedule an op window
[13:57:26] Do you think we need to notify users about the failover test or should it be unnoticeable?
[13:57:47] the failover is usually only experienced by IRC bots
[13:58:04] arturo: thanks for preparing the onboarding phab task for chuck. I spotted a small error in the template, I fixed both the template and the task
[13:58:06] there are definitely people who notice the problem
[13:58:11] dhinus: thanks!
[13:58:17] ok. Let's do it at this time tomorrow?
[13:58:47] andrewbogott: works for me, well, 23h from right now?
[13:59:06] 23 or 24 as you prefer!
[13:59:15] (23h works better for me than 24h from now, as 24 would be my usual lunch time)
[13:59:54] sure
[14:00:31] now... why can't I edit the wmcs team calendar
[14:00:53] * arturo food time
[14:01:21] and why /can/ I edit dcaro's calendar?
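(Editor's note: for the resolv.conf change merged at [13:38:23], a quick sanity check on an affected VM could look like the sketch below; the hostnames are only examples, one taken from a log line later in this transcript, not a prescribed procedure.)

```bash
# The search line should no longer contain .eqiad.wmflabs
grep '^search' /etc/resolv.conf

# Short names should still resolve through the remaining search domains,
# and external names should keep working as before (example names only).
getent hosts tools-k8s-control-7
getent hosts wikipedia.org
```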
[14:13:52] still fighting with google calendar, but I think I sent a (personal) invite
[14:46:44] andrewbogott: you might want to reboot the coredns workers in toolforge k8s to make sure they pick up the resolv.conf changes now (and not the next time we do a cluster-wide reboot)
[14:47:21] sure. Anything I need to know about rebooting them or is it safe as long as I do one at a time?
[14:48:41] i think just one at a time should be fine
[14:48:50] sounds good
[14:49:24] andrewbogott: hmm... gcal superpowers?
[14:49:47] if so, they're very selective
[15:07:28] hmm... got an email from cloud-admin-feed about alerts on toolforge, but I don't see any on alerts.w.o or prometheus-alerts.wmcloud.org, suspicious
[15:10:24] I just restarted a few nodes (as per the discussion above) and now I'm waiting for big DNS issues to appear...
[15:11:04] stashbot, still here?
[15:11:04] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[15:11:30] okok, I'm suspicious because the other day one of the prometheus instances in metricsinfra was not sending alerts due to network issues (which showed up as alerts sometimes not appearing in alerts.w.o)
[15:11:35] fwiw I meant re-creating the individual pods and not the entire VMs
[15:12:05] taavi: yeah, I got that right after I did the reboots :)
[15:12:09] hahaha
[15:12:37] reboots kind of make sense for cache-purity reasons, but in that case all I did was reschedule those pods onto nodes that hadn't been rebooted
[15:13:16] connectivity looks ok yep, less suspicious now
[15:13:32] also I'm not sure if rebooting the node will actually cause the container to be re-created for the effect you want
[15:14:38] taavi: ok, I maybe don't know what you mean by re-create then. Not just kill and let them reappear?
[15:15:14] basically what I meant was `kubectl delete`-ing those pods
[15:15:49] and that doesn't happen when a node is drained?
[15:16:14] not automatically!
[15:16:23] ok, will delete
[17:27:06] hmm, while rebooting one of the tools k8s workers, got this message, first time I've seen it:
[17:27:14] I0218 16:30:05.163654 36420 request.go:697] Waited for 3.209305283s due to client-side throttling, not priority and fairness, request: GET:https://tools-k8s-control-7:6443/api/v1/namespaces/tool-nada/pods/newall-2-28998261-bmt9h
[18:21:05] * dcaro off
[18:21:07] cya!
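(Editor's note: re-creating the coredns pods as suggested at [15:15:14], instead of rebooting nodes, might look like the sketch below. The namespace and label are assumptions based on a standard kubeadm layout and may not match the Toolforge cluster; verify before deleting anything.)

```bash
# Find the CoreDNS pods (namespace and label are assumptions, check first).
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Delete them; the Deployment re-creates replacements, which typically read
# the node's current /etc/resolv.conf as their upstream forwarder config.
kubectl -n kube-system delete pods -l k8s-app=kube-dns

# Watch the replacements come back up.
kubectl -n kube-system get pods -l k8s-app=kube-dns -w
```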