[08:30:19] arturo: just in case, a bunch of codfw VMs started failing to reach the cloudinfra-internal-puppetserver-01 node (connection refused), if you are not looking into it let me know (as you were testing the network vxlan stuff)
[08:30:51] mmmm
[08:31:02] I don't think I have changed anything related to that
[08:31:53] moreover, yesterday I ran the network tests for codfw1dev a few times, and they were all green. The tests include puppet agent runs and such
[08:33:02] I'll try to have a look then, it started a couple days back iirc (emails)
[08:33:37] ok, let me know if you need assistance
[08:46:58] draining the last ceph osd node \o/
[08:47:05] 🎉
[08:47:48] might saturate the c8 switch for a bit
[08:48:02] ack
[09:09:37] we are getting some packet drops on D5 yep, should not last long though
[09:10:06] https://usercontent.irccloud-cdn.com/file/TDkYQwRw/image.png
[09:31:19] something is borked in codfw yep
[09:31:26] https://usercontent.irccloud-cdn.com/file/y257DXiY/image.png
[09:31:39] can't reach any of the cloudinfra vms (that's horizon itself)
[09:33:49] hmm, the VMs are down
[09:33:56] `error: The domain is not running` ?
[09:35:23] stopped by nobody
[09:35:27] https://usercontent.irccloud-cdn.com/file/E8NINuNt/image.png
[09:36:07] I think the cloudvirt died
[09:36:16] *is dying (hard drive at least)
[09:36:21] https://www.irccloud.com/pastebin/YDjmdlDl/
[09:36:34] I'll try to reboot it
[09:36:46] oops
[09:36:51] https://www.irccloud.com/pastebin/2pWVj4Sm/
[09:36:53] xd
[09:37:27] it's mounted ro: `/dev/mapper/vg0-root on / type ext4 (ro,relatime,errors=remount-ro,stripe=64)`, something went wrong yep
[09:39:03] it's quite unresponsive, doing an ipmi hard reboot
[09:39:38] ouch!
[09:50:26] some of the instances were stopped (by nobody) on Sep 8th, some today, just manually started the ones in the cloudinfra-codfw1dev project
[09:51:09] the journal logs for `--boot -1` of the cloudvirt reach only until 08:46, 5 min before the stop action, probably it was not able to write to disk
[09:52:25] puppet runs are still failing, will debug them after lunch
[09:52:37] (feel free to debug too if you feel like it)
[09:52:39] cya in a bit
[09:53:56] dcaro: apologies for taking so long to get to the sample-complex-app reviews! 🙈
[11:00:53] np, thanks a lot!
[11:19:05] btw, I discovered this today after forgetting to export poetry.lock to requirements.txt after an iteration, and it's very nice: https://python-poetry.org/docs/pre-commit-hooks/
[11:21:22] that's awesome!
[11:21:27] I want that export one xd
[11:22:30] the other ones are nice too (I committed a broken poetry.lock at some point too...)
[11:23:11] (but really, I think I'll be switching my personal projects to https://docs.astral.sh/uv/)
[11:31:37] I've seen people pushing that one lately
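(editor's note: the export hook mentioned above boils down to re-running `poetry export` on each commit; a minimal sketch based on the linked Poetry pre-commit docs — the pinned `rev` is illustrative, not the version used here)

```bash
# sketch: keep requirements.txt in sync with poetry.lock via pre-commit
pip install pre-commit
cat >> .pre-commit-config.yaml <<'EOF'
repos:
  - repo: https://github.com/python-poetry/poetry
    rev: "1.8.3"   # illustrative; pin to the Poetry version you actually use
    hooks:
      - id: poetry-export
        args: ["-f", "requirements.txt", "-o", "requirements.txt"]
EOF
pre-commit install   # from now on the hook runs on every commit

# one-off equivalent without pre-commit:
poetry export -f requirements.txt -o requirements.txt
```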
[12:36:56] hmm... cloudinfra-db-02 on codfw is blocking ssh connections from the bastion xd
[12:37:17] and it holds the db for the puppet-enc (and it's currently down), so it will not be able to run puppet to fix itself
[12:45:47] xd, long chain: the db was down, and failing to get up as the puppet server was down, and failing to run puppet as the enc was down because it needs the db (probably been failing for a while, at least since the last time we rebuilt the bastion, as the new bastion was not allowed in the db firewall)
[12:46:00] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, DNS lookup failed for cloudinfra-db-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud Resolv::DNS::Resource::IN::A (file: /srv/puppet_code/environments/production/modules/profile/manifests/mariadb/grants/cloudinfra.pp, line:
[12:46:00] 10, column: 9) on node cloudinfra-db-02.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
[12:46:03] and a bit more it seems
[12:56:31] quick review? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071870
[13:03:49] manually applied that on codfw, puppet is back running
[13:10:57] https://wikitech.wikimedia.org/wiki/Help:Toolforge/API
[14:04:57] arturo: any chance you could take a look at this MX update for wmcloud.org, https://phabricator.wikimedia.org/T374278
[14:05:11] not sure who would be best to make the change
[14:05:24] jhathaway: I'm precisely working on that at the moment
[14:05:29] woohoo!
[14:05:45] but I tried the long road :-(
[14:05:48] sorry, should have read the backscroll more carefully
[14:06:00] nah, there is nothing in the backscroll
[14:06:03] let me explain
[14:06:05] phew
[14:06:21] right now, the value is manually hardcoded in the openstack database
[14:06:36] and we are trying to move into opentofu (terraform) instead
[14:06:46] so I'm working on a patch to put that into a nice yaml file, see
[14:07:09] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/40
[14:07:31] I guess the actual record you want to change is this one
[14:07:31] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/40/diffs#dc86aec11085320083cfcb0f510c33383a6949cc_0_9
[14:07:46] but I believe I'm hitting an openstack tofu provider bug :-(
[14:08:01] and cannot import the record into the tofu state
[14:08:34] :( that sucks, terraform importing is tricky in my limited experience
[14:09:42] nah, I think it's just a small bug
[14:10:15] I'll report upstream and see if a small patch fixes it, otherwise I'll go the old hardcode route to unblock you... maybe tomorrow?
[14:10:21] would that work for you?
[14:17:28] no super rush, just didn't want it to get lost, it is the last thing blocking the decom of the exim vms I believe
[14:21:41] jhathaway: ack. Ping me tomorrow for further news, hopefully they'll be good ones
[14:21:52] thanks again
[14:29:34] np
[15:19:32] Raymond_Ndibe: I see you finished provisioning the new workers :), now we can start removing the old ones
[15:19:40] we should probably do the same for the non-nfs workers
[15:19:47] (and maybe ingress?)
[15:25:21] hmm... about ceph, I think we will not be able to get it more even, the issue is that the E4 rack is getting full, and it becomes the limiting factor, as it's the smallest of them all, and already at 81% usage
[15:29:26] ack
[15:51:05] * arturo offline
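(editor's note: on the E4 rack comment above, the per-rack fill level can be read straight off the CRUSH tree; a rough sketch, assuming admin access on a ceph mon/admin host)

```bash
# sketch: see how full each rack/host/OSD is; a rack at ~81% would stand out here
ceph osd df tree        # utilisation broken down by the CRUSH hierarchy (racks, hosts, OSDs)
ceph df                 # overall cluster and per-pool usage
ceph balancer status    # whether the balancer can still shuffle PGs to even things out
```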
[16:47:45] Raymond_Ndibe: maintain-kubeusers is down, did you do anything?
[16:47:54] (alert, looking if it's for real)
[16:48:10] the pod is running
[16:51:24] this works
[16:51:28] root@tools-prometheus-6:~# curl --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt --key /etc/ssl/private/toolforge-k8s-prometheus.key --insecure -v https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/maintain-kubeusers/pods/maintain-kubeusers-8d67b68bb-mc6n4:9000/proxy/metrics
[16:53:28] hmm... the last log from prometheus is from 15:02 UTC (~2h ago, when the alert shows it started failing)
[16:56:40] aaahhhh... it's toolsbeta 🤦
[16:57:49] interesting, it's restarting itself saying 'pod sandbox changed'
[16:58:32] ohhh, I think it might be getting killed
[17:00:14] blancadesal: Raymond_Ndibe any of you around?
[17:00:26] what's up?
[17:01:15] Ohh yes I am
[17:01:50] quick +1 https://phabricator.wikimedia.org/T374476
[17:01:54] ?
[17:02:00] (nothing critical xd)
[17:02:11] done!
[17:02:15] thanks :)
[17:04:54] restarted maintain-kubeusers and it's up and running again
[17:05:16] ok. wanted to go looking now
[17:06:00] but the old one is stuck in terminating with an interesting error
[17:06:05] https://www.irccloud.com/pastebin/UB7Fbg1B/
[17:06:17] (if you want to keep digging xd)
[17:08:47] I'm about to log off, I did not check much more on that (it was running on toolsbeta-test-k8s-worker-nfs-6, maybe it was not correctly set up or something?) will continue testing tomorrow if you don't find the fix xd
[17:08:51] cya!
[17:08:53] * dcaro off
[17:10:10] oh yes
[17:10:16] calico crashed on nfs-6
[17:11:13] there's something wrong with nfs-6 :/
[17:11:16] (toolsbeta)
[17:12:08] Let me attempt to recreate it
[17:13:00] I'll let you play with it xd
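(editor's note: a rough follow-up sketch for the nfs-6 / stuck-pod debugging above, assuming standard kubectl access on the toolsbeta control plane; the pod name below is a placeholder, not the actual one)

```bash
# sketch: check calico and the node before recreating the worker
kubectl get pods -A -o wide --field-selector spec.nodeName=toolsbeta-test-k8s-worker-nfs-6   # is calico-node on that node healthy?
kubectl describe node toolsbeta-test-k8s-worker-nfs-6                                        # NotReady / network conditions, recent events

# last resort for the old pod stuck in Terminating (replace the placeholder name):
kubectl delete pod -n maintain-kubeusers maintain-kubeusers-<old-pod> --grace-period=0 --force
```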