[07:39:54] !log melos@tools-bastion-13 tools.stewardbots ./stewardbots/SULWatcher/manage.sh restart # Disconnected
[07:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[09:42:56] !log admin hard-rebooting cloudvirt2004-dev (codfw1dev) having io/hardware issues
[09:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:17:17] Hello everyone! How can I read custom envvars (toolforge create envvars…) from the PHP webservice? getenv() is empty. But default vars such as TOOL_TOOLSDB_USER are fine.
[12:18:00] Iluvatar: that should be working ok, did you restart your webservice to pick up the new vars?
[12:20:25] dcaro: oh, thanks :D
[12:31:00] !log lucaswerkmeister@tools-bastion-13 tools.sal webservice restart
[12:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[12:31:23] (feels like this tool could use a health check URL 🤔 maybe I can do that later)
[15:28:10] o/ howdy, in the catalyst project we removed a volume from a stopped instance, resized another volume, and now boot drops us into emergency mode. This is for the k3s instance in the catalyst project, any idea?
[15:28:42] (any idea how to get us ssh access so we can actually log in to the instance)
[15:30:23] the failure is a failed mount of the disk we deleted, guessing we need to edit /etc/fstab? Hard to say looking at just the instance's boot log, but it does mention a failure to mount
[15:31:49] this "rescue instance" thing seems like what we want?
[15:57:47] well, I guess rescue instance isn't what we want, since we couldn't get a shell there / VNC to work. Anyone with root want to help me edit /etc/fstab?
[16:00:10] !help ^ thcipriani needs a WMCS root's help to unbreak an instance with fstab problems
[16:00:10] If you don't get a response in 15-30 minutes, please email the cloud@ mailing list -- https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication
[16:09:24] taavi: are you around?
[16:10:23] thcipriani: I can take a look
[16:10:35] it's a bit unfortunate that there's no console access to the VM :/
[16:10:53] thanks, if we could delete the /etc/fstab entry for /mnt/k8s-logs then it'd be happy
[16:10:56] thcipriani: what's the name of the instance?
[16:11:10] k3s in the catalyst project
[16:12:03] hmm, there's no such entry
[16:12:08] root@k3s:/etc# cat /etc/fstab
[16:12:08] PARTUUID=2e399785-b692-4a7b-99e5-664729140a16 / ext4 rw,discard,errors=remount-ro,x-systemd.growfs 0 1
[16:12:08] PARTUUID=5ccf4045-7f8b-4da7-844a-391da609c90a /boot/efi vfat defaults 0 0
[16:12:22] Is that because it's in rescue mode?
[16:12:46] We can take it out of rescue mode
[16:14:10] maybe, let me try to mount the drives somewhere and check in those
[16:14:57] dcaro: maybe when this is done, you could walk me through accessing the console through the openstack cli, and we can document it on wikitech?
[16:15:20] there's a cookbook
[16:15:37] (you can run it from your laptop, or cloudcumin1001) cookbook wmcs.openstack.cloudvirt.vm_console
[16:16:03] we can go through the 'manual' process too
[16:17:13] I can actually ssh to the VM
[16:19:26] this does not look like emergency mode?
[16:19:26] Can I restart the VM to make sure?
[16:20:14] root@k3s:~# systemctl list-units --failed
[16:20:14] UNIT LOAD ACTIVE SUB DESCRIPTION
[16:20:14] ● cloud-final.service loaded failed failed Execute cloud user/final scripts
[16:20:54] it seems it failed to run puppet as part of that
[16:21:48] let me clean up the key on the puppetmaster and retry
[16:22:02] it was a reimage without changing the VM name or similar right?
[16:22:50] (if the cert cleanup did not have time to kick in, the cert might still be there on the puppetserver when the new instance comes up with the same name, making the puppet runs fail)
[16:23:24] Uh... we did not reimage the instance, we just deleted a volume that was attached to it
[16:23:33] We then rebooted it into recovery
[16:23:53] hmm, interesting
[16:24:12] and it still lists itself as "rescue" on the horizon ui
[16:26:03] okok, I see, so it booted from an old image, that might also explain the puppet errors (using an old cert)
[16:26:43] sorry for the confusion xd
[16:26:49] I commented out the entry for the sdb1 drive
[16:26:54] https://www.irccloud.com/pastebin/6K8J0pgI/
[16:27:23] and unmounted it, you can now unrescue the instance
[16:27:29] <3
[16:27:36] it should boot back from the (current) sdb drive
[16:29:00] OK great! Thank you!
[16:29:03] neat, alright, let's try it, thanks dcaro.
[16:34:24] kindrobot: for the console you first need to find which host your instance is running on (cloudvirt*) and the domain id (idXXXXX) by sshing to a cloudcontrol* node and running sudo env OS_PROJECT_ID=myproject wmcs-openstack server show, then ssh to that cloudvirt and run `virsh console idXXXXX`
[16:34:35] we can try at some point if you want
[16:38:10] ok, once we get the instance back up, I'll ping you
[16:45:28] 👍
[16:50:08] /ac/wg 3
[16:53:22] dcaro: we wanted to make a snapshot of the k3s-data volume before resizing it, but we're out of space. If I made a ticket to add ~80 GB of storage, could it be fast-tracked?
[16:54:04] kindrobot: I can give it a look
[16:55:03] OK, I'll make the ticket
[16:58:36] dcaro: https://phabricator.wikimedia.org/T374476
[17:04:15] kindrobot: should be done :)
[17:09:25] I'm clocking out for today, let me know how it goes! (anything critical feel free to ping me)
[17:51:23] We're good, thanks dcaro!
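
For reference, a minimal sketch of the envvars flow from the 12:17 question above. The command syntax and the MY_SECRET name are assumptions for illustration, not taken from the log; the point of the 12:18 answer is that the webservice has to be restarted before a newly created variable shows up in getenv():

    # run as the tool account on the bastion; MY_SECRET is a hypothetical example name
    toolforge envvars create MY_SECRET 'some-value'
    toolforge envvars list            # check that the variable was stored
    webservice restart                # restarted webservice pods get the variable injected
    # in the tool's PHP code, after the restart:
    #   getenv('MY_SECRET')           // should now return 'some-value' instead of false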
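
And a rough sketch of the 'manual' console-access steps described at 16:34, in case this ends up documented on wikitech. The project and instance names are the ones from this log, and the exact output field names are an assumption; the wmcs.openstack.cloudvirt.vm_console cookbook mentioned above wraps the same steps:

    # on a cloudcontrol* node: find the hypervisor (cloudvirt*) and the libvirt domain id
    sudo env OS_PROJECT_ID=catalyst wmcs-openstack server show k3s
    #   OS-EXT-SRV-ATTR:hypervisor_hostname  -> which cloudvirt* hosts the VM
    #   OS-EXT-SRV-ATTR:instance_name        -> the domain id (the idXXXXX value)
    # then ssh to that cloudvirt and attach to the serial console (detach with Ctrl+])
    sudo virsh console idXXXXX
    # or, from a laptop or cloudcumin1001, use the cookbook instead:
    cookbook wmcs.openstack.cloudvirt.vm_console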