[07:36:31] I received a bunch of pages around 4:30 UTC, they all auto-resolved in a minute or so
[07:36:48] morning, what were they about?
[07:37:01] they were all like "cloudvirtXXXX/nova-compute proc minimum", on 8 different cloudvirts
[07:37:13] maybe a network glitch?
[07:37:21] I suspect they could all be in the same row
[07:37:34] you can see them in #-cloud-feed
[07:37:37] (or in victorops)
[07:38:28] this seems correlated https://gerrit.wikimedia.org/r/c/operations/puppet/+/1161148 (similar timing)
[07:38:31] no, they were in different rows
[07:39:09] oh that's possible, good catch
[07:44:04] I think andrewbogo.tt rebooted all the daemons for that yep https://sal.toolforge.org/admin?d=2025-06-19
[07:44:48] that adds up yes
[09:01:36] quick tofu review (new project) https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/249
[09:03:28] LGTM though, so if nobody reviews in ~30min I'll go ahead with it :)
[09:03:39] thanks taavi
[09:03:41] yeah, I don't see much point in getting additional reviews for those, but done
[09:06:51] Looking for +1s for https://phabricator.wikimedia.org/T396840 and https://phabricator.wikimedia.org/T397266 , seem ok to me
[09:07:17] tofu failed :S
[09:07:25] │ Error: Error creating openstack_identity_project_v3: Expected HTTP response code [201 202] when accessing [POST https://openstack.eqiad1.wikimediacloud.org:25000/v3/projects], but got 504 instead:

504 Gateway Time-out

[09:07:27] looking
[09:07:33] should be safe to retry right?
[09:07:49] yeah
[09:08:48] the project is actually there in horizon, I guess it did not get to create the security groups and such, checking
[09:09:48] hmm... retry fails with
[09:09:53] │ Error: Error creating openstack_identity_project_v3: Expected HTTP response code [201 202] when accessing [POST https://openstack.eqiad1.wikimediacloud.org:25000/v3/projects], but got 409 instead: {"error":{"code":409,"message":"Conflict occurred attempting to store project - it is not permitted to have two projects with either the same name or same id in the same domain: name is toolsbeta-logging, project id
[09:09:54] 94d453f99e23466e98f26025cf9a647e.","title":"Conflict"}}
[09:10:24] Is there a way to 'refresh' the tofu state from openstack?
[09:10:41] should I create one of those 'import' entries in the tofu code?
[09:11:51] that, or ssh to a cloudcontrol where the repo is cloned and run `tofu state import` by hand
[09:12:08] okok
[09:12:11] s/`tofu state import`/`tofu import`/, https://opentofu.org/docs/cli/import/usage/
[09:13:17] that sounds nicer to me
[09:17:14] hmmm.... that syntax is tricky
[09:17:24] https://www.irccloud.com/pastebin/91600AG0/
[09:17:45] but there is such a `resource` entry in the project module
[09:18:04] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/blob/main/modules/project/project.tf#L2
[09:19:48] you need the full opentofu resource path, which should've been logged just before the error snippet you copy-pasted here
[09:21:34] `module.project["toolsbeta-logging"].openstack_identity_project_v3.project[0]` fails too
[09:21:50] (that's what it showed)
[09:24:53] that should work
[09:24:57] mind if i try?
[09:25:01] sure
[09:25:12] thanks
[09:27:41] this worked for me:
[09:27:41] taavi@cloudcontrol1006 /srv/tofu-infra $ sudo tofu import 'module.project["toolsbeta-logging"].openstack_identity_project_v3.project[0]' f048baba9d7e46a49dcc72221af0da13
[09:27:52] the openstack project id in your command was completely wrong
[09:28:03] oh, what's that id?
[09:30:14] the openstack internal project id?
[09:30:22] where did you get the id in your command from?
[09:30:46] I got it from the error
[09:30:54] │ Error: Error creating openstack_identity_project_v3: Expected HTTP response code [201 202] when accessing [POST https://openstack.eqiad1.wikimediacloud.org:25000/v3/projects], but got 409 instead: {"error":{"code":409,"message":"Conflict occurred attempting to store project - it is not permitted to have two projects with either the same name or same id in the same domain: name is toolsbeta-logging, project id
[09:30:54] 94d453f99e23466e98f26025cf9a647e.","title":"Conflict"}}
[09:31:47] that is the id of the project that tofu tried to create but failed because the name was already in use
[09:32:50] anyway, you now need to run tofu again to create the rest of the resources
[09:33:07] so that id is from tofu or openstack?
[09:33:35] from openstack, but for a project that does not exist
[09:33:47] so some kind of temporary id?
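[editor's note] For reference, a minimal sketch of the in-code alternative that was asked about at 09:10:41 (an OpenTofu `import` block instead of running `tofu import` by hand). The resource address and project ID below are the ones from taavi's command; whether the tofu-infra workflow actually prefers import blocks over manual CLI imports is an assumption here, not something stated in the conversation.

```hcl
# Sketch only: declarative import of an already-existing Keystone project into
# the tofu state. The address must be the full resource path, including the
# module instance key and the count index.
import {
  to = module.project["toolsbeta-logging"].openstack_identity_project_v3.project[0]
  # ID of the project that already exists in OpenStack (NOT the randomly
  # generated ID shown in the 409 error message).
  id = "f048baba9d7e46a49dcc72221af0da13"
}
```

Either way, as noted at 09:32:50, a follow-up `tofu apply` is still needed afterwards to create the remaining resources (security groups and such) that the failed run never got to.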
[09:34:24] so basically tofu tells openstack "hey, create a project with this name 'toolsbeta-logging'"
[09:34:53] openstack gets that request, and because tofu did not specify a project id, it generates one randomly
[09:35:02] (that's the ID included in the error message you saw)
[09:35:14] and only after that it realizes that project name is actually in use already
[09:35:26] so that project ID would have been used if the project was created successfully then
[09:35:38] but it wasn't, because there was another project already using the name
[09:37:18] sounds like a high chance of leaking ids xd
[09:37:32] done \o/
[10:33:26] * dcaro lunch
[12:02:44] hmm, seems like I'm going to need a tools-logging project right away as well since making that optional in tofu is going to be not exactly easy
[12:10:30] T397446
[12:10:31] T397446: Request creation of VPS project - https://phabricator.wikimedia.org/T397446
[12:41:14] andrewbogott: sorry I didn't get back to you the other day about the cloudceph / private vlan stuff
[12:41:25] did we come to any conclusion about how to proceed on that?
[12:41:44] hmm... tofu failed again when trying to create a project with the same error (Gateway timeout)
[13:03:58] toolforge is having issues, see #-cloud
[13:04:04] I see an alert ToolNFSDown
[13:04:33] i'm declaring an incident, will be IC
[13:04:52] thanks taavi
[13:05:04] I cannot even login to the bastion
[13:05:15] I would start looking at the NFS server if that alerted specifically
[13:05:17] is Ceph ok?
[13:05:49] checking ceph
[13:05:59] status doc is https://docs.google.com/document/d/1K11lvIFcwBUwpXdvPLaLPDrmyal7Cyrt-KU8JYQcKjA/edit?tab=t.0#heading=h.95p2g5d67t9q
[13:05:59] I can login to the bastion as root
[13:06:15] ceph HEALTH_OK
[13:06:26] Project tools instance tools-nfs-2 is down
[13:06:47] ok, let's reboot that via openstack?
[13:07:14] doing
[13:07:55] rebooted
[13:08:23] did you do a soft or hard reboot?
[13:09:05] "server reboot" without flags, if only the help would tell me which is the default :)
[13:09:10] I'll retry with --hard
[13:09:15] i think by default that's a soft reboot
[13:09:38] that looks likely, I could still not ssh. now trying with --hard
[13:09:58] checking the console also
[13:10:18] just before the reboot, the console had nothing since the last boot in march
[13:10:39] I can see the prompt in the console
[13:10:39] * dcaro paged
[13:10:44] and I can ssh now
[13:11:01] * dcaro reading backscroll
[13:11:01] nfs-server.service loaded failed failed
[13:11:14] exportfs: Failed to stat /srv/tools/home: No such file or directory
[13:11:51] well that directory certainly exists now
[13:12:30] I'll try restarting the unit
[13:12:36] dhinus: i think puppet already did that
[13:12:39] so please don't
[13:12:41] ack
[13:12:57] follow-up: figure out why that didn't start immediately at boot
[13:13:17] are things recovering?
[13:13:51] yes!
[13:13:58] I can open e.g. https://replag.toolforge.org/
[13:14:25] other tools seem slower
[13:14:34] dhinus: can you check if the bastions are fine?
[13:14:39] (i.e. they're not loading)
[13:14:41] checking bastions
[13:14:41] I will start the cookbook to restart all k8s nfs nodes
[13:14:49] sounds good
[13:15:32] bastion seems still struggling
[13:16:02] the other tools I was trying to load did load eventually
[13:16:30] is anyone looking at the nfs server logs to figure out what went wrong?
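[editor's note] On the soft vs. hard reboot confusion at 13:08:23–13:09:38: a short sketch, assuming the standard python-openstackclient commands and credentials for the relevant project. The instance name is the one from the conversation; as taavi suspected, the client performs a soft reboot when no flag is given.

```sh
# Soft reboot (the default): asks the guest OS to restart itself, which a
# wedged VM such as a hung NFS server may simply ignore.
openstack server reboot tools-nfs-2

# Hard reboot: power-cycles the instance at the hypervisor level, which is
# what eventually brought tools-nfs-2 back here.
openstack server reboot --hard tools-nfs-2

# Check the instance console output afterwards to see how the boot went.
openstack console log show tools-nfs-2 | tail -n 50
```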
[13:16:49] not yet
[13:17:14] I cannot ssh to bastions
[13:17:32] I can with root though
[13:17:58] that sounds like what I would expect when nfs misbehaves
[13:18:11] so you're going to need to reboot those :-)
[13:18:59] doing
[13:19:26] can you do the same for the mail server?
[13:21:39] sure. bastions rebooted, I can ssh but it's hanging before getting to the prompt
[13:21:42] ok now that resolved
[13:21:55] I'll reboot tools-static
[13:22:12] thanks, I forgot about static
[13:22:23] rebooting tools-mail-4
[13:24:05] toolsdb is readonly, I'll reenable it
[13:24:40] wait why is that read-only now?
[13:25:37] it crashed
[13:25:39] I checked in the logs
[13:25:52] I don't like how many things are crashing at the same time
[13:25:53] hmm... does it use NFS for anything?
[13:26:09] no, so I suspect there was a Cinder issue of sorts
[13:26:21] that maybe caused the NFS issue as a consequence?
[13:26:39] the nfs server shows a kernel oops right before the restart
[13:26:41] huh, that's not great
[13:26:51] dhinus: when you have a moment, paste the toolsdb crash logs to the doc?
[13:27:53] sure
[13:29:16] hmm... ceph shows no new crashes either, network ok, no osds getting busy, or traffic increase/errors/etc.
[13:29:21] done. there are some scary "Crash recovery is broken" logs I'm not sure about
[13:29:41] from mariadb?
[13:29:48] yes. but it seems to be working fine now.
[13:33:08] quick look does not show anything interesting in openstack logs (logstash)
[13:33:16] no increase in the amount of logs either
[13:34:39] the mariadb logs are weird for a crash, it's like some lines are missing
[13:35:26] also it looks like there was something wrong for more than 3 mins
[13:38:09] is there anything that's still acutely broken?
[13:38:15] not that I can see
[13:38:23] maintain-kubeusers maybe?
[13:38:34] (judging from alerts.wm.org)
[13:38:43] I assume that'll fix itself with the k8s reboots, although we could manually restart that somewhere else if we wanted to
[13:38:50] let's wait
[13:40:08] i'm declaring the incident more or less closed then
[13:43:26] just manually rebooted maintain-kubeusers
[13:43:35] started back ok
[13:47:50] sorry for messing up the styles of the doc, but it was making pasting logs really hard :/
[13:50:40] that's ok I think it's more readable now
[13:53:48] I suspect mariadb did not actually go into read-only mode, but it was struggling and reported as "Down" for a few minutes, which triggers ToolsToolsDBWritableState
[13:54:05] that alert should probably be renamed because it doesn't check the "writable state" at all
[13:54:46] it does "sum(mysql_up) - sum(mysql_global_variables_read_only) != 1"
[13:55:43] +1 for rename
[13:56:24] the idea of that expression is to check if there's exactly one toolsdb instance that has read_only = 0, is that not what that expression is doing?
[13:56:52] hmm it alerts for 2 very different situations
[13:57:01] 1. both instances are read-only
[13:57:14] 2. the primary is down, and the read-only is up
[13:57:25] *the replica is up
[13:57:36] so I thought it was 1 when I saw the alert, but it was 2
[14:00:17] I can see other similar errors in the last few days with "[ERROR] InnoDB: Crash recovery is broken", but the timing of today's one aligns very well with the NFS issue
[14:01:42] maybe `ToolsToolsDBNoWritableInstance`?
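[editor's note] Purely to illustrate the split discussed above: the single `sum(mysql_up) - sum(mysql_global_variables_read_only) != 1` expression could become two Prometheus rules, one per situation. The rule names (other than the `ToolsToolsDBNoWritableInstance` suggestion from the chat), the `for:` durations and the file layout are hypothetical, not the real WMCS alert definitions, and they assume mysqld_exporter stops reporting `mysql_global_variables_read_only` for an instance that is down.

```yaml
groups:
  - name: toolsdb
    rules:
      - alert: ToolsToolsDBInstanceDown
        # roughly situation 2: an instance (primary or replica) is not up at all
        expr: mysql_up == 0
        for: 5m
      - alert: ToolsToolsDBNoWritableInstance
        # roughly situation 1: no instance is currently up and writable
        # (note: the original `!= 1` check also caught the opposite problem of
        # more than one writable instance, which a real split should keep)
        expr: sum(mysql_up) - sum(mysql_global_variables_read_only) < 1
        for: 5m
```

With the down-instance case pulled into its own alert, the remaining "no writable instance" rule matches its name, which is the gist of the rename proposed at 13:54:05.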
[14:01:54] I can confirm mariadb did NOT restart, the uptime in "SHOW STATUS" is high
[14:02:06] I jumped to a conclusion too quickly
[14:03:24] dcaro: I'll open a task to rename that alert, and maybe split it into two alerts
[14:04:28] 👍
[14:06:29] inspired by our conversation earlier this week, and T391538, I created T397459
[14:06:29] T391538: `become` command not working properly on login-buster.toolforge.org - https://phabricator.wikimedia.org/T391538
[14:06:29] T397459: Lock down tools-sgebastion-10 (login-buster.toolforge.org) to only members of tools with known dependencies on it - https://phabricator.wikimedia.org/T397459
[14:07:10] 👏 thanks!
[14:07:46] probably something to do early next week (so that we don't break people's stuff and then immediately go away for the weekend)
[14:08:15] but I added that to our meeting agenda to discuss today
[14:08:28] 👍
[14:09:01] sounds good
[14:10:56] created T397460
[14:10:56] T397460: [toolsdb] Revisit WritableState alert - https://phabricator.wikimedia.org/T397460
[14:20:57] I was about to vanish for the day but I see some interesting backscroll, are there any openstack blockers I should look at before I go?
[14:21:23] nothing was noticed on openstack side :/
[14:21:37] (so far at least)
[14:22:10] and there's an outage in progress? Or just slowness?
[14:22:23] outage but already fixed
[14:23:27] :( ok
[14:23:59] I'll go chase after my blue grosbeak then!
[14:24:03] nfs vm stopped working (so everything else suffered too)
[14:24:17] * dcaro looks up that weird bird name
[14:25:11] hmm... looks like a blue sparrow :)
[14:25:41] not a very rare bird overall, just doesn't belong here and supposedly there's one hanging around in the next town over.
[14:25:47] * andrewbogott really going now
[14:32:35] btw. is anyone rebooting workers?
[14:32:54] (there's a few stuck on nfs, I can go around kicking them if nobody is)
[14:34:08] there's a log for taavi running the cookbook some time ago, though there are no other logs/updates on it (maybe it does not log each worker?)
[14:34:34] dcaro: yeah, it's still running, will take a while but get there eventually
[14:34:49] awesome, thanks :)
[15:51:43] hmm... I might be going crazy, but I think I pushed 3 times to gerrit, and pipelines did not trigger (nor did they even try to), just pushed now and they ran :/