[08:01:08] Morning!
[08:02:30] greetings
[13:04:00] * andrewbogott waves
[13:09:00] volans: when you have a moment, I'd appreciate your thoughts on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1302236
[13:10:36] can someone please take care of the cloudlb and cloudservices parts of https://phabricator.wikimedia.org/T429285 ?
[13:12:06] andrewbogott: ack
[13:14:06] moritzm: I will
[13:14:51] thanks
[13:15:27] just a package update or does it need e.g. a reboot or something after?
[13:46:36] only needs an update of bird2
[13:50:52] ok -- all done
[13:56:12] thx
[14:51:53] "4 OSD(s) experiencing slow operations in BlueStore" -- is that something anyone is doing, or spontaneous?
[14:52:31] not me
[14:54:39] hm, I wonder if ceph will tell me /which/ osds...
[14:58:26] andrewbogott: they might be the ones doing deep cleaning right now: 5 active+clean+scrubbing+deep
[14:58:54] probably! I've never seen that alert/complaint before though
[14:59:15] true
[14:59:15] health: HEALTH_WARN
[14:59:16] 4 OSD(s) experiencing slow operations in BlueStore
[14:59:55] there was a way iirc to get that info, did you try with `ceph health detail`?
[15:00:11] osd.58, 62, 107, 109
[15:00:15] oh, that works! I was using 'ceph status' which did not
[15:00:38] then you can check which nodes they are in with `ceph osd find` iirc
[15:00:41] I'm going to wait a bit and see if they recover. Meanwhile I see a couple of other OSDs that are just down altogether so working on those.
[15:01:02] 281 and 289 forgot their configs for some reason
[15:01:33] all in rack C8
[15:01:48] cloudcephosd1016/7 (2 instances per host)
[15:02:11] :/ hmm
[15:04:27] I don't see any unusual network activity
[15:04:43] 2026-06-16T14:42:42.081+0000 7f43842086c0 0 bluestore(/var/lib/ceph/osd/ceph-58) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.097474098s, txc = 0x55d25ab15200
[15:04:47] it might be the rockdb
[15:04:54] *rocksdb
[15:04:58] dcaro, happen to know how to convince an OSD to regenerate its config in /var/lib/ceph/osd ? Otherwise I can just remove/re-add those two lost OSDs
[15:05:23] restarting did not do the trick?
[15:05:30] (the osds I mean)
[15:05:32] no, fails for lack of config
[15:05:37] that's weird
[15:05:42] dcaro: how safe is it to run compact on the osd?
[15:05:46] "failed to fetch mon config (--no-mon-config to skip)"
[15:05:54] and the dir is empty
[15:06:11] andrewbogott: hmm... I think that might happen on osd init then
[15:06:25] better rebuild them imo
[15:06:41] volans: I have had no issues in the past, but I never ran it when slow ops were happening
[15:07:07] "the internet" says that it might help, that a potential cause for that slowness could be fragmentation in rocksdb
[15:07:25] oops it's catching, 2 more complaining now
[15:07:26] we can try, iirc there was a process that ran it from time to time (though I might be wrong, it's been a while)
[15:08:19] there is an osd_compact_on_start
[15:08:26] haven't checked if it's true
[15:08:27] yet
[15:10:40] if it only happens on start, might be good to roll reboot periodically (or manually compact, though I think roll reboot has also other benefits to reliability and such)
[15:11:07] andrewbogott: you can try restarting the osds with slow ops, that helped in the past
[15:11:13] I did upgrades and reboots yesterday
[15:11:18] so this is most likely a side-effect of that
[15:11:33] volans: want me to try restarting the complaining OSDs or do you have other ideas?
[15:11:52] I'm a newbie with regard to ceph maintenance
[15:12:48] ok, I'll try restarts first.
[15:12:48] so I don't have great ideas, I can dig more around
[15:12:55] ok start with one
[15:13:02] one host (maybe both osds)
[15:18:42] I restarted two of them, they cleared but two new ones 127/128 popped in to replace them
[15:19:40] still it might be helping, I'm going to restart the other complaining ones and see if I can catch up
[15:20:33] assuming we do compact on boot, and those were rebooted yesterday, why does restarting them now help?
[15:21:32] My only-partially-informed theory is: those 'slow' stats are not totally valid because they involve averages over a sliding window.
[15:21:38] restarting resets the window
[15:21:52] dcaro, does that sound like nonsense?
[15:21:59] are you confusing slow ops with heartbeats?
[15:22:07] probably!
[15:22:10] xd
[15:22:17] I thought that both involved that kind of average sampling
[15:22:32] but... why do you think restarting helps with slow ops rather than making them worse?
[15:22:32] I think slow ops do not, you can actually extract the operation that's getting stuck
[15:22:55] restarting them (I think) forces the op to be rescheduled and retried
[15:23:33] the operation is the one I pasted before
[15:23:34] bluestore(/var/lib/ceph/osd/ceph-58) log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.097474098s, txc = 0x55d25ab15200
[15:23:44] so in this case maybe they're rescheduled to other OSDs and then /those/ OSDs show up in the slow ops list
[15:24:07] and the internet points to failing hardware but from a quick check it seems fine
[15:24:30] and distributed over multiple hosts
[15:25:15] we have gone down many rabbit holes with slow ops yep xd
[15:26:15] andrewbogott: in the past retrying seemed to get them unstuck, that's one of the pointers that made me think that there was a specific sector of the drive that was misbehaving (pointing to the hardware issues that volans mentioned, and the counters rising for the dell drives)
[15:26:37] yep
[15:26:48] I don't think the issue now is confined to the 'cursed' servers but I haven't checked closely
[15:27:20] yep, last time we had slow ops, those were not the ones affected
[15:41:20] I've restarted all the affected OSDs and now it's showing HEALTH_OK but with some misplaced objects. Let's see if it can finish balancing without getting unhealthy again...
[15:42:46] 🤞
[15:46:25] k
[15:50:05] everything is cleaned up, still showing healthy
[15:50:17] and basically no idea what that was all about
[15:50:25] but now I will try to get those stopped OSDs back online
[15:52:31] did you rebuild those osds yesterday? (do you know how they got to that state?)
[15:53:03] just package upgrades and reboots
[15:53:14] but I don't know for sure that they were up before that
[15:56:26] okok, it's weird that it went away :/
[15:58:09] yeah
[15:58:40] ...and we're back
[15:58:43] https://www.irccloud.com/pastebin/n24XDgUN/
[15:59:01] :S
[15:59:16] different osd
[15:59:36] probably same host
[15:59:42] nope, it's one of the ones I restarted before
[15:59:47] I just restarted it again because why not?
[15:59:52] xd
[16:03:07] anyone want to come to the ceph sig meeting?
[16:03:25] meeting
[16:03:27] anything interesting happening? I'm kinda in the middle of a patch
[16:03:33] * dcaro in meeting
[16:03:55] not yet, probably talking about the next upgrade version
[16:03:57] and recent events
[16:40:12] I suspect the issue with those failed/stopped OSDs is crazy device assignments after reboot...
[16:40:16] https://www.irccloud.com/pastebin/dyC5SKKX/
[16:40:26] not sure what to do about that other than reboot again and hope to get lucky...
[16:45:40] 🤦‍♂️ that again?
[16:46:14] didn't matthew fix that in puppet/partman?
[16:51:33] maybe! Would that help with reboots though?
[16:51:46] * andrewbogott is rebooting and hoping to be lucky
[16:53:56] is the problem re-ordering of disk numbering in linux?
[16:54:21] that's what I'm assuming
[16:54:25] well, disk lettering
[16:54:51] although in this case I think it's that one of the volumes simply didn't present at all
[16:54:59] so that might be unrelated to the shuffled letters
[16:55:31] * andrewbogott lols at the drives having different, even weirder assignments after reboot
[16:56:27] oh, but the reboot caused the broken osd to mount!
[16:56:58] this was some related work https://phabricator.wikimedia.org/T324670
[16:57:06] So I think we have two things going on: 1) silly drive re-lettering on each boot which I take to be harmless 2) a boot race that means sometimes not all the volumes are present after a boot
[17:02:27] well, three things, I don't think either of those is causing the slow ops
[17:03:10] maybe if I eat lunch everything will be fixed when I get back
[17:18:05] bugs on bugs on bugs, bugs all the way down!
[17:53:57] andrewbogott: I see you're trying the compact options, is it helping?
[17:55:34] volans: I don't know if I'll be able to tell since it takes a while for the issue to return.
[17:55:52] I opened T429387 but I'm still largely convinced that this is not an actual problem.
[17:55:53] T429387: cloudceph "HEALTH_WARN 17 OSD(s) experiencing slow operations in BlueStore" - https://phabricator.wikimedia.org/T429387
[18:03:45] ack
[18:05:04] 131 is back in the list, so compact/restart doesn't seem to fix it
[18:44:36] hmpf, that would have been nice
[18:55:10] * dcaro off
[18:55:17] good luck! Cya tomorrow!
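
The BlueStore message pasted at 15:04:43 and again at 15:23:34 has a regular shape, so the OSD id, operation name, and latency could be extracted mechanically when comparing many such lines across hosts. What follows is only a minimal Python sketch: the regular expression is modelled on that single pasted line and may not match other BlueStore messages, and the names `SlowOp`, `SLOW_OP_RE`, and `parse_slow_op` are hypothetical, not part of Ceph.

```python
import re
from typing import NamedTuple, Optional


class SlowOp(NamedTuple):
    """Fields pulled out of one BlueStore slow-operation log line."""
    osd_id: int
    operation: str
    latency_s: float


# Assumption: the pattern is derived only from the one line pasted in the
# chat; other Ceph versions or other messages may be formatted differently.
SLOW_OP_RE = re.compile(
    r"bluestore\(/var/lib/ceph/osd/ceph-(?P<osd>\d+)\)\s+"
    r"log_latency_fn slow operation observed for (?P<op>\w+), "
    r"latency = (?P<latency>[\d.]+)s"
)


def parse_slow_op(line: str) -> Optional[SlowOp]:
    """Return the parsed fields, or None if the line does not match."""
    match = SLOW_OP_RE.search(line)
    if match is None:
        return None
    return SlowOp(
        osd_id=int(match["osd"]),
        operation=match["op"],
        latency_s=float(match["latency"]),
    )


# The line exactly as it was pasted in the channel.
SAMPLE = (
    "2026-06-16T14:42:42.081+0000 7f43842086c0 0 "
    "bluestore(/var/lib/ceph/osd/ceph-58) log_latency_fn slow operation "
    "observed for _txc_committed_kv, latency = 5.097474098s, "
    "txc = 0x55d25ab15200"
)

if __name__ == "__main__":
    print(parse_slow_op(SAMPLE))
```

Feeding each OSD's log through `parse_slow_op` and grouping by `osd_id` would show whether the slow operations cluster on particular OSDs or hosts, which is the question the channel kept returning to.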