[02:42:24] OK, before I go to bed... I got the rest of cloudcephosd1010 pooled and now I have a script running in a screen session that's adding the OSDs in 1037 one by one. It is taking forever, probably will still be in progress when you start working tomorrow.
[02:43:52] There was at least one alert about the network being flooded -- that happened before I started pooling 1037 but might've been something with 1010. Got some pointed questions in -operations and linked to https://grafana.wikimedia.org/goto/6Ay4ilrIg?orgId=1
[02:44:38] Seems like traffic spikes right when a drive is added and then drops down to pretty slow.
[02:44:59] That's all I know for now! Probably not much to be done except watch 'ceph status' and wait for 1037 to come online
[02:46:23] oh and 1038 is imaged and should be ready to go
[06:37:56] thanks!
[07:58:33] Hmm, according to that graph, we have peaks of effectively 117Gbit/s, when the interface is 10000Mbit
[07:58:35] https://www.irccloud.com/pastebin/aOgwTImZ/
[07:58:45] something does not match xd
[07:59:12] aaaahhh, that's the sum of all nodes xd
[08:01:01] that makes sense then: when the rebalancing starts it has to shift data around the whole cluster, so there's a cluster-wide peak, then little by little the data gets in place and it's only moved to the new drive (so it can't move as much data at the same time -- single drive and network NIC limits)
[08:08:25] btw. I don't see any screen/tmux on 1037 :/
[08:12:06] I'm thinking that adding 1038 to the cluster might not reduce the used space at all, as it's added on the same rack as 1037 (F4), so the racks are still imbalanced (C8, F4, E4)
[08:12:07] https://www.irccloud.com/pastebin/siDj1nom/
[08:12:11] current balance
[08:12:23] the limiting factor there will be E4
[08:12:27] (bottleneck)
[08:44:09] * dhinus paged ToolsToolsDBWritableState
[08:44:36] I don't see it in alerts.wikimedia.org so it was probably a short blip
[08:47:49] I did not notice either :/
[08:48:06] ah, it was actually ack'd yesterday by dcaro and the ack expired after 24 hours. I think victorops failed to receive the "resolved" message
[08:48:21] oh, that's not nice
[08:48:39] sorry for the mess, there were a couple of others
[08:48:42] no probs
[08:49:04] not your fault, you acked the first one and it should have auto-resolved
[08:49:20] the thanos graph confirms there was a blip yesterday around 8:30 UTC
[08:49:38] yep, it was due to the ceph cluster misbehaving when adding the new node all at once
[08:49:39] which quickly resolved
[08:49:58] just resolved in splunk
[08:50:01] thanks
[08:50:14] I think I was surprised yesterday b/c I didn't receive a "resolved" email either
[08:50:24] so it's not a problem with victorops, but with alertmanager
[08:50:54] maybe the email also got lost due to the ceph issues?
[08:51:16] maybe, it should have been sent by the metricsinfra prometheus I think
[08:51:34] so that's a VM, which might be affected by ceph misbehaving
[08:52:47] btw. do you have any idea what is 'trashing' images on ceph? I remember there were some changes in the cleanup scripts and such, but I'm not sure
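A minimal sketch of how that rbd trash / snapshot state can be inspected from the CLI; the pool and image identifiers are placeholders, not the cluster's actual names:

```
# List the images currently sitting in the RBD trash for a pool
rbd trash ls --pool <pool>

# A trashed image's snapshots aren't reachable directly; restore it first, then list them
rbd trash restore --pool <pool> <image-id>
rbd snap ls <pool>/<image-name>
```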
[08:53:11] and the trash is not cleaning properly, as there are still snapshots lingering for the images, but those are not reachable unless you restore the image :/
[08:53:13] (or so it seems)
[08:53:21] yep, known issue I think
[08:53:26] let me find the task
[08:53:49] T358774
[08:53:50] T358774: [wmcs-backup] Backup snapshots of deleted volumes are never cleaned up - https://phabricator.wikimedia.org/T358774
[08:53:58] does it match what you're seeing or is it a different issue?
[08:54:32] hmm the task is about //volumes// and you're talking about //images//, but maybe it's the same thing
[08:54:45] not sure we take backups of images though
[08:54:57] yep,
[08:55:02] that's exactly it :)
[08:55:19] yep, rbd images (openstack volumes in this case, cinder pool)
[08:55:38] ok
[08:55:57] I started looking at the backup/cleanup script but it's long and complicated :P
[08:56:51] so I eventually gave up -- feel free to give it a go, otherwise I might look again myself, but not sure when
[08:58:14] if you just want to do some manual cleanup, there's a tip in my last comment on the task for deleting things without restoring them first, but it's still a bit tedious
[09:14:20] yep, I'll try to give it a go
[09:14:35] I had two big refactors for it a long, long time ago that never got merged
[09:14:42] that would have been nice to have now :/
[09:14:52] and moving it to its own repo
[09:15:04] (so we can split the file into an actual python module, have nicer tests, ...)
[09:15:16] won't do that now though
[09:27:18] dhinus: quick review (type fixes and one bugfix that popped up with it) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060771
[09:52:07] dcaro: sorry, I didn't see the ping. I'm looking now... is mypy not running on the ci?
[09:52:18] nope
[09:52:38] (we might be able to add it now, back then it was not doable)
[09:52:59] ok!
[09:53:09] I remember adding some tests but I didn't notice mypy was missing
[09:53:39] we should also be able to switch the ci to py3.11, which is the version used on the backup servers
[09:56:19] that's nice, I was still trying to keep it 3.7 compatible xd
[11:37:50] * dcaro lunch, will be a bit late for the coworking space
[12:40:55] dcaro: the screen session is on cloudcumin1001. It's a little more than halfway through.
[12:55:47] ack. unfortunately, the limiting factor for the cluster fillup is rack E4 now, it already has less capacity than F4, so adding more to F4 does not add more overall capacity to the cluster
[12:55:53] (once D5 is out)
[12:56:25] andrewbogott: I'm removing all the expired images in the trash (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060784)
[12:56:38] it's taking a bit, but should free a big chunk of space
[12:56:48] great!
[12:57:15] hm are 1037 and 1038 both F4?
[12:58:14] yep
[12:58:28] we got one in C8, one in D5, and two in F4
[12:59:08] ok. Should I ctrl-c the pooling of 1037 so we can go back to draining things?
[12:59:35] maybe, yes
[12:59:42] are you using the cookbook?
[12:59:46] yes
[12:59:52] okok
[13:00:05] well, multiple cookbook runs, one per osd
[13:00:30] you can pass --batch-size=1 and it would be the same
[13:00:33] so I assume that right now all the cookbook is providing is a play-by-play as this osd pools, so if I interrupt we can just watch 'ceph status' instead
[13:00:42] yep
[13:01:15] there's also a cookbook `wmcs.ceph.wait_for_rebalance`
[13:01:19] ok, done. I imagine it'll still take an hour or more for this osd.283 to finish balancing though
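Besides the `wmcs.ceph.wait_for_rebalance` cookbook mentioned above, the progress of a rebalance like this can be followed with plain ceph commands -- a minimal sketch, nothing cluster-specific assumed:

```
# Recovery/backfill progress and the misplaced/degraded object percentages
watch -n 30 'ceph status'

# Per-OSD fill levels grouped by the CRUSH tree (hosts and racks), to spot hot spots
ceph osd df tree
```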
[13:01:25] I use it with a notify script I have
[13:01:48] it says ~1:20:00 left, but well, probably more, it slows down by the end
[13:04:37] oh, it sped up xd
[13:04:40] <1h
[13:09:09] I'll believe it when it's actually done :)
[13:16:07] hmpf... puppet ci is using an old python, and can't use the new types
[13:22:53] less than 60% full \o/
[13:23:44] nice
[13:23:52] is that enough to drain everything we need to drain?
[13:27:05] It's been 20 minutes and the time remaining has increased from 30 minutes to 40 :)
[13:35:31] dcaro or dhinus could I get a quick review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1059958 ? I've tested it quite a bit but it still makes me nervous :)
[13:38:19] andrewbogott: I have no idea how the whole proxy thing works, so it's hard for me to review :/
[13:38:28] I can do a quick check for anything obvious...
[13:38:52] The delete-a-proxy part is easy, it's the decide-what-to-delete part that's scary :)
[13:40:03] can we test this in codfw?
[13:40:46] yeah, I have been
[13:41:09] although it's not hooked up right now because I re-enabled puppet last night
[13:41:37] LGTM
[13:42:06] (relying on your testing though)
[13:42:12] thanks
[13:43:39] you're right, it should 404 if there's no project
[13:48:50] dhinus: I'm currently on vacation in the middle of nowhere in northern Italy, I only briefly scanned the back chat, but if this is still an issue on Monday I can take a look.
[13:50:45] jbond: thanks for looking, btullis already came to the rescue and the problem is solved :)
[13:51:06] ahh great :)
[13:51:39] jbond: more importantly, I am in northern Italy, so if you go through Milan I can get you a pint :)
[13:52:36] * andrewbogott waves to jbond
[13:54:09] We are flying back on Saturday and will be in Milan for a couple of hours, probably from ~12:00 -> 15:00, when we head to the airport. However we will have a bunch of luggage and will probably just grab some food by the train station. But if you are around, PM me your number and I can WhatsApp you with more details on Saturday.
[13:55:06] we are currently about 40 km north of Asti
[13:55:16] * jbond waves to andrewbogott
[13:55:57] jbond: I'll ping you :)
[13:56:15] cool :)
[13:58:57] safe travels! nice to hear your whereabouts!
[14:09:45] We are (somewhat by surprise) visiting Jenna's family next week so I'm going to work short days, probably just an hour or two in the morning to keep up with meetings and email.
[14:10:26] And I'm saying that to ask: what do y'all think about me upgrading codfw1dev to C today or tomorrow and then being not-super-available if it goes haywire?
[14:11:06] (partly that appeals to me because then I can come back the following week, confirm that everything is still working, and upgrade eqiad. Trying to get that done before sabbatical.)
[14:12:44] I think it makes sense, worst case we'll have some codfw issues but that doesn't seem too bad
[14:13:00] and I agree it would be useful to let the new version run in codfw for a few days
[14:13:13] ok!
[14:13:33] I might get to that this afternoon yet -- other than the designate thing the changelogs don't show anything of real interest
[14:13:52] dcaro: rebalance is done! Shall I drain cloudcephosd1014?
[14:14:38] And also: what % should I watch for as a warning sign that means 'stop draining things'? I've been watching the 'capacity used' gauge on https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview?orgId=1
[14:14:49] yes please, and if you can use https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060173 to test it, even better -- I changed all the options to be similar, and to wait by default
[14:17:02] the cluster is complex :/, so it's more of a reactive thing: if the cluster starts having 'slow operations' that's a clear sign it's going too fast (it should clear by itself); there's also a 'not enough space to backfill, add space if it does not clear' kind of warning, that one might happen, just wait for it to go away; you can try checking the individual 'osd df' to see if any is going too high, but at the end of the rebalance all should be similar (~2 standard deviations)
[14:18:25] ok, 1014 is draining
[14:18:36] Ok, so if we run it (slowly) to say 85% filled, that's OK?
[14:18:50] 85% is too much
[14:19:00] (sorry I just got the question xd)
[14:21:50] this is what we have now
[14:21:51] https://www.irccloud.com/pastebin/fxsc0UvF/
[14:22:19] ok, so 80% is the lucky number
[14:22:21] it can be reconfigured a bit, but things might start failing
[14:22:23] unlucky number
[14:22:33] we can bump it to 85 for a bit if needed
[14:22:53] though that 80% is not really doing anything, it just shows a warning
[14:23:21] it's the 90% that starts stopping activities in the cluster, so we can say right below 90% is the max spot
[14:23:46] hitting 90% means that we should free space asap, hitting 95% the cluster is read-only
[14:24:09] (90 means that there's no backfill happening, so no recovery, though kinda functional)
[14:24:17] oh, ok! seems like we'll be able to drain everything and stay under 90
[14:24:32] 🤞
[14:24:34] (me has not actually done that math)
[14:26:00] hmm, got a trashed image that is failing to purge
[14:26:02] https://www.irccloud.com/pastebin/ZOVigSne/
[14:26:34] that's using my scary trash-and-purge script?
[14:27:29] oh, did not know you had a script!
[14:27:36] I wrote a subcommand for wmcs-backups
[14:27:37] xd
[14:27:38] rbd: error removing snapshot(s) 'snapshot-8ea087a6-6997-41b2-928b-f0293d41e3d1', which is protected - these must be unprotected with `rbd snap unprotect`.
[14:30:52] I thought you merged it, now I'm trying to find where that was...
[14:31:44] maybe... my memory is not awesome
[14:33:28] nope, I was confusing it with your patch
[14:34:00] dang, dhinus remember that wmcs script I wrote that untrashes rbd images, purges, retrashes, and then empties the trash? Which you thought was too scary to merge?
[14:34:06] I have lost it :/
[14:34:12] hahahaha
[14:35:10] Ah, ok, here it is https://gerrit.wikimedia.org/r/c/operations/puppet/+/999218
[14:35:43] dcaro: use with caution but that should cure what ails you
[14:37:23] andrewbogott: yep I was looking for that one earlier and couldn't find it!
[14:37:34] gerrit is sneaky
[14:37:43] should I add a big warning to the top of it and then merge it so we can find it next time?
[14:37:46] I think dcaro's patch from today is the slightly safer version of that
[14:37:53] yeah
[14:38:12] but also slightly less uber-powerful apparently :)
[14:38:30] safe and powerful, pick one :P
[14:42:24] * andrewbogott merged it
[14:45:36] oh, I sent an update to the other one https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060861
[14:45:40] xd
[14:45:45] (just read the messages)
[14:46:30] your fix is better in the long run
[14:47:37] using the libraries directly is neat though, we should do that (iirc it was not so easy at the beginning, not sure anymore)
[14:53:31] the rebalancing is now ~5x faster than the previous one :)
[14:56:50] I missed something, what made things faster? Just freeing up space?
[14:58:50] no idea
[14:59:08] probably that it's just the beginning of the rebalancing, it might slow down
[15:38:08] andrewbogott: I've added a note to the %use graphs for ceph with the limits and the outcomes
[15:38:10] https://usercontent.irccloud-cdn.com/file/VMRiZEOg/image.png
[15:38:17] not much, but should help
[15:38:41] yep, thanks
[15:41:39] the limits are there, I tried to make the graph bigger so it shows the numbers better (they overlapped before), feel free to tweak it a bit more
[15:41:41] https://usercontent.irccloud-cdn.com/file/whxXtDmF/image.png
[17:00:08] * dcaro off
[17:00:44] @cteam feel free to ping me if ceph goes bonkers or similar, I'll try to keep an eye on it too, but just in case
[17:01:16] dcaro: ack thanks!
[17:01:30] cya!
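For reference, the 'warning / no backfill / read-only' thresholds discussed earlier correspond to Ceph's nearfull, backfillfull and full ratios. A minimal sketch of how to read them (and, in an emergency, nudge one); the example value is illustrative, not this cluster's configuration:

```
# Show the currently configured ratios (nearfull = warning, backfillfull = backfill stops, full = read-only)
ceph osd dump | grep -E 'nearfull_ratio|backfillfull_ratio|full_ratio'

# Temporarily raise the warning threshold during an operation (example value only)
ceph osd set-nearfull-ratio 0.85
```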