[05:37:56] dcaro, I'm starting the drain of cloudcephosd1020 and then going to sleep. I think we're going to wind up a bit over 80% usage, so if there's an easy way to adjust the warning threshold that's probably worth doing. The drain cookbook is running in a screen session on cloudcumin1001.
[12:28:58] I'm fixing the issue with the VM backups (added an exception when it should have been just a log)
[12:29:37] the ceph cluster is currently in warning status due to being too full; I'll leave it over the weekend to try to rebalance and address it on Monday
[12:29:56] (the osds that are too full right now are actually the ones in D5, which will be drained)
[12:35:16] thanks dcaro! while it rebalances it should stay around 85%, correct?
[12:35:37] as in, it should not increase further, and if it does there might be something wrong?
[12:36:11] yep, it should actually decrease to ~70%; worst case it stays the same (if it decides there's nothing to rebalance), but it should definitely not increase
[12:36:19] ok!
[12:41:25] hmm.. the backups are getting postgres errors
[12:41:26] Aug 09 12:40:40 cloudbackup1003 wmcs-backup[263233]: psycopg2.ProgrammingError: named cursor isn't valid anymore
[12:44:21] this is nice, progress in the process name xd
[12:44:23] https://www.irccloud.com/pastebin/SOQVRcMJ/
[12:45:08] it should finish now though, I'll check in later, cya in a bit!
[13:16:26] dcaro: what should we do about draining cloudcephosd1023 and cloudcephosd1024? Should I drain one if/when the cluster goes out of error state? Or do we need to reschedule the switch maintenance?
[13:18:25] andrewbogott: you can start with the OSDs that are fullest if you want; right now it needs some hand-holding
[13:19:56] Check with `ceph osd df D5`, pick the fullest one, then the next, and so on
[13:20:46] ok -- but only after it's out of error state?
[13:21:08] (Also -- is it going to keep trying to pack things into D5 because it thinks it needs replicas there?)
[13:33:13] we can do it while it's in warning, since we are taking out the osd that's fullest and spreading the data around
[13:33:26] ok, I'll try one now then
[13:35:08] okok, /me around if needed
[13:35:16] I can't tell if it's doing anything :) ceph status doesn't show any rebalancing
[13:36:10] oh, because I typed the wrong command...
[13:36:51] which one did you take out?
[13:36:52] aaah xd
[13:37:33] I did
[13:37:36] https://www.irccloud.com/pastebin/tClwFgno/
[13:37:54] but it seems like maybe the cookbook won't forward the command to ceph if the cluster is unhealthy?
[13:38:04] '0:01:16.387144 have passed, but the cluster is still not healthy, waiting 0:00:10 (timeout=0:30:00)...'
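The cookbook output quoted at 13:38:04 suggests a simple poll-until-healthy loop with a per-check wait and an overall timeout. The sketch below only illustrates that pattern and is not the actual drain cookbook: the `wait_for_health_ok` name, the `ceph health` subprocess call, and the 10-second / 30-minute defaults (taken from the quoted log line) are all assumptions.

    # Minimal sketch of a "wait until the cluster is healthy" loop like the one
    # the quoted cookbook log line suggests. Assumes `ceph` is on PATH; the
    # helper name and timeouts are illustrative, not the real cookbook code.
    import subprocess
    import time
    from datetime import timedelta


    def cluster_is_healthy() -> bool:
        """Return True when `ceph health` reports HEALTH_OK."""
        out = subprocess.run(
            ["ceph", "health"], capture_output=True, text=True, check=True
        ).stdout
        return out.strip().startswith("HEALTH_OK")


    def wait_for_health_ok(poll: int = 10, timeout: int = 30 * 60) -> None:
        """Poll cluster health, giving up after `timeout` seconds."""
        start = time.monotonic()
        while not cluster_is_healthy():
            elapsed = timedelta(seconds=time.monotonic() - start)
            if elapsed.total_seconds() > timeout:
                raise TimeoutError(f"cluster still unhealthy after {elapsed}")
            print(
                f"{elapsed} have passed, but the cluster is still not healthy, "
                f"waiting {timedelta(seconds=poll)} (timeout={timedelta(seconds=timeout)})..."
            )
            time.sleep(poll)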
[13:38:21] yep, by default it waits for the cluster to be healthy; you can try --force
[13:38:33] * andrewbogott tries it
[13:39:08] (or, worst case, take out the osd manually: `ceph osd crush reweight osd.XXX 0` + `ceph osd out osd.XXX`)
[13:39:22] I see remapping started
[13:39:29] and recovery was triggered
[13:39:38] yeah, --force did something
[13:40:04] let's see where it places the data, it should move out of the D5 rack
[13:41:58] if that doesn't help, I'd wait until Monday and try draining the rest of the D5 rack all at once, to force it to put the data on the other racks
[13:42:24] (given that it's Friday, and a holiday)
[13:43:09] the total usage of D5 seems to be going down
[13:43:10] TOTAL 26 TiB 21 TiB 21 TiB 1.4 GiB 67 GiB 5.2 TiB 79.97
[13:43:15] TOTAL 26 TiB 21 TiB 21 TiB 1.4 GiB 67 GiB 5.3 TiB 79.80
[13:43:20] that's good
[13:43:33] 6 nearfull osd(s)
[13:43:34] cool
[13:43:47] yep, it seems to have forced it to move stuff out of D5, nice
[13:43:57] (we had more before)
[13:44:28] yeah, so I'll continue to drain the >80% nodes and see if that cheers things up
[13:44:57] yep, ping me if anything goes awry (telegram if I don't reply)
[13:45:02] thanks!
[13:45:07] np
[15:08:29] hm, it seems like it's doing at least some rebalancing into D5 still. 176 is up to 89% now :(
[15:08:52] Guess I'll give it a break and see what happens
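For the manual path described at 13:19:56 and 13:39:08 (find the fullest OSDs under the D5 rack, then reweight/out the worst one), a rough sketch is below. It assumes the rack bucket is literally named "D5" in the CRUSH map and that `ceph osd df tree -f json` can be run from the host; it only prints the `ceph osd crush reweight` / `ceph osd out` commands from the chat rather than executing them.

    # Sketch: list the fullest OSDs under a CRUSH rack (e.g. D5) and print the
    # manual drain commands mentioned in the chat. Assumptions: the rack bucket
    # is named "D5" and `ceph` is runnable here; nothing is changed on the
    # cluster, commands are only printed.
    import json
    import subprocess
    import sys


    def osd_df_tree() -> dict:
        out = subprocess.run(
            ["ceph", "osd", "df", "tree", "-f", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        return json.loads(out)


    def osds_under(bucket_name: str, nodes: list) -> list:
        """Collect the OSD leaf nodes below the named CRUSH bucket."""
        by_id = {n["id"]: n for n in nodes}
        stack = [n for n in nodes if n["name"] == bucket_name]
        osds = []
        while stack:
            node = stack.pop()
            if node["type"] == "osd":
                osds.append(node)
            else:
                stack.extend(by_id[c] for c in node.get("children", []) if c in by_id)
        return osds


    if __name__ == "__main__":
        rack = sys.argv[1] if len(sys.argv) > 1 else "D5"
        osds = sorted(osds_under(rack, osd_df_tree()["nodes"]),
                      key=lambda o: o["utilization"], reverse=True)
        for osd in osds:
            print(f"{osd['name']}: {osd['utilization']:.2f}% used")
        if osds:
            fullest = osds[0]["name"]  # e.g. "osd.176"
            print("# to drain the fullest one manually:")
            print(f"ceph osd crush reweight {fullest} 0")
            print(f"ceph osd out {fullest}")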