[04:29:57] dcar.o: I'm about to go to sleep; I've repooled a bunch of osds but there are a lot left to go (obvious from 'ceph osd tree'). It seems to go a little faster if I pool one osd on each of several nodes rather than multiple osds on a single node, so I've been doing that.
[04:30:51] The only wrinkle from today is that there are 4 osds on 1011 that are 'in' but empty; I assume they're in some in-between state that's invisible to the cookbooks, so I'm hoping you'll be able to convince ceph to start using them.
[07:09:50] ack thanks
[13:33:23] ceph is so much happier now!
[13:33:45] Seems like there's a wide range in how long it takes for a given osd to populate, but maybe that's my imagination
[13:41:04] depends on how much space it has to do in-place things too
[13:41:19] so the healthier it is, the easier it gets healthier xd
[13:42:02] I'm adding them in batches of one osd per host
[13:42:29] that way the network traffic is more or less spread out (the rack switch is still a bottleneck, but the host card is not)
[13:43:52] it feels like keeping a bonfire alive xd, you just throw in some logs from time to time; if you put in too many it doesn't burn, if you put in too little, it doesn't heat enough...
[13:54:35] yep :)
[13:55:01] re: pooling one osd per host, it also avoids maxing out the cpu, right? Or is cpu never the bottleneck?
[13:55:47] as far as I have seen, cpu was not the bottleneck (that I know of at least, maybe it's self-limiting), ram wasn't either, though it always tries to stay in the high 90%
[14:00:12] 'k
[14:01:42] we have one of the new 103x hosts ready to pool in that rack too, right? Should we start adding those osds one by one as well?
(I'm not sure if the cookbook supports that for a brand new host)
[14:11:03] I'm doing it by hand for now
[14:11:16] but the cookbook should be able to handle new hosts without issues
[14:11:32] (bootstrap_and_add if there are no osds on it yet, so it formats + adds the osds)
[14:11:46] though it's currently balancing the cluster, so you might have to pass --force
[14:12:32] (and maybe --no-wait... as it will wait for the rebalance to finish, though that will add all the osds :/ )
[14:14:03] I'd wait until the cluster is free xd
[14:22:21] btw. andrewbogott had to do this to get some cookbooks running https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1062964, with regards to auth, is that expected?
[14:22:47] (the stack of patches)
[14:24:32] Not exactly expected but it makes sense
[14:25:50] okok, I wasn't sure if I was doing something wrong
[14:26:48] There's some ambiguity in the openstackcli between 'use this project for auth' vs 'act on this project'. In some cases it takes a --project-id flag for the 'act on' question, but not always
[14:28:17] Which means I'm frequently surprised
[14:28:23] xd
[14:46:48] * dcaro going to stretch my legs before the team meeting
[14:56:10] dhinus: can you fill in the 'web services' section in the etherpad if you know anything?
[14:58:02] andrewbogott: I'm working on some patches for quarry/superset but they're not complete yet. I'll add them next week.
[14:58:31] I don't remember other things related to web services
[14:59:47] that's probably all :)
[14:59:48] thx
[15:15:38] andrewbogott: we might want to get https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1062964 merged
[15:16:07] isn't it?
[15:16:56] wrong link xd https://gerrit.wikimedia.org/r/c/operations/alerts/+/1062962
[15:17:09] (firefox lately does not update the navigation bar correctly :/)
[15:18:33] lgtm!
[15:19:26] thanks
[15:23:19] dcaro: let me know when you want me to take over tending the campfire
[15:24:37] andrewbogott: I think the current batch will already run past the end of my day, so you can take over now (I'm manually running `for osd in 114 138 146 178 185; do ceph osd crush reweight osd.$osd 1.74657 && ceph osd reweight osd.$osd 1 || break; done` for a manually chosen set of osds that belong to different hosts, as the cookbook can't do that)
[15:24:52] but feel free to use the cookbook instead, the worst case scenario is that it goes a bit slower xd
[15:25:09] ok! I'll check the status in a while
[15:54:03] dcaro: that new alert is already firing :/
[15:54:13] oh, that's not good
[15:54:26] feel free to check the runbook xd
[15:55:36] hm, wait, now I don't see it
[15:55:49] nevermind I guess
[15:56:30] hmm, I get stats now
[15:56:37] might have been temporary
[15:56:59] the mgr daemon is the one gathering the stats, I suspect that when it gets too loaded it might take too long
[15:57:01] I got a text '5065: CephClusterInUnknown wmcs (ceph,cloudvps eqiad) - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&sea... (Ack: 11578, Res: 45103)' but it doesn't show up on alerts.wikimedia.org
[15:57:16] I tweaked some options to try to make it a bit more reliable
[15:57:19] I acked it
[15:57:27] ah, now I see it :)
[15:57:43] so you think it'll recover on its own?
[15:57:46] T372528
[15:57:47] T372528: [ceph] Metrics started not responding during the drain - https://phabricator.wikimedia.org/T372528
[15:57:52] with the details of what I did today
[15:58:51] ok!
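(Editor's note: the one-per-host selection in dcaro's loop above was done by hand. As a sketch, the same choice could be automated from `ceph osd tree -f json` output, which lists host buckets with their child osd ids; `next_batch` below is a hypothetical helper, not part of the wmcs-cookbooks, and assumes unpooled osds show up with `crush_weight == 0`.)

```python
def next_batch(tree):
    """Pick at most one unpooled OSD (crush_weight == 0) per host.

    `tree` is the parsed output of `ceph osd tree -f json`. Choosing one
    osd per host spreads backfill traffic across host NICs, so the rack
    switch, not a single host, is the only bottleneck.
    """
    nodes = {n["id"]: n for n in tree["nodes"]}
    batch = []
    for node in tree["nodes"]:
        if node.get("type") != "host":
            continue
        for child_id in node.get("children", []):
            osd = nodes.get(child_id)
            if osd and osd.get("type") == "osd" and osd.get("crush_weight", 0) == 0:
                batch.append(osd["name"])
                break  # only one osd from this host per batch
    return batch

# Tiny fabricated tree: two hosts, each with one pooled and one empty osd.
sample = {"nodes": [
    {"id": -2, "name": "cloudcephosd1011", "type": "host", "children": [0, 1]},
    {"id": -3, "name": "cloudcephosd1012", "type": "host", "children": [2, 3]},
    {"id": 0, "name": "osd.0", "type": "osd", "crush_weight": 1.74657},
    {"id": 1, "name": "osd.1", "type": "osd", "crush_weight": 0},
    {"id": 2, "name": "osd.2", "type": "osd", "crush_weight": 0},
    {"id": 3, "name": "osd.3", "type": "osd", "crush_weight": 1.74657},
]}

print(next_batch(sample))  # → ['osd.1', 'osd.2']
```

Each name in the batch could then be fed to the `ceph osd crush reweight` / `ceph osd reweight` pair from the loop above.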
[15:59:18] we can try making the alert less flaky by increasing the time it takes to trigger
[15:59:47] it's already 5m though :/
[16:00:35] wait, the metric it's using is the one I was using for testing 🤦‍♂️
[16:01:27] * andrewbogott blames the code reviewer
[16:02:14] this one uses the right one https://gerrit.wikimedia.org/r/c/operations/alerts/+/1063017
[16:05:47] merged, the alert should vanish soon, sorry about that xd
[16:06:13] np
[16:32:41] I think I'm going to call it a day, feel free to page me if anything happens, cya in a week!
[16:32:45] * dcaro away
[16:39:02] * andrewbogott waves
[16:49:00] * andrewbogott gives ceph a big pile of new osds and wanders off
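(Editor's note: "increasing the time to trigger" is the `for:` field of a Prometheus alerting rule. A sketch of the shape of such a rule follows; the expression, metric name, labels, and durations here are assumptions for illustration, not the actual rule in operations/alerts.)

```yaml
- alert: CephClusterInUnknown
  # Assumed expression: fire when the mgr stops exporting health metrics.
  # The real rule may key off a different metric or labels.
  expr: absent(ceph_health_status)
  for: 15m  # raising this from 5m would ride out short mgr stalls under load
  labels:
    team: wmcs
  annotations:
    summary: "Ceph cluster health is unknown (mgr not reporting metrics)"
```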