[08:10:13] I'm tweaking some osds (ceph osd reweight .. 0.95) to force the data to rebalance better and get out of the backfill_full state
[08:59:48] hmpf... I think we could do with more placement groups, it seems it's having trouble trying to move those around as they are too big and the space left is not so much, but changing that now will take probably more than a day to settle, so might be better to just wait for the reboot this afternoon
[09:00:53] how big is 1 pg at the moment?
[09:01:55] and is there any downside of having many small pgs?
[09:02:10] https://www.irccloud.com/pastebin/T9fo5OoJ/
[09:02:21] around 14G
[09:03:28] it has to be scaled in powers of 2, so next would be ~8192, the downside is that it has double the amount of pieces to manage, so the mgr daemons will increase the load
[09:03:34] and rebalancing becomes slower
[09:03:36] I see
[09:03:41] (the planning part)
[09:04:02] we have the autoscaler set, that should be putting a "nice" value, that is that 4096 there
[09:04:43] got it. maybe we should just have more free capacity so there's more space to rebalance? or maybe the core issue is that we have too many hosts on one switch
[09:06:24] a mixture I guess
[09:07:05] makes sense yes
[09:09:49] having two switches per rack minimizes the amount of downtime (as upgrading a switch would be a noop, only catastrophes would bring the rack down, of which I have seen none in 3 years)
[09:10:29] having the nodes spread in more racks would minimize the impact of a switch going down, but increase the frequency (as we have to upgrade the switches periodically, and we would have more switches)
[09:11:08] having more space also minimizes the impact of a switch going down, without increasing the frequency, but decreases the cost/effectiveness (we have to have many empty drives)
[09:11:35] we might try the maths of drives cost vs. switch costs :P
[09:11:56] we did a bit in a spreadsheet at some point
[09:11:57] iirc
[09:12:01] ah cool
[09:13:25] now I don't find it though :/
[09:47:42] oh, it seems clouddb1016 is down?
[09:51:09] dhinus: are you reimaging it?
[09:51:29] yep xd
[09:51:30] !log fnegri@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1016.eqiad.wmnet with OS bookworm
[09:51:30] dcaro: Not expecting to hear !log here
[09:55:04] yep sorry
[09:55:18] I mentioned it in data-persistence but not here :)
[09:55:30] I silenced it though, did you see any alert?
[09:58:34] yep, there was an haproxy backend down alert
[09:58:50] https://usercontent.irccloud-cdn.com/file/5FleQP71/image.png
[10:11:48] hmm I thought I depooled it
[10:12:20] ah damn I forgot the depool, that's why!
[10:12:24] sorry
[10:12:53] I was following the checklist but somehow managed to miss it
[10:16:46] ok I found out why: I did run the depool command, but on the wrong host, so it did not have any effect :(
[10:51:17] xd
[12:50:15] hey guys... thanks for being understanding re: the time change for the D5 switch reboot/upgrade
[12:50:51] are we good to go with that? or what is the latest status?
[13:04:58] dcaro: what are your thoughts?
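A minimal sketch of the Ceph commands behind the rebalancing discussion above, for reference; the OSD id and pool name are placeholders, only the 0.95 reweight value comes from the log.

    # show per-OSD utilization to spot the ones closest to backfillfull
    ceph osd df tree

    # temporarily lower the reweight of a full OSD so data shifts elsewhere
    # (42 is a placeholder OSD id; 0.95 is the value mentioned above)
    ceph osd reweight 42 0.95

    # check what the pg autoscaler has picked for the pool
    ceph osd pool autoscale-status
    ceph osd pool get <poolname> pg_num

    # pg_num has to stay in powers of 2 (e.g. 4096 -> 8192), at the cost of
    # more PGs for the mgr/mon daemons to track and slower rebalancing:
    # ceph osd pool set <poolname> pg_num 8192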
[13:11:21] I'm gonna try deploying this Quarry change: https://github.com/toolforge/quarry/pull/61 -- Rook told me they're available for support
[13:11:51] I think I can try deploying this _before_ merging the pull request to check if it works
[13:12:10] ah there's also a merge conflict I should probably resolve first
[13:12:47] the conflict is just the pr- tag
[13:15:05] Rook: I'm in the quarry-bastion host, can I just use your user, or should I clone the repo in my home dir?
[13:16:14] You can use any user. I don't always cleanup after myself in my home dir but feel free to use it if you prefer. Just clear anything I have in my quarry clone back to main
[13:16:46] And yes, you should deploy off the branch first, and if it looks good deployed then merge into main (if it doesn't look good, just deploy off of main to revert)
[13:18:42] ok I cloned another copy in my home dir so I don't have to do any sudo, and copied the git-crypt key from your home dir
[13:19:01] 👍
[13:19:34] do I need to copy some k8s auth files?
[13:20:12] no, tofu should generate them for you. Should be the only thing tofu wants to generate. Just run `deploy.sh`
[13:21:29] ok!
[13:22:32] I checked out my branch T367415 and started "bash deploy.sh"
[13:22:32] T367415: Allow Quarry to query its own database - https://phabricator.wikimedia.org/T367415
[13:22:43] Awesome
[13:23:23] tofu created the kube.config file
[13:23:37] Excellent.
[13:23:46] Looks like the deployment is doing a rollout
[13:24:00] ansible did warn about "No inventory was parsed, only implicit localhost is available"
[13:24:23] Looks like everything was deployed
[13:24:30] Yeah, it only uses localhost in this case
[13:25:05] ok
[13:25:11] let's see if it's working!
[13:25:12] It is worth finding out how to suppress that warning since in this situation it is the desired action. Would you open a ticket for that?
[13:26:25] ok, I will create a phab task!
[13:26:52] tested one enwiki query: works fine. now testing a "quarry db" query
[13:27:45] ok that one is not working: "'Replica' object has no attribute 'database_name'"
[13:27:49] maybe just missing something in the config
[13:28:53] but I think I now understand better the deployment process, so I should be able to do more tests on my own!
[13:29:21] That's fine. You can update the patch and github will rebuild the image for you. If it doesn't look like a quick fix and might cause a problem to be running with the patch just do the same deploy off main and it will revert
[13:29:54] I will test a deploy off main
[13:30:32] At any rate glad that's working. Please open a ticket for any deploy/rollback process that requires more than `bash deploy.sh` off the desired branch. As that should be the only thing that needs to be run
[13:31:21] that's great, thanks. I was just worried "deploy.sh" might fail and I would be lost... but it worked very smoothly :)
[13:31:57] I added you to T372395 as it looks like superset will require a little updating before it can be updated. This is the thing that I don't like about superset, it has kind of limited support for k8s and they like to change little things which prevent it from deploying (as is the case at the moment).
[13:31:58] T372395: Improve idempotency detection with helm diff - https://phabricator.wikimedia.org/T372395
[13:32:14] ok thanks for looking into that!
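A rough sketch of the Quarry deploy/rollback flow as described above; per Rook the only required step is `bash deploy.sh` off the desired branch, and the repo checkout, key path, and branch name here are illustrative assumptions.

    # one-time setup in your own home dir (paths are illustrative)
    git clone https://github.com/toolforge/quarry.git && cd quarry
    git-crypt unlock /path/to/copied/git-crypt-key

    # deploy a candidate branch; tofu generates kube.config, ansible runs on localhost
    git checkout <branch>
    bash deploy.sh

    # if the deployed branch misbehaves, deploy main again to revert
    git checkout main
    bash deploy.sh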
[13:33:41] No problem, I won't likely get to it for a little while as I'm about to go out and do some things, but that appears to be the blocker on updating it at the moment (Well aside from the other upstream issue of T364022). Superset remains a little sus to me
[13:33:41] T364022: Upgrade to 4.0.0 - https://phabricator.wikimedia.org/T364022
[13:34:42] ok. let me know if you find a way around it, I'll follow the task for updates
[13:35:58] the quarry deploy of "main" worked. thanks for the help :)
[13:36:49] Perfect! Glad to help
[13:52:16] I'm downtiming all the hosts mentioned in T372353 in icinga. dcaro or dhinus want to do the same in alertmanager?
[13:52:16] T372353: Request creation of deployment_prep_s3 VPS project - https://phabricator.wikimedia.org/T372353
[13:52:29] huh, that's definitely not the right ticket
[13:52:38] T371878
[13:52:38] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
[13:56:21] andrewbogott: I can try, let's see if I find the right pattern matching :)
[13:56:51] maybe just listing them is easier
[13:59:12] andrewbogott: hey perhaps you can answer my question re the switch reboot?
[13:59:50] topranks: I missed it, what's the question?
[13:59:52] as in - are we good to go?
[14:00:07] Almost! I'll ping in -dcops
[14:00:15] I was out yesterday and earlier so might have missed some of the discussion
[14:00:27] andrewbogott: ok cool thanks I'll start the prep / downtiming things
[14:00:30] we're in good shape, just downtiming/shutting down a few things
[14:00:41] cool - no rush my side
[14:02:49] andrewbogott: I added a crazy regex in alerts.wikimedia.org that should silence any alerts for those hosts. emphasis on "SHOULD" :P
[14:03:02] dhinus: want to join #wikimedia-dcops?
[14:03:12] andrewbogott: ok
[14:03:17] dhinus: thanks! we'll expect some noise anyway.
[14:03:50] sorry, now I'm around
[14:04:07] got lost in apt virtual package meaning rabbit hole
[14:13:37] there's an alert from toolschecker: string 'OK' not found on 'http://checker.tools.wmflabs.org:80/etcd/k8s
[14:13:57] probably expected: "Connection to tools-k8s-etcd-22.tools.eqiad1.wikimedia.cloud timed out"
[14:14:51] yep, that's from taking down the etcd node
[14:14:56] (hosted in cloudvirtlocal)
[14:15:26] yeah, I powered down two etcd nodes
[14:58:20] Please open tickets or ping me for any questions or things that need updated. Though for now...
[14:58:21] * Rook vanishes
[14:58:39] for the toolforge monthly, we are still dealing with ceph, is it ok if we delay it a bit?
[14:58:54] dcaro: I was about to write the same :)
[14:59:06] I think we should delay it or possibly skip it if the switch takes longer
[14:59:44] I can move it tentatively 30 min in the future, with the option of moving it to next week if things get busy
[14:59:45] yep
[14:59:52] * bd808 will idle in the meet room
[15:57:21] ...are we having a toolforge outage?
[15:57:34] seems not, but it's alerting
[15:57:53] Oh, nevermind
[15:58:02] it's just the k8s node thing, which I caused :)
[16:52:57] it should not be so flaky though
[16:53:59] let's make a plan for if the switch seems like it's working but we want to give it a day :)
[16:54:41] If we put just one osd node (or half an osd node?) back in service can we keep things stable for that long without repeating this whole trip?
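For the alertmanager side of the downtiming above, one way to create a regex-based silence from the CLI is amtool; the instance pattern, duration, and alertmanager URL below are placeholders, not the regex actually entered in alerts.wikimedia.org.

    # silence everything matching the affected hosts for the switch window
    amtool silence add \
        --alertmanager.url=http://localhost:9093 \
        --comment="cloudsw-d5 reboot (T371878)" \
        --duration=4h \
        'instance=~"(cloudvirt|cloudcephosd|clouddb)1[0-9]{3}.*"'

    # verify the silence took effect
    amtool silence query --alertmanager.url=http://localhost:9093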
[16:57:06] it should be stable, if the network misbehaves though it might become unstable
[16:57:17] yeah
[16:57:32] (if it's completely off it's ok, if it's completely on it's ok, flapping is the troublesome state)
[16:58:44] the bots' issues, it could be the cloudgw/cloudnet too
[16:59:04] ?
[16:59:45] could be but that should all be ha
[17:00:59] every time it switches it makes the network flap no?
[17:01:09] (the VMs network)
[17:03:11] oh, probably
[17:03:20] cloudnet for sure does that
[17:03:56] so cloudnet1006 is currently the active node, that means it flapped at least twice (cloudnet1006->1005, 1005->1006)
[17:32:07] andrewbogott: I'll be calling it a day soon, I'd leave ceph as it is (nodes `up`, but not `in`) until we trust the network is stable, though if anyone tries to create a big image it might fail
[17:32:24] for the others, we can leave the cloudvirts also drained until tomorrow
[17:32:40] ok, sounds good to me
[17:33:00] I have people waiting for me to drive them to the beach so I'll be heading out soon too, but will wait until cathal is feeling confident.
[17:38:48] any idea why we run cadvisor on the cloudcephmon nodes? (it's using peaks of 25% cpu, though we have no containers, not even docker/podman installed)
[17:39:18] no idea
[17:44:34] anyhow, gtg
[17:45:07] * dcaro off
[17:45:09] cya tomorrow
[17:45:22] Thanks for working late dcaro
[17:45:38] and for generally keeping me calm during this unpleasantness
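For the "up but not in" state mentioned above, a small sketch of how the drained OSDs could be checked and brought back once the network is trusted again; the hostnames and OSD ids are placeholders.

    # cluster health, plus which OSDs are up but still weighted out
    ceph -s
    ceph osd tree

    # once the switch looks stable, mark the drained OSDs back in so backfill resumes
    ceph osd in osd.42 osd.43                  # placeholder ids
    # or everything under a given host in the CRUSH tree:
    # ceph osd in $(ceph osd ls-tree cloudcephosd10XX)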