[01:21:19] dcaro: (in case you are still awake) I am now draining cloudcephosd1013 and I upgraded the task description with the draining progress [02:55:30] I'll probably be asleep before this finishes. I also staged the new hosts (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060190) but I'm not actually pooling them because I suspect there's some per-rack crush map bits that I should leave to you. [06:57:55] ack [07:15:30] the puppet code (and ceph) automatically detect the rack bits, the only custom thing is making sure that the assigned internal ips are from the blocks of the specific rack (C+D, E and F), just fixed that [07:20:22] topranks (when you are around) I'm not seeing the second interface up for cloudcephosd1035 (`5: eno12409np1: mtu 9000 qdisc mq state DOWN group default qlen 1000`) [07:25:24] mtu is unset too, and the vlans are not the same as the rest :/ [07:25:29] hmm, looking [07:32:44] I'll wait, I don't think I have access to the switches yet [07:42:26] hmpf... I messed up the ips and now the node is unreachable xd [07:42:28] looking [07:42:37] dcaro: hmm... the link is set up correctly for cloudcephosd1035 on the switch [07:42:42] but the second link is down [07:42:57] I also can't ping/ssh to the host over the first link - although I was able to ~5 mins ago [07:43:13] I expect we'll need to get dc-ops to check the cabling for the second port [07:44:26] dcaro: actually scrap that.... I'd missed the interface was set to 'disable' :P [07:44:27] doh [07:44:29] it's up now [07:44:29] I just restarted the networking and it did not come back up, /me looking for the ipmi [07:45:03] yeah ifupdown2 is fairly shaky :) [07:46:05] I should have rebooted instead probably, I would have to anyhow to make sure it start up clean [07:47:11] yeah that's typically what I do [07:47:20] did you get onto it to reboot? I can do it if you want [07:48:33] just did with ipmi [07:48:45] should be coming up in a sec [07:49:49] ok [07:50:08] for the interface, do we have to explicitly set the mtu to 9000 on netbox? [07:52:28] we set it to 9192 in Netbox / on the switches by default [07:52:47] https://www.irccloud.com/pastebin/KTCCmrgc/ [07:54:01] dcaro: btw I'm adding you with super-user privileges, I think if you need one team member with it that's ok to be you [07:54:21] obviously please refrain from making any manual changes on the devices unless it's a total emergency [07:54:35] ack [07:54:41] most of the config is handled by homer/automation anyway [07:54:44] thanks [07:55:33] node is up and running :) [07:55:36] woot! [07:55:46] in terms of the jumbos it's controlled on the host then [07:55:59] we enable on the network everywhere - but hosts default to 1500 mostly [07:56:13] I think it might be added by puppet [07:56:30] for ceph [07:56:31] https://www.irccloud.com/pastebin/GFgmazFF/ [07:57:00] okok, so I'll do some cleanup on the others too, and start adding them to the ceph cluster [07:57:01] yeah I expect so [07:57:15] btw. 
the switch on other ceph nodes is part of the private network too [07:57:22] *switch interface [07:58:09] ex 1021 v [07:58:10] https://netbox.wikimedia.org/dcim/devices/3443/interfaces/ [07:58:40] 1035 shows nothing https://netbox.wikimedia.org/dcim/devices/5250/interfaces/ [07:59:20] oh, now when going to the switch specific interface (ex xe-0/0/42) I see the same vlans as 1021, okok [07:59:47] yeah - the IPs need to be set up on the host side for the private vlan [08:00:03] and then the puppetdb -> netbox import script needs to be run for them to show up there [08:00:04] one last question, from now on, instead of cloud-hosts1-eqiad, we should use cloud-hosts1-c8-eqiad right? (in-place replacement) [08:00:15] the vlan is on the switch-side though, so just needs the config on the host [08:00:22] ack [08:01:25] IPs are added via puppet but I can't remember the exact details [08:01:48] https://netbox.wikimedia.org/ipam/prefixes/653/ip-addresses/ [08:01:53] https://netbox.wikimedia.org/ipam/prefixes/654/ip-addresses/ [08:01:56] https://netbox.wikimedia.org/ipam/prefixes/655/ip-addresses/ [08:02:00] https://netbox.wikimedia.org/ipam/prefixes/656/ip-addresses/ [08:02:32] ^^ should be added to the right one of those based on the rack, with the dns_name set to .private.eqiad.wikimedia.cloud [08:02:41] I can allocate them there if it helps? [08:04:42] that has to be done manually? [08:05:02] (just asking, so I write it down somewhere for the next time) [08:05:19] cloudvirt1036.private.eqiad.wikimedia.cloud is there already [08:05:45] hmpf. that's a cloudvirt, ignore [08:12:25] yeah as things stand [08:13:12] okok [08:17:24] topranks: which rack is which ip block? Can I add a note in the description for each block mentioning the rack? [08:17:46] ah, I see it from the vlan VLAN production / cloud-private-d5-eqiad (1152) [08:17:49] nm [08:17:54] I don't think it's really neccecary, the blocks are attached to the vlan which has the rack name in it [08:17:55] https://netbox.wikimedia.org/ipam/prefixes/652/prefixes/ [08:31:13] 👍 [08:35:05] hmpf... ceph started having slow ops, should have added the node bit by bit [08:40:31] oh my..... the new ceph machine got the wrong hard drive for the OS [08:40:34] https://www.irccloud.com/pastebin/aAUxr0ca/ [08:40:56] that was supposed to be an osd drive [08:41:05] and this the os drive [08:41:07] https://www.irccloud.com/pastebin/N1OwvqE5/ [08:51:25] ceph seems happier now, starting to asses the damage and restoring any lost services [08:51:31] starting with static.toolforge [08:52:33] what host has the wrong disk layout? all the new ones? [08:52:34] static is replying now that nfs is back online [08:53:02] topranks: only checked 1035 for now (the one I just added to the cluster), but will check the others once I finish checking all the alerts [08:55:03] nfs hiccup [08:55:05] https://usercontent.irccloud-cdn.com/file/dkSPL5J2/image.png [08:59:14] dcaro: yeah I'm not sure what the scenario there is with the drives [08:59:27] it assumes sda and sdb will be the two smaller drives for the OS [08:59:32] topranks: yep, all the other nodes also have the wrong drives [09:00:11] do they have the wrong drives? looks more like they just aren't connected (or detected) where the automation expects? [09:01:04] well yep xd, the are using one of the big drives as OS, and one of the small drives as OSD [09:01:31] the raid is setup during the reimage right? that's parted? 
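(A side note on the disk mix-up above: a minimal way to spot it, assuming lsblk is available on the freshly imaged host or in a rescue shell, is to list whole disks by size and check whether the two smallest really are sda and sdb, which is what raid1-2dev.cfg blindly assumes. Illustrative sketch, not a command from the conversation.)

    # Whole disks only (-d), sizes in bytes (-b), no header (-n), sorted by size;
    # the two smallest entries should be the OS SSDs. If they are not sda/sdb,
    # the raid1-2dev.cfg recipe has put the OS on a data drive.
    lsblk -dnb -o NAME,SIZE,ROTA,MODEL | sort -k2 -n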
[09:03:40] yeah I *think* so [09:03:52] defined in modules/profile/data/profile/installserver/preseed.yaml afaik [09:04:39] cloudcephosd* are using this file definition: [09:04:42] modules/install_server/files/autoinstall/partman/raid1-2dev.cfg [09:04:44] ` 5 d-i>partman-auto/disk>--string>-/dev/sda /dev/sdb` [09:04:54] yep, just expects the names to be the right ones :/ [09:06:07] yeah, so I'm not sure if the best idea is to get dc-ops to go moving a bunch of cables, or a new partman receipe for these ones to use sda and sdi for the OS [09:06:33] it's not always sdi, it's sdj in one of them [09:09:41] we might have to do something like early_command or something [09:09:50] xd, too many somethings [09:11:58] I'm thinking that we might want to manually change these ones, and play with partman on the next batch [09:12:09] I'm sure we saw this issue with the wrong order before [09:12:38] I don't remember the details but I seem to remember that rebooting/reimaging helped [09:13:43] I think it happened at a later stage yes, as in, the reimage went ok, but then after rebooting the names were changed [09:14:09] maybe for the script that adds the volumes to the VMs, not sure [09:14:52] yep that's possible [09:15:14] I think that was while adding new hosts to the cluster [09:15:22] so no data was present on the hosts yet [09:15:36] what's the situation with this one? did any drive get formatted? [09:16:00] yep, all [09:16:16] (maybe cloudcephosd1036 only got one of the two formatted) [09:16:38] it's sort of tricky for the system disk though [09:16:51] yep, specially the bios boot partition [09:17:07] as in /dev/sdb2 is part of a raid1 that the OS is running from [09:17:34] a quick phab search led me to T308677 [09:17:36] not sure how you can remove it from that without borking the system [09:17:50] for that one I was thinking on creating the partitions on the sdi/sdj drive, adding it to the raid, waiting for the mirroring, and then removing the sdb one [09:18:00] which is open,needs triage :P but has some useful info [09:20:05] dcaro: that doesn't seem impossible, I guess might be worth a shot? [09:20:28] if they were consistent a new 'partman' receipe file for them would probably be better - so we could reimage in future without landing in the same state [09:20:38] which one did you see the small drive on sdj? [09:22:32] 1038 [09:22:34] https://www.irccloud.com/pastebin/qJbDHWIF/ [09:26:13] hmm... [09:26:26] they seem to be in the same physical location on both of these servers, yet have different names assigned [09:27:07] https://www.irccloud.com/pastebin/PjvzFuBu/ [09:27:18] https://www.irccloud.com/pastebin/69ivaS0s/ [09:28:53] definitely beyond my pay grade to know why though [09:31:25] my suspicion (but I could be completely wrong) is that on every boot the disks get more or less random names. probably not completely random, but I did see a disk getting a new name after a simple reboot (no reimage). [09:32:27] it shouldn't really, the udev rules afaik should mean the same name is assigned every time as long as the hardware layout hasn't changed [09:32:41] obviously you saw what you saw - not saying I don't believe you! [09:32:57] hehe I'm also confused! 
I seem to remember something changed in bookworm [09:33:08] using UUID is recommended to identify drives to prevent that - but I thought that was only when something else changed [09:33:18] but of course I'm not finding anymore the links I found last year [09:34:02] there is some info here [09:34:07] https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/managing_file_systems/assembly_overview-of-persistent-naming-attributes_managing-file-systems#assembly_overview-of-persistent-naming-attributes_managing-file-systems [09:35:00] I guess in theory things would be consistent - but there are lots of race-conditiony things there [09:38:13] the issue using uuid and similar is that you need to know them in advance, and keep track of them somehow (manually entering somewhere, and pulling those with partman somehow) [09:38:34] yeah it's not really an option for us here [09:39:13] all these systems appear to have the same hardware layout, but I can't say why we have the sdi / sdj difference [09:42:57] jbond found some ways to improve things for the swift cluster that might be applicable here: https://phabricator.wikimedia.org/T308677#8420185 [09:45:58] xd, I was trying to do something similar (a bit more hackish) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060402 [09:46:24] I don't think we really care about the order of the data drives, the important thing is that the small drives get assigned to the os, and that should be possible with the "partman_early_command" written by jbond [09:47:03] I don't see it being used anywhere though [09:47:31] that's interesting... in the thread they seem to say they reimaged a few hosts after that change... was it reverted later? [09:47:53] we could try pinging matthew [09:48:49] found some usage modules/install_server/files/autoinstall/common.cfg [09:49:22] how many ceph hosts are currently borked because of this? only one? [09:49:52] this is how it's installed... `wget -O /tmp/partman_early_command http://apt.wikimedia.org/autoinstall/scripts/partman_early_command.sh && sh /tmp/partman_early_command` [09:49:56] you can also see the partman script used on those here [09:49:56] modules/install_server/files/autoinstall/partman/custom/ms-be_simple.cfg [09:49:59] all the new ones, that's 4 [09:50:03] dcaro: ack [09:51:18] I'm no partman expert, is it loading the stuff from the files it generates? [09:51:43] * dhinus didn't know partman was turing complete :D [09:52:58] I'm also not sure how those files are loaded... but I'm sure there are some partman experts in -sre :) [09:54:08] I wonder how many hosts we have with this disk config, maybe not that many apart from ceph and swift? [09:54:30] perhaps being too optimistic... [09:54:46] but if we add "cloudcephosd*" to line 79 of the partman_early_command.sh file would that work? 
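(For context, the general shape of such an early_command helper — only a sketch of the approach being discussed, not the actual contents of partman_early_command.sh; it assumes the d-i busybox shell with /sys mounted and debconf-set available:)

    # Pick the two smallest sd* disks (/sys/block/*/size is in 512-byte sectors)
    # and hand them to partman instead of the hardcoded "/dev/sda /dev/sdb".
    smallest_two() {
        for dev in /sys/block/sd*; do
            echo "$(cat "$dev/size") /dev/${dev##*/}"
        done | sort -n | head -n 2 | sed 's/^[0-9]* //'
    }
    disks=$(smallest_two | tr '\n' ' ')
    debconf-set partman-auto/disk "$disks"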
[09:54:54] https://github.com/wikimedia/operations-puppet/blob/production/modules/install_server/files/autoinstall/scripts/partman_early_command.sh [09:54:57] topranks: I like your optimism :P [09:55:08] it'd run the "configure_cephosd_disks()" function for those hosts [09:55:20] which - to my untrained eye anyway - seems like it's trying to do what we need [09:56:04] right now that script executes and does nothing as the hostname doesn't match any its looking for [09:56:25] it's a shame john is no longer around to ask [09:56:45] I think he still lurks in irc though :D [09:57:16] [jbond] is away: food [09:57:57] lol yeah he pops up from time to time [09:58:16] oh that line for cephosd (not cloudcephosd) was added by btullis and he's around :P [09:58:20] https://github.com/wikimedia/operations-puppet/commit/9ac4733d4d9dc8c10b8ae8fb9e8daf3cfd5efaf3 [09:59:41] LOL I just found myself from the past giving me the answer: https://phabricator.wikimedia.org/T324670#8513627 [09:59:58] * btullis reading scrollback [10:00:19] btullis: tl;dr we probably need to do what you suggested 1 year ago in that task :) [10:00:28] "Would it help you if we added your recipe for the cloudcephsod* hosts to this script too?" [10:01:31] Yes, that's absolutely fine by me. [10:02:12] my firefox crashed [10:02:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060402 [10:02:50] ^moved the script from the partman recipe inside the shell script [10:03:12] not sure though how to use it, afaict it will be used already right? [10:04:09] dcaro: is there something about configure_cephosd_disks() that won't work for the cloud ones? [10:04:27] i.e. do they need a separate function in partman_early_command.sh ? [10:05:05] yep, we use the smallest drives, and none of those are spinning disks [10:05:57] though it seems they show up anyhow as 'rotational' [10:06:10] but our sizes are different too, we have osd drives that are <3T [10:06:14] (old nodes) [10:06:29] ah yes of course [10:07:09] we probably could reuse most of it and use parameters, though this way we are not changing anything on the other's path (less cross failures) [10:08:19] sgtm, I would maybe add a comment saying that in the file, in case anyone in the future wonders why there are 2 funcs [10:08:29] yeah I guess the other side of the coin is duplication of code [10:15:30] dhinus: added [10:16:55] gtg get some papers from the "municipal population centre", the ceph cluster is still rebalancing things, feel free to try the recipe out or comment and such [10:17:01] * dcaro off for a bit [12:32:17] btullis: your current partman function is checking for disks <3TB, do you think it would still work if we changed it to <1TB ? [12:32:32] because then //maybe// we could share the same function [12:33:03] dhinus: I wonder if both could maybe be done the way David was trying... [12:33:16] you mean the current patch or the earlier one? [12:33:39] I was afk for a while so unsure :) [12:33:47] he was sorting and finding the smallest two drives [12:33:52] fdisk -l --bytes | grep '^Disk.*/dev/sd' | awk '{print $5 " " $2}' | sort | tail -n 2 | grep -o '/dev/[[:alnum:]]\+' | tr '\n' ' ' [12:34:19] yep, I think that works but also has risks... one is will awk be available in that env? [12:35:31] if it is, I'm fine with using that but I still think we should just have 1 function eventually, if the aim is exactly the same [12:35:45] I agree yeah [12:35:46] dhinus: I'm happy to lower our check to < 1TB. [12:36:29] btullis: opinions on your approach vs fdisk -l? 
your one is already tested so maybe "if it works don't change it"? :P [12:42:29] the partition layout is different too [12:43:13] (/me reading the partman stuff), I think that even our current partitioning is not really optimal (we don't use /srv for anything) [12:44:32] good point, I don't know why we partition the raid in that way [12:46:23] if you feel like experimenting you can try using the same partition scheme used by cephosd... but I'm also fine with keeping our existing one with /srv for now [12:46:26] I think it comes from the standard.cfg [12:46:31] ah [12:46:50] cephosd also has srv, and does not use the whole space [12:46:56] (for what I'm reading) [12:47:37] Technically, I don't think we actually use /srv for anything on the cephosd servers either, but it's nice to have it in case we ever want to. [12:51:10] One thing that our script does is to exclude nvme drives, but looking at /sys. Not sure if the fdisk appetising would do the same. [12:51:37] Sorry, typing on phone with auto-correct. [12:52:09] s/appetising/approach/ [12:52:54] this must be google/apple AI deciding that given it's lunch time you must be writing about food :D [12:52:57] not sure we would want to skip nvmes [12:55:20] right now it would skip them because it greps for "Disk.*/dev/sd", and nvmes are /dev/nvm* [12:56:25] Oh yeah. [12:56:29] I don't have a strong preference between using "fdisk -l" or looping through "/sys", but I have a slight preference for having a shared function if possible :) [12:57:04] if only so we can more easily help each other when the function doesn't work :D [12:58:00] we will need nvmes at some point though [12:58:52] I just need something that works right now [13:01:33] reusing the cephosd function with the limit lowered form 3TB to 1TB would probably work, but you would end up with slightly different partitions. if you prefer to use your custom function I'm also ok with it and we can try to merge them later [13:01:38] *from [13:03:43] updated the patch, we don't need to change the limit right now, only if/when we want to reimage the old hosts, we are in a hurry to get the switch rebooted [13:03:51] ok! [13:04:42] +1d [13:05:46] fingers crossed it will work... [13:07:51] reimaging [13:07:56] fingers crossed too [13:10:14] we will have to reimage 1035 eventually too :/ [13:10:38] (the one we just added to the cluster) [13:10:56] yep :/ [13:11:20] didn't we have a check for the volume sizes in the cookbook that creates the osd volumes? [13:12:24] I think so yes, though probably the root drives on this new node are already big enough [13:12:28] I expected the cookbook to notice that the volume sizes were wrong, but maybe it only did when it reached sdi [13:12:51] hi. can someone with access please restart stashbot? it's gone for breakfast :) [13:13:35] sukhe: I'll restart it [13:13:45] thanks dhinus <3 [13:13:52] ah, no, we don't check the size, just that there's 8 in total, and 6 of them have no partitions [13:13:56] this is the best support channel ever :-] [13:14:36] dhinus: can you check on which node it's currently running? [13:14:52] it should have been restarted with the batch of workers reboot I did earlier [13:14:52] thanks again dhinus [13:15:22] it works now https://sal.toolforge.org/log/HQ_6LJEBKFqumxvtlfYt ! 
thank you ;) [13:16:05] dcaro: I already deleted it :/ [13:16:09] (the old pod) [13:16:10] okok, np [13:16:37] I'll review a few more workers with only a few D processes to make sure [13:16:37] https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools&from=now-30m&to=now [13:18:26] SAL seems to work now but on doing so we get: [13:18:30] > failed to log message to wiki. Somebody should check the error logs. [13:19:13] yep you can see the logs in https://sal.toolforge.org/ but there is some issues writing them to wiki [13:19:49] I pinged b.d808 in #wikimedia-cloud [13:24:32] I think it might be tools-elastic [13:24:56] hard to tell as there's no timestamps in the logs [13:26:13] aahh, I can see them in the pod logs [13:26:50] yep, no idea, just says there's error writing to wiki [13:27:49] the reimaging almost finished, running puppet now [13:28:17] Oh man, I'm so sorry that everyone spent all day working on partman :( A terrible way to spend a day [13:28:59] andrewbogott: :D [13:29:41] if it works, the solution was already there xd [13:30:33] I assume you already had the pleasure of googling for partman docs and discovering that the only partman docs in the world are on wikitech and written by us [13:31:17] kinda yes [13:35:24] * dhinus awk for ~ 20 mins [13:38:21] I manually cleaned the leftover partitions on sdb, but the host looks ok now [13:38:23] https://www.irccloud.com/pastebin/iVXrmqCx/ [13:38:56] I'll try adding it netx [13:39:18] I also have to tweak the weights to match the TB of the drive (otherwise they have the same weight as the smaller disks) [13:41:23] * andrewbogott lols at the os drives being sda and sdj [13:41:53] and sdi in some xd [13:44:12] There's some abstract reason why unpredictability is 'better' than predictability, right? I'm sure that pre-bullseye debian didn't do the surprise drive-swap thing [13:50:49] I guess there might be some race conditions on the drivers/kernel booting up, that when done serially is predictable, but in parallel might become unpredictable, and new stuff might prefer parallelizing something (just a guess) [13:51:38] andrewbogott: the reweighting change, I'll test in in a bit with 1037 [13:51:38] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060443 [13:54:05] hmpf... now ci is complaining about lots and lots of things I did not touch [13:54:15] some are not even part of the code [13:54:16] 15:48:49 ./.eggs/setuptools_scm-8.1.0-py3.10.egg/setuptools_scm/_run_cmd.py:55:5: E704 multiple statements on one line (def) [14:06:07] dcaro: does weight correspond directly to how much gets stored on a given osd? That's not obvious to me but maybe I don't understand the concept of weight [14:06:21] yep, that's the default [14:06:36] currently it's 1.7...., that is the 1.7TiB of the drives [14:06:57] (it fluctuated a bit, as some drives had a bit more/less than others, and we started copying it around instead) [14:07:14] OK, I re-read the docs and I think I follow now [14:08:46] oh, py312 was enabled on ci, that's why the tests fail, spicerack does not work on 3.12 [14:09:25] is that a thing you can fix or should I bug releng people? [14:10:02] I think I can't fix it, I can ask volans if there's a fix already [14:10:10] ok [14:10:23] volans is out this week [14:10:28] oh, not around [14:10:31] dhinus: do you know? 
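(On the weight question a few lines up: by convention an OSD's CRUSH weight equals its capacity in TiB, so a mis-weighted OSD can also be fixed by hand. Sketch only; the OSD id and value are placeholders:)

    # Compare current CRUSH weights against the actual drive sizes
    ceph osd df tree
    # Set a single OSD's weight to its capacity in TiB
    ceph osd crush reweight osd.123 1.746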
[14:10:37] let me check the latest changes [14:11:14] we are using latest, no luck there [14:11:37] dcaro: yep volans is on holiday for a few weeks I think [14:11:43] hashar: I see you just updated the ci image with py312 [14:11:56] that makes wmcs-cookbooks checks fail as spicerack des not support it, can we filter it out? [14:12:08] (/me will try now with tox, probably can configure there) [14:12:11] it will likely makes the prod cookbooks and the spicerack repo fail [14:12:15] *make [14:12:29] unless there is some tox config yeah [14:13:52] yep, I think prod might fail too [14:14:15] dhinus: do you remember what was the task? I can add it to the comments/commit [14:14:23] for py312 support I mean [14:16:54] yep I'll find it [14:17:02] prod already removed it from tox [14:17:03] 3 envlist = py{39,310,311}-{flake8,mypy,bandit,prospector,unit} [14:17:07] T354410 [14:17:08] T354410: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410 [14:17:23] thanks! [14:18:38] fix is out, let's see if it passes: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060453 [14:19:17] (by the amount of errors it showed, it's going to be fun moving to the newer prospector in py312) [14:22:27] LOL [14:23:04] https://www.irccloud.com/pastebin/lL31EVKn/ [14:23:11] * andrewbogott not really a fan of prospector lately [14:23:12] caught that one before it became an issue [14:23:21] looking [14:26:34] oh, pings seem to be failing randomly [14:26:37] topranks: ^ [14:27:24] dcaro: want to skip checkin as well or would it be useful to go over things? [14:27:51] I think the switch is misbehaving again [14:28:12] https://www.irccloud.com/pastebin/pDYhpx7N/ [14:28:25] https://www.irccloud.com/pastebin/rAUBPANz/ [14:28:31] previous run [14:28:34] * andrewbogott takes that for a yes [14:28:34] https://www.irccloud.com/pastebin/3pz17HUx/ [14:28:45] some fail, but not the same [14:28:47] dcaro: yes swithc is sick [14:28:59] no sorry - my bad one sec [14:29:09] I was checking cloudsw2-d5 (not cloudsw1) [14:29:19] https://www.irccloud.com/pastebin/kcuNlSBK/ [14:30:10] cloudsw1-d5 does not look to be doing what it was yesterday [14:31:03] https://usercontent.irccloud-cdn.com/file/mDAfKsbs/image.png [14:31:15] it seems that it has been dropping things the whole day [14:31:24] but there's peaks now [14:31:33] (maybe because I'm moving data around) [14:33:09] not only jumbos [14:33:10] https://www.irccloud.com/pastebin/EthIM3UG/ [14:36:06] it seems that 1037 is misbehaving more than others (1006 seems ok) [14:38:37] the link from 1037 to cloudsw1-f4-eqiad is flapping [14:39:21] yep, it seems 1037 (and 1038) are the ones having most of the issues [14:41:17] (good, ish) [14:44:08] yeah [14:44:11] switch seems ok [14:44:45] https://phabricator.wikimedia.org/P67243 [14:45:03] or at least 1030 does not exhibit the same issues 1037 and 1038 do (they are all in rack f4) [14:45:34] topranks: both servers that are misbehaving (and the only ones) are the 37 and 38 both hanging from F4 [14:46:18] maybe some configuration issue? [14:46:30] full-duplex misconfigurations used to do that iirc [14:48:28] found it [14:48:30] duplicated ip [14:48:31] https://www.irccloud.com/pastebin/by1vqnFI/ [14:48:34] in 1038 [14:50:05] removed the ip, let's see [14:50:36] works xd [14:50:46] * dcaro got scared [14:51:53] ah!! 
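(A duplicate address like this one can be confirmed from any other host on the same vlan with arping's duplicate-address-detection mode; the interface and IP below are placeholders, not the real ones from the incident:)

    # -D: duplicate address detection; a reply (non-zero exit) means another
    # machine is already answering for that address
    arping -D -c 3 -I eno12409np1 10.0.0.42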
[14:51:56] topranks: fixed now, sorry for the noise, thanks for the help :) [14:52:00] haha no problem [14:52:08] * topranks also got scared :) [15:01:02] o/ [15:01:47] dcaro: andrewbogott dhinus: indeed I have pushed a change earlier which adds python 3.10 3.11 and 3.12 to the CI images [15:02:06] and prospector is not compatible with python 3.12 (that has hit Xionox for netbox-extras repository) [15:02:23] easiest is to drop python 3.12 from `envlist` in your repo `tox.ini` [15:02:35] hashar: yep doing that rn, seems to work! [15:02:44] yep :) [15:02:44] yeah, I think we may have worked around things already... [15:02:55] but thank you for appearing! [15:03:12] excellent! [15:04:22] hashar: thanks for adding those versions btw, I actually needed 3.11 in a different project :) [15:06:36] \o/ [15:33:45] dcaro: fwiw at 10G there is no potential for duplex issues - as we use DAC cables which only support 10G/full [15:34:19] also fairly rare at 1G, but with those ports they support 10/100 also, and thus some half duplex modes and you can get a mis-match if autoneg fails [15:34:40] used to happen a little but thankfully all that is mostly in the past now [16:01:52] ack thanks [17:07:49] * dcaro clocking off [17:08:11] andrewbogott: you can use the cookbooks on the tip of https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060173 , I've tested them with codfw [17:09:01] you can filter now with --osd-id on all the drain/undrain/bootstrap_and_add/depool_and_destroy_node cookbooks, and the weights are being set correctly [17:10:00] ceph has not yet rebalanced cloudcephosd1010, there's osd 72, 73, 74 and 75 that need starting there too once the cluster is ok [17:10:19] let me know how it goes! [17:10:21] * dcaro off [19:40:25] ceph has been telling me that cloudcephosd1010 will be done rebalancing in 5 minutes for more than an hour :(
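(A closing note: the rebalancing ETA in ceph status is recomputed from the current recovery rate and tends to be optimistic. A sketch of checking progress and starting the remaining OSDs on cloudcephosd1010 once the cluster is healthy, assuming packaged ceph-osd@ systemd units rather than cephadm containers:)

    # Watch actual recovery/backfill progress rather than trusting the ETA
    ceph -s
    ceph osd pool stats | grep -A1 -i recover
    # Once the cluster is back to HEALTH_OK, start the remaining daemons on the host
    systemctl start ceph-osd@72 ceph-osd@73 ceph-osd@74 ceph-osd@75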