[00:14:21] andrewbogott: did you catch the goose? ;) [00:15:40] I went ahead and made T400305 to juice the quota for the zuul project. The flavor questions and related performance concerns are things that I think we can work out separately. [00:15:41] T400305: Large quota increase for zuul Cloud VPS project - https://phabricator.wikimedia.org/T400305 [00:15:50] no :( [00:17:36] jenna is engaged in https://xkcd.com/349/ -- most recently office IT told her to reimage her laptop but because everything is in corporate lock-down she can't read her email without connecting to the VPN but can't connect to the VPN without her laptop but can't log into the newly-reimaged laptop without her 2fa key which she can only get from her email which she can only get if she's logged into her VPN GOTO 10 [00:18:13] That sounds wildly annoying [00:18:18] and her IT department's latest idea was that she should go to an actual retail store and log into their wifi which (supposedly) would be inside the VPN. Which, somewhat to my relief, it was not :) [00:18:42] So we were wardriving for a while there [00:19:22] My job was to drive the car and say "Surely this is not the first time this has happened!" over and over [00:19:31] When Petr started at Slack they gave him a whole top of the line iPhone just for 2fa [00:20:06] like the new employee pack was 1 laptop, 1 on-call phone, and 1 2fa phone [00:20:20] effective, at least! [00:20:57] yeah, seems wild but $800 to never have that 2FA bootstrapping problem possibly worth it [00:22:02] The only thing I know I can specify for the Magnum instance volumes is the size in GB. I'll have to poke around a bit to see if there is another setting that can change things about the volumes that are requested. [00:22:26] more fun things to figure out [00:25:03] On codfw1dev there's a volume type called 'no_throttle' so that suggests that it's something we can set up but we haven't ever done it in eqiad. [00:25:22] And of course I don't know if that can be restricted to a project or a user [00:28:03] https://docs.openstack.org/magnum/latest/user/#docker-volume-type -- looks like there is a way to ask for a different type from the default for Magnum generally. [00:29:21] yep, so definitely possible if throughput turns out to be an issue [00:30:39] * bd808 goes looking for dinner [07:37:24] !log admin downgrading the codfw1 ceph mons to pacific, to do a rebuild instead of in-place upgrade to quincy [07:37:24] dcaro: Not expecting to hear !log here [12:08:41] dcaro: did ^^ work? [12:08:55] nope :/, I'm still fooling around with it [12:09:07] trying now to get the client working on 2004, before messing more with the rest [12:09:42] the mon process does not seem to start listening on the ports it should, it shows cephx errors [12:09:51] just tried disabling cephx, but same behavior [12:10:15] not sure where the mds config in the process comes from, as we don't have cephfs [12:10:15] root 159912 0.1 0.0 514984 33092 ? Sl 12:04 0:00 /usr/bin/python3.9 /usr/bin/ceph --in-file /etc/ceph/ceph.client.bootstrap-mds.keyring auth get-or-create-key client.bootstrap-mds mon allow profile bootstrap-mds [12:11:05] I've been trying the script at https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-mon/ under "Recovery Using OSDs" [12:11:42] One thing that I ran into is that somehow the mons got assigned the wrong IPs, so e.g. 2004 was trying to listen on .12 rather than .19. That was obvious at least. 
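For reference, the "Recovery Using OSDs" procedure linked above boils down to roughly the following. This is a condensed sketch of the upstream docs, not the exact commands used on codfw1dev; the mon name, paths, and keyring location are illustrative, and in the multi-host case the collected store gets rsynced from OSD host to OSD host before the rebuild.

    # run with the OSDs stopped; scrape the cluster maps out of every OSD on this host
    ms=/tmp/monstore
    mkdir -p "$ms"
    for osd in /var/lib/ceph/osd/ceph-*; do
      ceph-objectstore-tool --data-path "$osd" --no-mon-config \
        --op update-mon-db --mon-store-path "$ms"
    done
    # rebuild a monitor store from the collected maps, using a keyring that
    # holds the mon. and client.admin keys
    ceph-monstore-tool "$ms" rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring
    # back up the broken store.db and swap in the rebuilt one
    mv /var/lib/ceph/mon/ceph-cloudcephmon2004-dev/store.db{,.corrupted}
    cp -r "$ms"/store.db /var/lib/ceph/mon/ceph-cloudcephmon2004-dev/
    chown -R ceph:ceph /var/lib/ceph/mon/ceph-cloudcephmon2004-dev/store.db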
[12:13:50] this does not work :/ `ceph --admin-daemon ` maybe the ceph version [12:14:38] actually it did, just uses a different set/list of commands [12:15:41] I've only ever used the existing systemd unit for start/stop [12:16:02] should we move over to -ceph and see if Ben has any practice with this? [12:16:05] that's to be able to talk directly to the mon [12:16:07] maybe yep [12:21:01] o/ reading scrollback - I'm around if I can be of any help. [12:22:45] dcaro: sounds like you're seeing what I saw before I gave up yesterday, lots of log messages from cephx, and timeouts when you tried to actually interact with the ceph APIs in any way? [12:23:50] I think that the timeouts are the client trying to connect to the msgr2 port (3300) [12:23:58] and the mon process not starting up on that port [12:25:53] oh, puppet is enabled? I thought I had disabled it :/ [12:26:27] it hangs, right? So barely matters if it's enabled/disabled :( [12:27:08] but confuses me when seeing logs/processes and such [12:27:44] and it changes the ceph conf file [12:27:54] (and re-enables cephx, adds all the mons, ....) [12:29:03] btullis: the node we're talking about is cloudcephmon2004-dev.codfw.wmnet [12:30:00] Yep, I'm logged in. Following along.... sort of. [12:33:02] you can try running some commands with `ceph --admin-daemon /var/run/ceph/ceph-mon.cloudcephmon2004-dev.asok ...` [12:35:23] dcaro: do you think that the issues we're seeing now are happening before we get to the "mons don't know anything and db is corrupt" stage, or after? [12:35:40] It would be nice if it was 'after' because that would suggest that the db state situation is not completely terrible [12:36:17] I disabled cephx and now the mon seems to have come up all the way [12:37:19] that's encouraging! [12:37:21] andrewbogott: [12:37:44] I think they came after, the first issue was that the quincy mons don't talk leveldb, that for some reason was what the mons had [12:38:08] I see that `sudo ceph auth list` still shows the `osd.n` keys, but no longer contains any of the keys for the other components. [12:38:09] (though we have reimaged them, so those dbs should have been rewritten from scratch and using the latest rocksdb since luminous afaik) [12:38:55] maybe the leveldb came from my rebuild attempts yesterday? But the db I built was generated on the osd nodes which are running Q [12:38:57] there's some confusing output on the status too [12:39:50] ` pools: 0 pools, 0 pgs` and `all OSDs are running quincy or later but require_osd_release < quincy` [12:40:02] `1 monitors have not enabled msgr2` [12:40:12] (that's the port 3300 issue I was talking about) [12:40:38] let me try to explicitly enable msgr2 (though it should pick it up itself) [12:41:31] it gets stuck on [12:41:34] `Jul 24 12:41:15 cloudcephmon2004-dev ceph-mon[166472]: 2025-07-24T12:41:15.880+0000 7fdcc9e1e700 1 mon.cloudcephmon2004-dev@0(leader).auth v104 client did not provide supported auth type` [12:42:45] Is it just that the keyrings are scrambled/missing? [12:43:15] yep, like btullis said the mon keyrings seem to be missing [12:43:22] we can try adding them now [12:45:54] it looked to me like it was the 'unless' clause in puppet that was hanging, so one of the various silly things I did was set the puppet timeout to 10 hours and let it actually finish running. That should've created all the puppet-managed keyrings assuming I wasn't backwards about the 'unless' logic [12:46:01] but yeah, let's go through manually and check.
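The admin socket mentioned above is useful here because it talks to the daemon directly and works even when the mon is out of quorum. A few of the standard socket commands, using the same socket path as above:

    ceph --admin-daemon /var/run/ceph/ceph-mon.cloudcephmon2004-dev.asok help           # list what this daemon accepts
    ceph --admin-daemon /var/run/ceph/ceph-mon.cloudcephmon2004-dev.asok mon_status     # state, rank, and the monmap as this mon sees it
    ceph --admin-daemon /var/run/ceph/ceph-mon.cloudcephmon2004-dev.asok quorum_status
    ceph --admin-daemon /var/run/ceph/ceph-mon.cloudcephmon2004-dev.asok config show
    # `ceph daemon mon.cloudcephmon2004-dev <command>` is shorthand for the same thing, run on the mon host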
[12:46:37] xd, created the mgr, the one I need now is mon [12:46:51] I wondered if they weren't showing in the `ceph auth list` output because cephx was disabled in the `/etc/ceph/ceph.conf` file. [12:46:52] it might work now actually [12:47:22] let me try to run puppet [12:48:27] oh, but that will re-enable cephx :/, and then fail... [12:48:30] hmpf.... okok [12:48:34] manually it is [12:54:01] oh, puppet was able to run some of the key imports [12:54:30] let me try to re-activate cephx [12:54:36] nice! that's a lot more than I ever got [12:55:03] oh, I think that it created new keys, the clients are now failing xd [12:55:22] `Jul 24 12:54:49 cloudcephmon2004-dev ceph-mon[171067]: 2025-07-24T12:54:49.760+0000 7f51d053a700 0 cephx server client.codfw1dev-compute: unexpected key: req.key= expected_key=` [12:55:50] it's not generating new keys, right? Just pulling them from private puppet? [12:56:10] This is a bit like the chicken->egg situation I was discussing recently, isn't it? The keys in puppet don't match the keys created by the execs that puppet runs. i.e. puppet isn't the source of truth. [12:56:37] yeah, it's similar [12:56:55] hmm... it's still not starting on v2 port 3300 [12:57:23] dcaro: what keys are getting used by the mons other than /var/lib/ceph/mon/ceph-cloudcephmon10040-dev/keyring? [12:57:25] btullis: a bit yes, it kinda created the keys from scratch instead of importing them [12:57:46] andrewbogott: I created that one manually, it's the same as /tmp/ceph.mon.keyring [12:58:06] right, that one should be the simple one, it's just identical across all nodes [12:58:26] but you said "oh, I think that it created new keys" -- I'm wondering which keys you mean [13:00:27] the client ones, for the pools [13:00:35] I have some meetings starting now, so I will be less available for a bit, but I will check back asap. [13:00:42] thx btullis [13:00:43] `client.codfw1dev-compute` from the log above [13:00:58] we'll keep you in the loop :) [13:01:02] ta [13:02:19] so with the keys it was able to load the pools [13:02:20] pools: 11 pools, 481 pgs [13:03:28] that sounds like progress! [13:05:51] hmm... I was able to enable cephx, by importing the admin key, starting the mon process, and then changing ceph.conf to comment out the v2 so the client will use v1 [13:05:59] though the server still does not start on v2 [13:06:21] ceph config dump shows nothing [13:06:43] the mon logs look ok now though [13:07:54] I don't think I understand the v1 v2 distinction but you should carry on without me for now :) [13:08:17] 🤦‍♂️ [13:08:20] `root@cloudcephmon2004-dev:~# ceph mon enable-msgr2` [13:08:30] no idea where that is registered [13:08:56] v1 protocol for mons is the old one, v2 is the new one (since a couple versions back, and the default now) [13:09:02] ah, ok [13:09:13] I'm a bit confused on why it got disabled, I think it might be related to the config being empty too [13:09:18] `ceph config dump` [13:09:42] `ceph ...` should work now from any ceph node [13:10:49] I'll write some of this on the task too for the record [13:10:54] does that mean we can let puppet update keys on the other mons? [13:16:17] yep, let me try [13:17:29] we have to set all the config options again too [13:19:19] puppet ran ok [13:20:30] running puppet on mon 2005, let's see if I can get that one to join, might need to reset the monstore [13:21:06] 'ceph osd tree' returns something! Although it thinks that lots of osds are down
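On the open question above about where `ceph mon enable-msgr2` is registered: assuming stock behaviour, it is recorded in the monmap (each mon gets a v2 address on 3300 alongside its v1 address on 6789), not in the config database, so it can be checked with a monmap dump rather than `ceph config dump`:

    ceph mon enable-msgr2
    ceph mon dump
    # each mon should now list both addresses, e.g. (addresses illustrative):
    #   0: [v2:10.192.x.y:3300/0,v1:10.192.x.y:6789/0] mon.cloudcephmon2004-dev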
[13:21:34] let's get the mgr up [13:21:37] * andrewbogott really hoping the data still exists [13:21:39] all the osds are down xd [13:21:45] (service stopped) [13:21:50] oh, then 'osd tree' is correct :) [13:21:53] I guess I probably stopped them [13:23:06] okok, I had to manually import the mgr key, but it seems to be starting up [13:23:19] mgr: cloudcephmon2005-dev(active, since 17s) [13:23:42] what does 'manually import the mgr key' look like? [13:24:05] root@cloudcephmon2005-dev:~# ceph auth import -i /var/lib/ceph/mgr/ceph-cloudcephmon2005-dev/keyring [13:25:08] puppet doesn't do that? [13:25:44] okok, I've started osd.0, and set the cluster to norebalance while the rest are down, it seems to be connecting [13:26:02] let's bring the rest up (did not need to re-import the key) [13:26:59] ` osd: 32 osds: 16 up (since 13s), 31 in (since 92s); 44 remapped pgs` [13:27:02] coming up [13:27:40] * andrewbogott checks to see what Nova thinks of all this... it seems oblivious [13:29:19] okok, osds 2004 and 2007 up, will start 2005 and 2006 [13:29:45] are you just starting the osd service/letting puppet start them? Or doing something more manual? [13:30:17] okok, osd: 32 osds: 32 up (since 1.48011s), 32 in (since 36s); 9 remapped pgs [13:30:20] starting service [13:30:23] manually [13:30:48] just systemctl start [13:30:53] yep [13:30:55] ceph-osd\* [13:31:06] (or ceph-osd.target, depends on how I'm feeling xd) [13:31:27] .target is a meta-unit that starts all the osds at once? [13:31:36] all osds are up, no data loss it says, openstack should be able to use the drives [13:31:44] andrewbogott: yep, it's kinda handy :) [13:31:50] also for stopping and such [13:31:54] status does not work on it though [13:32:03] I was trying to use it but I think on a mon node that didn't know the actual service name yet [13:32:49] should we turn off norebalance? [13:33:04] yep, should be ok [13:33:36] it only complains about the crashes [13:33:40] https://www.irccloud.com/pastebin/feDoSPUs/ [13:33:43] so that's good [13:33:47] I'm going to restart all the nova-compute services and see if that gets VMs moving [13:33:51] ack [13:34:38] health_ok \o/ [13:35:01] we still have to add the other two mons though xd [13:35:12] but let's see if this is ok first [13:36:03] nova-compute seems pretty happy but I definitely can't ssh into things yet [13:37:24] I restarted a VM, it's doing fsck... [13:38:28] seems to have survived. rebooting bastions too... [13:39:22] yep, I can ssh now! [13:39:30] So at least some things have survived the deluge [13:39:58] I will write a script to hard reboot every VM in codfw1dev [13:40:22] can you talk me through adding a mon so that I learn at least part of what you did? [13:41:59] \o/ [13:42:38] yep, I wrote some of it in the task, the key I think was to disable cephx, so I could import/create the right keys, but the monstore fix you had done already [13:43:18] the only monstore fix I know I did was unscrambling the listener IPs, and I definitely don't understand how they got scrambled originally [13:44:04] did you copy over the monstore from somewhere or something? [13:44:18] Yeah, there's a command to dump it [13:44:20] I see a mkfs command in the history [13:44:34] and then I removed and replaced the broken entries and applied [13:44:50] so I definitely did things, I thought you were talking about the actual contents of the monstore itself
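The mkfs in the shell history is presumably `ceph-mon --mkfs`; for the record, (re)initialising a mon data dir normally looks roughly like this. This is a sketch based on the upstream manual-deployment docs with an illustrative hostname, not the exact commands run here, and depending on the monmap state the old entry may also need removing or re-adding with `ceph mon add`:

    systemctl stop ceph-mon@cloudcephmon2006-dev
    mv /var/lib/ceph/mon/ceph-cloudcephmon2006-dev{,.bak}     # keep the old store around
    ceph mon getmap -o /tmp/monmap                            # fetch the current monmap from the quorum
    ceph-mon --mkfs -i cloudcephmon2006-dev \
        --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring  # the shared mon. keyring
    chown -R ceph:ceph /var/lib/ceph/mon/ceph-cloudcephmon2006-dev
    systemctl start ceph-mon@cloudcephmon2006-dev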
[13:44:58] What task are you logging this on? [13:45:13] https://phabricator.wikimedia.org/T400334 [13:45:21] T400334 [13:45:22] T400334: [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334 [13:45:45] great [13:46:02] ok, I'll see if I can revive 2006-dev, and when I can't I'll ping [13:46:58] xd, okok, I might have scrambled that one a bit extra, this morning I tried adding it by copying over the monstore from 2004 (that I thought was running as the service reported up, but it was only "half" running) [13:47:11] there should be a backup dir [13:47:22] as far as I know the monstore itself is the same across all three? [13:47:35] I think so yes [13:48:08] oh, there's only one mon registered too though, in the config in 2004, that should change too [13:48:33] in ceph.conf? Surely puppet has reset that now [13:48:46] btw. do you know if the /var/lib/ceph/* dirs get recreated on reimage? I'm not understanding how the keystore was able to continue in the old format [13:49:19] I think it does, but if it doesn't that would explain some things. [13:49:31] there's an easy way to find out! [13:49:36] xd [13:52:13] how did you get past this 'Start request repeated too quickly' that seems to happen basically any time I restart a mon service? [13:53:11] systemctl reset-failed && systemctl start [13:53:30] thanks [13:54:10] it's kinda annoying that it does not tell you when you try to start it how to actually force it [13:56:07] hmm... nfs is enabled on codfw1, is that something we did? [13:56:15] (as in, did we enable nfs on it?) [13:56:26] not that I know of [13:56:26] for testing at any point in the past [13:56:55] oh, the rados gw are not connected to the cluster [13:58:46] I'm sure the rados agents on cloudcontrols need a restart at the very least [14:01:06] so we're still running quincy everywhere, right? You didn't actually downgrade? [14:02:56] -1 unable to read magic from mon data [14:04:10] yep, that's how I broke it xd [14:04:34] I did not downgrade no [14:05:28] ok, so should I wipe out store.db entirely and let it regenerate? [14:06:37] yep, I think that'd be the best right now that we have one mon already working [14:06:57] oh dang it I typed in the wrong terminal, may have broken 2004-dev again. Argh [14:07:02] well, I reverted, will restart and see [14:07:07] btw. I'm installing ceph-mgr-dashboard, that seems missing on some nodes, it might have been split into a different package, I'll add it to puppet [14:07:17] ooops [14:08:00] grrrrrr [14:08:18] I should've just stayed in bed all week [14:08:39] xd [14:08:40] np [14:08:52] I guess I'm back to the cephx errors [14:09:04] good time to retest the process xd [14:09:06] I'm going to step back and let you revive that one (and close my terminal there entirely) [14:09:24] and meanwhile do what I was trying to do on 2006-dev actually on 2006-dev [14:09:46] so mostly what I did was briefly move store.db out of the way, and then realize my mistake, and move it back. [14:09:56] Which shouldn't have broken it forever but seems to have [14:10:10] it seems broken in a different way though, it's currently listening on 3300 [14:10:45] it's setting itself as leader, but failing to assign global_id [14:11:01] (something I disabled in conf as it's unsafe for running clusters) [14:11:12] that's probably from ceph-mon -i cloudcephmon2004-dev --inject-monmap ~andrew/monmap.txt [14:11:23] which I thought was a no-op but maybe you're using a new monmap.txt?
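For reference, the monmap surgery being discussed here (extract, inspect, edit, re-inject) goes roughly like this; the mon has to be stopped for the extract/inject steps, and the mon id is just the one under discussion:

    systemctl stop ceph-mon@cloudcephmon2004-dev
    ceph-mon -i cloudcephmon2004-dev --extract-monmap /tmp/monmap
    monmaptool --print /tmp/monmap      # shows the fsid and each mon's v1/v2 addresses
    # fix entries with `monmaptool --rm <name> ...` / `monmaptool --addv <name> <addrs> ...`
    ceph-mon -i cloudcephmon2004-dev --inject-monmap /tmp/monmap
    systemctl start ceph-mon@cloudcephmon2004-dev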
[14:13:08] I did not change it, but I changed the members when restarting the mon to be just one [14:13:22] so it might have changed it by itself on restart [14:13:30] I'm disabling puppet there, let's see.... [14:14:52] okok, let me try disabling cephx, to be able to unset the global_id ban [14:15:44] nope, still failing [14:17:03] hmm... it's not what I thought [14:17:12] https://www.irccloud.com/pastebin/DuSptDbm/ [14:17:17] it's allowed already [14:19:02] andrewbogott: unrelated to ceph, I think I figured out the cinder mystery from yesterday https://phabricator.wikimedia.org/T400285#11031197 [14:20:09] andrewbogott: as I'm in a cleanup mood, I also noticed a couple of vms in the linkwatcher project abogott-temp-test and abogott-temp-test2, I assume they can be deleted [14:20:10] I find everything about how snapshots work in cinder confusing and upsetting. I kind of wish it would just make a dang copy and be done. [14:20:21] yep, go ahead and delete those, thanks [14:20:26] I was surprised because they were created quite recently 2025-05-20 [14:20:36] do you remember what they were about? [14:22:07] I think I found it: T394790 [14:22:08] T394790: Failures when draining certain VMs with attached cinder volumes (coibot-2) - https://phabricator.wikimedia.org/T394790 [14:22:14] I don't at the moment. I was responding to some service request from the project but would have to dig... [14:22:19] yeah, that's probably it :) [14:22:35] all clear, I'll delete them [14:32:11] dcaro: I'm so sorry that I undid all your work in 20 seconds :( [14:33:37] xd [14:33:43] I think I'm getting it back [14:33:52] now it's complaining about not having the keys again [14:33:56] so that's something [14:34:18] I definitely didn't delete any keys! [14:34:21] so far I created a new monmap just with 2004, stopped the mon in 2004, imported the monmap, and started it [14:34:32] I think it might lose it when it loses the quorum [14:36:10] okok, mon is up and running again [14:36:19] all osds might have crashed though [14:36:34] and all the ceph config is lost [14:36:49] (including the msgr2 protocol thingie, I'll rerun the config commands) [14:37:05] I'm going to keep my hands off the mons but I'll see about getting osds back up [14:37:35] actually osd tree shows them as up still [14:38:04] osds are coming up [14:38:23] they are in unknown state though, what do the logs say? [14:39:57] Jul 24 14:39:49 cloudcephosd2004-dev ceph-osd[485999]: 2025-07-24T14:39:49.796+0000 7faec0cb0700 0 auth: could not find secret_id=20 [14:40:28] * andrewbogott tries restarting things on osd2004-dev [14:40:44] I've set to norebalance, and minversion quincy [14:40:56] yeah, still could not find secret_id=40622 [14:41:09] ceph auth ls looks ok I think [14:41:17] let's cross check one of the osd keyrings [14:41:31] does that mean that possibly the mon is up but with the wrong keys? [14:42:09] osd.10 has a key that starts 'AQA' [14:42:15] it's there [14:42:32] (all start with AQA xd, but I got the full one) [14:42:59] oh, oops [14:43:11] is there a way to enumerate the keys by ID and see what 'secret_id=20' really is? [14:43:20] I mean, not what the key is but which key it is
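The keyring cross-check mentioned a few lines up is essentially comparing what the mon's auth database holds against what the OSD presents from disk; a minimal sketch, using osd.10 as above:

    ceph auth get osd.10                      # what the mon thinks the key and caps are
    cat /var/lib/ceph/osd/ceph-10/keyring     # what the OSD actually has on disk
    # if they differ, one option is to re-import the on-disk key (with caps added),
    # which is what gets tried further down:
    #   ceph auth import -i /path/to/fixed.keyring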
[14:46:14] I think that refers to a temporary token the client gets with the key [14:46:23] https://tracker.ceph.com/issues/4282 [14:47:43] or maybe not [14:47:54] Jul 24 14:47:27 cloudcephmon2004-dev ceph-mon[195018]: 2025-07-24T14:47:27.423+0000 7f603d540700 0 cephx server mgr.cloudcephmon2005-dev: unexpected key: req.key= expected_key= [14:48:42] can I get a quick +1 to https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/258 ? [14:49:11] (not strictly required but for maintaining good habits!) [14:49:12] okok, I re-imported the mgr key and it started ok [14:50:58] I'm running puppet to import keys and stuff [14:51:37] hm, I did ceph auth list on the mon and the osd and they're identical [14:52:18] osd is still saying 'auth: could not find secret_id=20' for now [14:52:45] puppet also started the mgr on 2004, it came up ok [14:53:22] let me try to stop and start one of the osds (osd.6) [14:53:46] same [14:57:10] the mon says they're up, despite all the log complaints [15:01:10] I'm stopping all the osds [15:20:14] andrewbogott: the db format of the mon nodes [15:20:17] https://www.irccloud.com/pastebin/QiHAYltH/ [15:20:40] eqiad should be ok [15:20:45] great [15:23:42] taavi: I was checking on the beta cluster cherry-picks today and was reminded that your fix for customizing the noc@ contact string is still unmerged -- https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143602 [15:24:16] I assume you were looking for a +1 from some other folks before merging. [15:25:07] bd808: yeah, I'm not self-+2ing cdn config changes without approval from traffic [15:36:16] taavi: I can't imagine why ;) [15:36:33] *why not [15:38:01] dcaro: seems like the OSDs have stopped saying 'could not find secret' [15:38:09] So the solution was to wait and do nothing [15:38:17] yep, some, though they still don't show up [15:38:21] (as ok I mean) [15:38:30] dcaro: I will complete the project creation for project "voterlists" as I've already started it... can I get a +1? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/258 [15:39:28] did the cookbook run twice? [15:39:39] I tried splitting it into eqiad and codrw [15:39:41] *codfw [15:39:44] because the first run failed [15:39:51] I think because ceph in codfw is down? [15:40:21] ahh, yep, maybe, we should make sure codfw down does not block eqiad :/ [15:40:41] +1d [15:40:45] thanks [15:41:52] hmpf: "Exception: You can only run 'apply' for all clusters, i.e: don't specify --cluster_name" [15:42:17] luckily it starts with eqiad if I don't specify a cluster name :) [15:43:01] dcaro: where can I see osds not showing up? [15:43:21] if you do ceph status the pgs are all unknown [15:43:40] oh, so I see [15:43:45] despite 'osd tree' showing everything up [15:43:51] it seems that the mon is failing to register them [15:43:54] want me to restart osd services again? [15:43:54] (the slow ops) [15:44:02] root@cloudcephosd2004-dev:~# ceph daemon mon.cloudcephmon2004-dev ops [15:44:06] will show the specific ones [15:45:08] hm yeah, 12 is still complaining about a missing secret. [15:45:12] I guess we can just keep waiting! [15:46:31] stopping and starting the mon gets rid of the slow ops temporarily (as it flushes/cancels), but did that twice already, so there's something else getting them stuck [15:49:52] do you have time to get another mon or two online before you go for the day? I'm kind of afraid to try it now
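A hedged guess at what the db-format check behind that earlier pastebin looks like: each mon records its key/value backend in a small file in its data dir, and quincy mons can no longer open leveldb stores, which is presumably why eqiad looks ok:

    cat /var/lib/ceph/mon/ceph-*/kv_backend   # should say "rocksdb"; a leveldb store won't open under quincy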
[15:54:45] I think we should try to get the cluster online first [15:57:44] ok [15:58:40] remember that this is test/dev so if you need to go you should go [15:59:30] the log message on 12 just changed to 'monclient: handle_auth_request no AuthAuthorizeHandler found for auth method 2' [15:59:38] I disabled the cephx in the mon and osd2004, and it did not help :/ [15:59:44] I'll re-enable cephx [16:01:04] yep, now it's saying what it was saying before [16:05:07] andrewbogott: I'm getting a bit stuck now on what to try :/, maybe rebuilding that mon database like you did the first time might help, not sure what's getting that mon stuck on registering new osds [16:05:44] can we try enabling puppet on cloudcephmon2004-dev first and let it do its thing with keyrings? [16:06:00] did that already, it also reverts the config (did that too), and starts the mon [16:06:09] ah ok [16:06:12] we can try again if you want [16:06:16] *shrug* [16:06:31] at some point you thought this was about waiting for things to expire... and some OSDs /did/ stop complaining after a wait [16:06:32] (even if it's just to avoid puppet disabled alerts xd) [16:06:53] yep, but they still remained stuck registering in the mon [16:07:28] https://www.irccloud.com/pastebin/KazXJyn4/ [16:08:31] let me try to unset the norebalance [16:10:11] is it really just those two that won't register? [16:10:51] I think noout implies norebalance, can we unset that? [16:12:11] done [16:12:28] it seems to have helped with some of the ops, but still not registering the osds correctly [16:12:57] yeah, still shows 100% unknown [16:14:04] it might be auth related, I'm finding some people that had issues with similar output, though most just rebuilt the cluster [16:14:26] did you already try 'ceph osd out' and 'ceph osd in'? [16:14:31] on the affected osds? [16:14:41] I mean, seems unlikely to do much unless that resets the auth [16:17:20] I did not [16:21:58] it accomplished nothing [16:23:19] ceph daemon mon.cloudcephmon2004-dev ops | grep boot suggests I have made things worse [16:24:30] xd [16:25:14] I tried manually importing the osd keyring for osd.6, but it did not help either (copied the /var/lib/ceph/osd/../keyring to /root/test.keyring, added the capabilities, and ran the ceph auth import -i ./test.keyring) [16:25:47] oooo [16:25:52] https://www.irccloud.com/pastebin/csXvdHwI/ [16:26:06] did you 'ceph auth del' first? [16:26:15] https://www.irccloud.com/pastebin/GLi9zgli/ [16:26:32] maybe the fsid of the cluster changed? [16:27:01] ah no [16:27:03] wrong file [16:27:06] https://www.irccloud.com/pastebin/4ndy76Nm/ [16:27:28] andrewbogott: no, I did not remove it, I can try [16:27:41] try remove and import [16:28:10] but also I'm not positive that 6 is the problem, it could be that 6 is complaining about other osd keys... [16:28:36] I don't know how to find secret_id= [16:28:49] did not help [16:29:20] we can try to create a new keyring, force it to use a new auth [16:30:04] are the osd keys puppetized or only dynamic? Dynamic, right? [16:32:05] dynamic I think [16:32:07] it did not help [16:34:27] I think I'm going to call it a day for now, I'll come back if I think of something, but I think I need some distance/fresh air [16:34:52] hmmm... what did I do before that now is different... hmm... [16:34:52] xd [16:35:03] we'll get it eventually. Thank you for working on this!
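"Create a new keyring, force it to use a new auth" from above would look something like the following for osd.6. This is a sketch of the standard key-rotation steps, not something that was actually run here:

    systemctl stop ceph-osd@6
    ceph auth del osd.6                       # drop the mon's copy of the old key
    ceph auth get-or-create osd.6 \
        mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' \
        -o /var/lib/ceph/osd/ceph-6/keyring   # mint a fresh key and write it where the OSD reads it
    chown ceph:ceph /var/lib/ceph/osd/ceph-6/keyring
    systemctl start ceph-osd@6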
[16:37:56] hello, WMCS folks o/ while dealing with some cleanups related to [0], I noticed there are a number of cases where `bullseye-backports` is still referenced in `Dockerfile`s for various toolforge infra [1]. [16:37:56] just wanted to flag that with you all, since those image builds are likely to fail now that `bullseye-backports` has been archived (i.e., at the first `apt update`). [16:37:56] [0] https://phabricator.wikimedia.org/T383557 [16:37:56] [1] https://codesearch.wmcloud.org/search/?q=bullseye-backports [16:48:20] thank for the ping swfrench-wmf! we already fixed a few, and it looks like there are not many left... we'll probably fix them as needed when we have to run one of those builds [16:48:23] *thanks [16:49:12] dhinus: ah, great - good to hear it's already on your radar :) [17:02:04] * dhinus off [17:28:29] * dcaro off [20:42:11] dcaro: I think that the ceph cluster in codfw1dev is 100% back up now. Status looks right and I can ssh to things again. We will see if this is still true when I get back! [20:58:51] Did you do anything to it? [21:46:38] I did very many things [21:46:53] dumped the state from the osds again and started the mon over again from scratch [21:46:56] among other things