[07:40:47] jynus: o/
[07:40:50] morning :)
[07:41:05] hi
[07:41:26] I didn't follow https://phabricator.wikimedia.org/T377853, did you manage to talk with Dcops for the missing BBUs? If not I can try to reach out to our Supermicro rep and ask more info
[07:42:52] I didn't get any answer back
[07:45:44] okok I'll try to collect info and send an email to Supermicro, maybe I can also add the issue that we are seeing with ms-be20xx (12 jbod hds set up, 12 in unconfigured good)
[07:49:50] elukey: from my POV, I don't care about the BBU thing, but I do very much care about 24 disks :)
[07:52:56] Emperor: we all care about jynus, so we also care about BBUs! :D
[07:53:12] jokes aside, got your point :)
[07:53:43] The last that I've heard from DCops was that they were working on it, but I didn't get any info on IRC/phab
[08:40:46] 'BBU': {'BBUStatus': 'Not '
[08:40:49] 'Install',
[08:40:52] 'Status': {'Health': 'OK',
[08:40:55] 'State': 'Enabled'}},
[08:41:01] this is from ms-be2083's redfish (same model as backup1012)
[08:43:51] elukey: do you know if monitoring works there with the older user space, or the title of T377853 still stands?
[08:43:52] T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853
[08:51:14] jynus: nope no idea, the only thing that I know is what you wrote in https://phabricator.wikimedia.org/T377853#10253136
[08:51:44] qq - does the absence of bbus cause backup1012 to be ineligible for production?
[08:52:00] or would it be acceptable? I am asking to understand what to write to supermicro
[08:52:20] for media backups? probably not, but it will be much slower than the other hosts
[08:52:33] that from my point of view is consumer-grade hardware
[08:52:44] it is just a card I can buy at target to do a raid
[08:53:06] technically all are cards, but the ones we use for performance have a BBU because they need it for their cache
[08:53:23] dbs would not survive without one
[08:53:49] yes I know what a bbu is, I was just asking if you need it before backup1012 goes into production or not :)
[08:53:58] since you are the sme in this case
[08:54:25] elukey, we are at 94% disk utilization, we will use what we have
[08:55:05] but for DC ops, that is not the same configuration we used to have, it is way slower
[08:55:41] the whole point of standard configs is that we have more or less the same hardware
[08:56:17] sure but these are the first hosts, if we just need to buy an extra bbu from them and include it in the standard specs it doesn't seem the end of the world
[08:56:34] somewhere along the process this may have fallen through the cracks
[08:56:40] I don't think it works like that
[08:56:56] I think that is what you may not be understanding, I don't think that is possible
[08:58:04] hopefully it is, and I was asking the question of "is it ok to keep it for production?" to understand if we needed a replacement or not
[08:58:06] if we put a line between Software raid, and "a proper HW raid solution"
[08:58:21] this is closer to software raid than the second
[08:58:42] it is just a card with a chip
[08:59:01] https://www.broadcom.com/products/storage/raid-on-chip/sas-3908
[08:59:30] that's a raid solution IMHO just on paper
[09:00:27] note it is on a different section than Raid controllers: https://www.broadcom.com/products/storage/raid-controllers/megaraid-9660-16i
[09:01:11] ok, but let's try to find an agreement for the purpose of the next emails to supermicro
[09:01:46] is backup1012 ok for production? It seems that you said yes but slower, so mild yes (since we need space).
[09:02:01] Another question is if we need to change that config-j's spec for the future
[09:02:02] the problem I see is that
[09:02:12] for me, absolutely yes
[09:02:12] to be clear - I am on your side :D
[09:02:32] but the thing is, if they do the same to databases, those would be unusable
[09:02:36] I am trying to make your life easier expediting the resolution of the issue, I am not vouching for supermicro
[09:02:48] okok this is a good info to dell to DCops
[09:02:51] *tell
[09:02:55] at least some of these cards have a supercap on, IIRC
[09:04:27] but the whole point is the lack of cache
[09:04:45] you need a bbu because you have a cache
[09:05:08] this card doesn't have a cache
[09:06:43] "Cost effective, single-chip solution for entry-level RAID storage systems supporting up to 8 direct-attach drives"
[09:06:43] okok so we didn't check this bit when choosing config j
[09:07:07] the thing is I cannot tell you what to tell the vendor
[09:07:23] because maybe "we" chose that without knowing
[09:07:42] or maybe they suggested and assured us it is "essentially the same" (it is not)
[09:08:59] personally, between this and a kernel solution, I would prefer the second because at least it is not proprietary
[09:09:00] understood, I'll follow up with Willy
[09:10:04] so just show raid-on-chip vs RAID Controller Cards doc I sent you to explain to dc ops
[09:10:17] *web page
[09:10:43] but if other specs have started using this (and some may not need it, that's ok)
[09:10:56] it may not work for them (e.g. dbs)
[09:11:53] for example, maybe if swift does mainly jbod maybe it is not a big deal
[09:12:13] (I don't know, for Emperor to say)
[09:13:33] for ms-be should be ok yes
[09:13:52] on that side the main issue is setting all 24 disks to jbod
[09:15:52] I worry those issues may come from a similar source (ROC solution)
[09:16:08] I saw a query on the ticket whether the internal cabling was done correctly, but obviously IHNI how it's meant to be cabled.
[09:16:37] at my previous job we used SuperMicro disk system (4U holding 60 drives), and they could do all-JBOD no problem.
[09:17:20] yes, the issue is not SM
[09:17:41] I am sure they offer the more expensive option too :-D
[09:18:21] well for the jbod disks it may be SM, I'd expect everything to work (modulo cabling done correctly)
[09:20:25] I looked the server up on their website the other day, and it mentioned JBOD, so I can't see this isn't a supported behaviour
[09:25:22] I am trying to update the firmware for SAS3908 on ms-be2083
[09:25:35] good luck :)
[09:25:59] those nodes already had a super old BMC firmware
[09:26:23] I don't have a lot of hopes but..
[09:29:26] no luck
[09:30:35] :sadpanda:
[09:44:29] ok I summarized our discussions
[09:44:35] - backup1012: https://phabricator.wikimedia.org/T377853#10261974
[09:44:56] - ms-be20XX: https://phabricator.wikimedia.org/T371400#10261945
[09:45:04] the only next steps that I can see are
[09:45:19] 1) Wait for DCops to follow up with supermicro on the config-j host cabling etc..
[09:46:11] 2) Assume that some hosts don't have the BBU for T377853 and try to fix the nagios scripts
[09:46:12] T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853
[09:46:21] does it make sense? Anything missing?
[09:47:15] elukey: not to bother you too much, but could that be on a separate ticket/doc/convo
[09:47:25] the bbu thingy there was a side thing
[09:47:43] the purpose of that ticket is to implement proper monitoring
[09:48:06] which is something we can do ourselves
[09:48:33] the spec change is concerning, but should be handled separately
[09:52:06] okok I'll send an email thread to Willy, then we'll decide if a new task is needed or not. I'll add you in CC, does it work for you?
[09:52:24] yeah, it may not need a task, just a conversation
[09:52:47] it is just that with that host, the other actionable is more important for me right now
[09:53:04] the other thing is (for backups) more of a decision to go forward
[09:53:14] aside from the jbod issues
[09:54:04] yes I think in this case you probably will have the final choice (and I can understand that you may be forced to proceed due to space concerns)
[09:56:48] I don't think I should have any final decision, but I thought it was important to bring up because I suspected it was not a conscious decision
[10:00:34] well it will surely depend on your opinion, I guess it is a decision about buying a new node or not
[10:05:12] my main worry is future hosts, especially high performance ones
[10:05:36] for those we'll surely need to change specs
[10:27:23] arnaudb: did you notice db2146 ? it may have some memory leak, it is using 90% of memory. Not urgent to restart it, but maybe something to research if there is an obvious reason (hanging transactions, version bug, different config, etc)
[10:29:56] email sent!
[10:30:18] jynus: did not notice, will check thanks for raising it!
[10:38:40] nothing obvious from processes or sys schema info afaict, I'll get back to it a bit later. it is depooled for safety