[09:30:12] elukey: We have a puppet-merge clash. Shall I merge your `pyrra: add tonecheck Pyrra config (3ab40d9aa2)`, or feel free to merge mine.
[09:30:27] btullis: I pinged you in ops as well :D
[09:30:29] merging
[09:31:14] Thanks.
[09:32:31] done
[13:29:34] legoktm: just a heads-up that https://apt-browser.toolforge.org/ has been down a few days now... if I can do anything to help, LMK
[13:33:41] as an FYI, I am going to roll-restart the eventgate-main eqiad/codfw pods to pick up the config of a new stream (new tegola queue for maps-bookworm)
[14:53:36] elukey: hopefully tomorrow you have some time so I can talk to you about https://gerrit.wikimedia.org/r/c/operations/puppet/+/1168176
[14:54:55] sure sure!
[15:02:40] is there any interest in adding the cloud ceph incident to the agenda for the "incident review ritual" today? cc akosiaris btullis dcaro
[15:03:31] dhinus: I believe there is already a topic, but feel free to suggest it to the onfire people
[15:03:39] dhinus: the Friday one? Unless it's clearly and fully resolved, my default stance is that it would be too soon.
[15:03:49] akosiaris: yes, the one from Friday
[15:04:00] it tends to take a while to gather all the data in the docs in order to have a meaningful discussion
[15:04:02] happy to postpone it to the next round
[15:04:14] otherwise it's full of "I don't know, action item to look it up"
[15:04:15] I haven't attended many previous ones, so I'm not sure what the process is
[15:04:30] The process is what you did; my concern is just that it might be too soon
[15:04:36] sounds good :)
[15:06:20] who should I ping to get it on the list for the next meeting?
[15:08:39] hmmm, lmata and sobanski are the typical people for this, but I am not sure they'll be around for the next one. That being said, they can put it in the queue for the next one right now and we can discuss it in a couple of weeks
[15:09:11] perfect :)
[15:09:48] the incident doc is https://docs.google.com/document/d/1CLY_iZyXDTyJEl4fKYeU1aRSNsheO9-TZcjyW9wFyEk/edit?tab=t.0 -- I'll create the on-wiki report shortly
[15:09:55] thanks!
[15:40:57] Hi SRE folks! I just filed https://phabricator.wikimedia.org/T399469 to complain about an unusable docker-registry.wikimedia.org/wikimedia-buster image. Can someone suggest additional tags I should use?
[15:41:02] (tags on the phab ticket)
[15:42:21] AFAIK buster was removed from the Debian mirrors over the weekend
[15:42:50] ooh interesting.. but we still have Buster hosts running (for example, the deployment server, where scap still needs to be supported)
[15:43:27] the deployment servers are on Bullseye, though?
[15:43:29] no, that's wrong.. not the deployment server... I'll recheck to see if any buster targets remain for scap. I'd like to drop that support
[15:44:20] the remaining buster nodes are mwmaint* (but no longer in use) and maps*
[15:44:28] in prod we have 18 buster hosts: maps, mwmaint, and the puppetmasters
[15:44:35] (the old ones, not puppetservers)
[15:44:49] and the puppetmasters, but they only remain to run the remaining buster nodes and the conf* servers, which are also on Puppet 5
[15:45:29] ok! I don't see hits for maps* in deploy1003:/etc/dsh/group/scap_targets and no hits for mwmaint* in any other files in /etc/dsh/group, so I think we're good to drop buster in scap. Woohoo! Thanks all
[15:45:42] great :-)
[15:49:53] ❤️
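The buster-target check described at 15:45 amounts to something like the following; a rough sketch assuming the dsh group files live under /etc/dsh/group on deploy1003 as quoted above, with illustrative grep patterns:

    # run on deploy1003: look for buster-era host groups in the scap targets
    grep -En 'maps|mwmaint' /etc/dsh/group/scap_targets
    # and in every other dsh group file
    grep -Ern 'maps|mwmaint' /etc/dsh/group/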
[16:00:31] as an FYI, Moritz and I are working on maps-codfw; tegola and kartotherian are running only in eqiad. Tomorrow we'll bootstrap the cluster; at the moment, if you really need to repool kartotherian and tegola in codfw, you'll need to revert https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1165550 first. For any maps-related issues please ping me or Moritz :)
[16:00:40] cc: nemo-yiannis
[21:15:35] sukhe: given your history of excellent advice, do you have a minute for me to pick your brain?
[21:16:14] (regarding a hardware matter)
[21:16:16] * urandom spits
[21:18:10] urandom: sure (how helpful I was is debatable, but I can try)
[21:18:42] Ok, we have a machine with 8 SSDs arranged into two RAID10s
[21:18:53] one of which ofc we expect to boot from
[21:19:44] there were two failed SSDs that needed to be replaced, so out of an abundance of caution we took the machine down to validate the serial numbers and make sure we'd pulled the correct drives
[21:20:01] (since we were at the limit of the redundancy)
[21:20:16] this is a software RAID10 btw
[21:20:33] ok
[21:20:35] at any rate, the drive replaced was sda
[21:21:03] which seems to be C: as far as the BIOS is concerned, and the machine will no longer boot
[21:21:31] honestly, if that *is* the issue, then just rebooting it without replacing the SSD would have wedged it, I assume
[21:22:48] most of that is theory or supposition, other than the fact that it won't boot... that is an absolute fact
[21:23:20] urandom: ok. which host is this?
[21:23:26] aqs1012
[21:24:29] one final detail: thinking that d-i put a bootloader on each of those RAID members, I had dcops try moving the second device (sdb) to the first slot in the hope it would be detected as C:
[21:24:53] that did not work; we should probably move that back before confusion sets in... but just thought I'd be clear about that
[21:25:49] urandom: it says, though, that no physical disks are detected?
[21:25:55] > RAC0501: There are no physical disks to be displayed.
[21:26:10] this is from the aqs1012 idrac web interface
[21:26:21] hrmm... these are SATA devices fwiw
[21:26:49] I do see the "booting from C:"
[21:28:14] ^ on the virtual console; but I am not sure if that actually means anything, or if it is just saying that because it can't find anything relevant in the boot order
[21:29:12] R440, ok. and it is also complaining about the idrac firmware.
[21:29:24] oh, you think C: can be any one of a valid set of drives?
[21:29:56] urandom: not 100% sure, but I don't see any detected drives.
[21:29:59] I do see two errors:
[21:30:04] > RAC0501: There are no physical disks to be displayed.
[21:30:08] > RAC0503: There are no out-of-band capable or re-configurable controllers to be displayed.
[21:30:38] https://usercontent.irccloud-cdn.com/file/RTKoDm5g/image.png
[21:30:42] maybe this needs https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions
[21:31:18] that is about copying the partition table from another drive etc.
[21:31:28] mutante: it does need that, but we'll have to get it booted up first
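For reference, the kind of software-RAID member replacement that page covers looks roughly like the following; this is a generic sketch for a Linux mdadm array, not taken from the wiki page itself, and md0, sda, sdb and the partition layout are illustrative:

    # copy the partition table from a healthy member (sdb) onto the new disk (sda)
    sfdisk -d /dev/sdb | sfdisk /dev/sda
    # add the matching partition back into the degraded array and let it resync
    mdadm --manage /dev/md0 --add /dev/sda2
    cat /proc/mdstat
    # reinstall the bootloader so the new member is bootable as well
    grub-install /dev/sda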
[21:31:50] mutante: yeah, but it's not showing up in the iDRAC at all, and that's for the SW RAID rebuild part
[21:32:03] sukhe: see the screencap above
[21:32:20] that's all I ever see on similarly equipped hardware
[21:32:22] urandom: yep, I see it
[21:32:31] insofar as the drac is concerned
[21:32:42] so on other aqs hosts, you don't see anything on the storage tab?
[21:33:10] (I mean this is of course not aqs-specific, but since we are talking about those, and in the context of the aqs-specific HW config of the R440)
[21:33:11] it's worth double-checking, but I don't think so, no
[21:34:19] based on "RAC0503" I would say let them try to hard-reset the DRAC itself
[21:34:48] sukhe: confirmed; aqs1014 is the same, nothing shows under the storage tab
[21:34:48] yeah, I guess that is worthwhile. also the firmware seems old, so that's worth a shot. none of this explains why it stopped working in between the disk swaps, but yeah
[21:34:52] urandom: ok, interesting.
[21:35:04] trying that
[21:35:49] yeah, and it even shows the Debian login, so that's good
[21:36:30] huh?
[21:36:35] aqs1014
[21:36:41] not 12
[21:36:50] oh, yeah, it's fine :)
[21:36:58] so far.
[21:37:01] * urandom knocks on wood
[21:38:40] "drain flea power"
[21:38:42] ok, out of ideas so far.
[21:38:54] tell dcops to do the "flea power" thing
[21:39:18] shut it down.. unplug the power cable, hold the power button for 20 seconds...
[21:39:22] urandom: try racreset once. what did dc-ops say?
[21:39:38] if it still does this after a (hard) DRAC reset.. then I would escalate to Dell
[21:39:52] and sorry, have to run now for dinner, but happy to pick it up tomorrow. we all love a good HW problem anyway :]
[21:40:09] do we though?
[21:40:10] :)
[21:40:15] thanks for the help tho!
[21:40:26] mutante: why do you suspect the drac?
[21:40:44] because of that error, RAC0503
[21:40:58] "no controller" probably means the oob controller.. the drac itself
[21:41:08] and others describe similar issues with no disks detected
[21:41:13] that come back after a drac reset like this
[21:41:40] and just because I know that is what they would do anyway before it gets escalated to Dell
[21:42:04] I think this host is out of warranty
[21:42:09] e.g. https://www.dell.com/support/kbdoc/en-nz/000035224/vxrail-idrac-console-reports-rac0501-rac0503-errors-and-no-disks-are-detected
[21:42:45] urandom: how bad is running the reimage cookbook?
[21:43:06] assuming you get it to boot but the data is gone.. I mean
[21:43:22] I mean, that's where we are, trying to determine whether it's a wash
[21:43:48] bad enough that I'd really rather not
[21:43:58] yea.. so.. ignoring all other things.. we have disks, and a DRAC that claims there are no disks
[21:44:12] that's why I expected this to become "drac reset"
[21:44:24] that's literally every one of these hosts that I can recall
[21:44:25] and "reseat cables"
[21:44:33] I honestly thought it was "normal" for this config
[21:45:05] those "RAID rebuild" docs start with "check if it (even) detects the disk" though
[21:46:06] and didn't s.uke see the disks on that other host?
[21:46:17] Ok, that wiki page is using lshw
[21:46:31] I'm certain we'd see the affected disks that way...
[21:46:43] you can even see them detected during POST
[21:46:57] and they are there on the inventory page
[21:47:26] that page you referenced also seems to say it's cosmetic? https://www.dell.com/support/kbdoc/en-nz/000035224/vxrail-idrac-console-reports-rac0501-rac0503-errors-and-no-disks-are-detected
[21:47:38] what is a VxRail btw?
[21:48:47] oh... it's not what we have here (though the article might be relevant)
[21:48:51] I didn't see that as important. It's just one of many examples of people having a DRAC with RAC0503 and then resetting it
[21:49:06] is the config lost if you reset?
[21:51:22] yeah, it does
[21:51:27] resets to factory defaults
[21:51:29] no, there are soft and hard resets, which should be fine, and then there is "racresetcfg" to actually factory-reset
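For the record, a hedged sketch of those racadm variants (run against the host's iDRAC, e.g. over SSH to the mgmt interface); exact behaviour can vary by iDRAC generation:

    racadm racreset soft   # graceful iDRAC reboot, configuration is kept
    racadm racreset hard   # forced iDRAC reboot, configuration is still kept
    racadm racresetcfg     # wipes the iDRAC back to factory defaults, last resort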
[21:51:51] but I would only do soft, and if that doesn't fix it, hand it over to dcops
[22:23:03] FYI, in case anyone else bumps into this: if you run into issues updating packages with `reprepro`, I suspect this is fallout from [0]. Flagged on the patch, and hopefully the old databases can indeed get `clearvanished` to clear this up.
[22:23:03] [0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1169106
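A minimal sketch of that cleanup, assuming the repository base directory is /srv/wikimedia on the apt host (the path is an assumption, not taken from the log), run as the user that owns the repo:

    # drop reprepro's package databases for distributions/architectures
    # that no longer appear in conf/distributions
    reprepro -b /srv/wikimedia clearvanished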