[07:38:36] heads-up: I'll migrate the eqiad installserver to nftables in 10 minutes
[08:05:17] mmhh I launched this pipeline/job again and it is stuck in the same spot, thoughts/ideas on what it could be? https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/jobs/829846
[08:09:47] ok I'll just wait I guess, looks like slow disk i/o
[19:33:18] I have a host (kafka-jumbo1016) seemingly stuck in PXE boot. I've been able to take a screenshot via the IDRAC web ui. https://wikimedia.slack.com/archives/C055QGPTC69/p1779303718528979?thread_ts=1779274323.128399&cid=C055QGPTC69
[19:33:28] Does this ring a bell for anyone?
[19:34:56] hmm actually maybe I should ask #wikimedia-sre-foundations
[19:39:22] brouberol: I wonder if this is because you have a partman recipe specified for BIOS but it is attempting to do UEFI instead (which is the default for all reimages)
[19:39:56] and that probably requires a reprovision with --legacy
[19:40:01] 'kafka-jumbo101[0-8]':
[19:40:01] - reuse-parts.cfg
[19:40:01] - partman/custom/reuse-kafka-jumbo.cfg
[19:40:31] I may be wrong, but we saw someting similar for a recent LVS reimage, see https://phabricator.wikimedia.org/T421421#11914273
[19:40:35] I can't type, something
[19:41:20] Thanks! brb, my daughter can't sleep, I'll come back when I can
[19:46:59] Nice, so I should kill the current cookbook, and simply run the provisioning one with --legacy?
[19:47:24] At least there shouldn’t be any harm in trying this I guess
[20:08:21] (famous last words, as I've never used the cookbook before. Here's to hoping I don't f things up some more)
[20:18:29] hmm, the cookbook does not seem to be able to shut down the host. I'm aborting
[20:18:38] welp, I probably made it worse then
[20:24:39] sukhe: yeah that makes sense
[20:24:43] nice find
[20:26:52] jhathaway if that's ok with you, I'll follow up here instead of #-sre-foundations to avoid too much x-channel activity
[20:27:08] please do
[20:27:14] so, I'm attempting to re-run the reimage cookbook, to see if it still fails the same way
[20:28:11] I got to
[20:28:12] Running IPMI command: ipmitool -I lanplus -H kafka-jumbo1016.mgmt.eqiad.wmnet -U root -E chassis bootparam set bootflag none options=reset
[20:28:12] Running IPMI command: ipmitool -I lanplus -H kafka-jumbo1016.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
[20:28:12] Running IPMI command: ipmitool -I lanplus -H kafka-jumbo1016.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
[20:28:12] Checked BIOS boot parameters are back to normal
[20:28:12] [1/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for kafka-jumbo1016.eqiad.wmnet not found yet
[20:28:25] and it's back to looping over these checks
[20:29:57] with the same data on screen visible through the IPMI web ui
[20:31:09] from the serial console it seems to be stuck in the debian installer at the moment
[20:31:13] [/dev/sda] ERROR: recipe partition count (2) != actual partition count (1)
[20:31:19] is the error on the screen
[20:31:54] ooh, I might have been looking at the wrong thing then, aka the vconsole in the IPMI ui instead of the serial console
[20:34:05] the first part of the error is "reuse-parts: Recipe mismatch with existing partitioning"
[20:34:20] tell you what, it's 10:30PM local time, way past the time to futz with parted. I'll pick this up in the morning w/ b.tullis and we'll see how we can fix this
[20:34:43] nod, happy to help as well, enjoy your evening
[20:34:57] I'd rather we don't lose the data on the second disk, as otherwise we'd force some pretty large catchup from other brokers (about ~10TB of data)
[20:35:15] s/the second disk/the RAID array
[20:35:21] thanks again y'all
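
Note on the "[1/240, retrying in 10.00s]" line in the 20:28:12 cookbook output: that format is typical of a bounded retry loop, where a check is repeated until it passes or the attempts run out. If the delay stays fixed, 240 tries at 10 s would mean waiting up to roughly 40 minutes before giving up. The sketch below only illustrates that pattern; it is not spicerack's implementation, and `retry`, `CheckError`, and the `sleep` parameter are hypothetical names made up for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CheckError(Exception):
    """Hypothetical error raised while the awaited condition is not met yet."""


def retry(
    check: Callable[[], T],
    tries: int = 240,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call check() until it succeeds, waiting a fixed delay between attempts.

    Re-raises the last CheckError once all tries are used up.
    """
    for attempt in range(1, tries + 1):
        try:
            return check()
        except CheckError as exc:
            if attempt == tries:
                raise  # out of attempts: surface the last failure
            print(
                f"[{attempt}/{tries}, retrying in {delay:.2f}s] "
                f"Attempt to run '{check.__name__}' raised: {exc}"
            )
            sleep(delay)
    raise ValueError("tries must be at least 1")
```

With a loop like this, a host that never finishes rebooting (for example because the installer is stuck on a partitioning error) just keeps producing the same retry line until the attempts are exhausted, which would match the "looping over these checks" behaviour described at 20:28:25.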