Quote from RCobb1 on February 18, 2020, 4:59 pm
Hopefully someone can point me in the right direction. I've had an Antsle One for a little over 2 years. I recently set up a small k8s cluster using RancherOS KVMs on my antsle. After a few weeks, the Antsle box started becoming completely unresponsive (no Antsle GUI, no antlets running), but I COULD still SSH to edgeLinux. Usually a simple "sudo reboot" fixed it for another week or so, and then it would happen again. On Saturday it happened again, and after I issued "sudo reboot", it never came back up. I connected a monitor and keyboard and rebooted: I see the Supermicro boot screen for about a quarter of a second, then the screen clears to a flashing cursor and stays there indefinitely (2 full days now, with no change). So I plugged an ethernet cable into the IPMI port and used the IPMI web page to look at console activity, and it looks like the boot SSD may have died, likely due to heat. The IPMI event log also shows a lot of critical and non-recoverable temperature events (screen capture attached).
I can replace the SSDs and reinstall edgeLinux without issue; what I'm worried about is heat killing 2 more of them. Any ideas what I need to look at?