Bad Luck Comes in Threes – Detecting Hardware Failure

It would be very easy if we could blame Geektonic for this. After he featured a MythTV system, we agreed to offer him a profile of our primary MythTV setup. It seemed like a good time to dust the system both inside and out, in preparation for some fresh photos.

We have a dedicated MythTV backend which doubles as a file server. We exercise basic power saving functions on it, including CPU frequency scaling. We discovered some hardware problems during the cleaning process.

There was an odd grinding noise that gradually appeared. Originally, we attributed it to a cable hitting one of the case fans, but there was no such cable. We bundled our cables together with velcro ties just in case. The server scales the CPU up to its full rated speed only when it is under load, and we had been having crashes whenever we tried any video transcoding. Finally, we tied together these two problems we had thought unrelated. The CPU fan was failing: it could still handle cooling at low load, but not when the computer went full throttle, at which point the system gradually overheated until it reached the BIOS shutdown temperature.

So we ran to the store the next day, bought a new CPU fan, and replaced ours. We’re always nervous about pulling off a long-running CPU fan, because the thermal paste between the heatsink and the processor can act, as the name implies, as a bond, and you can cause damage if you are not very careful in pulling the heatsink up. But there were no problems, and the CPU temperatures returned to normal.

Then, today, half a week later, we received notification of prefailure on one of the drives. As we speak, we’re moving data off of it. It will be removed, lashed to a different system, and checked with the manufacturer’s low-level test utility. Either way, it is still under warranty and can be exchanged for a replacement by the manufacturer.

It is said bad luck comes in threes, so we thought this was a perfect time to discuss how you can monitor and protect yourself from hardware failure. We’ll focus on these techniques for Linux users, but the idea applies to all systems.

  • S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) – A system built into virtually every modern hard drive that monitors several variables to predict drive failure before it happens.
  • Temperature/Fan – There are temperature and fan speed sensors in motherboards that permit monitoring of computers for overheating.

These are, in our opinion, two very important things to enable if you are going to protect your computer. Both should be enabled in the motherboard BIOS to start. If your BIOS has a shutdown temperature option that turns the computer off when a set temperature is reached, enable it. Enable SMART reporting as well.

Smartmontools is available in the repositories of most major Linux distributions. Make sure it, or another SMART monitoring tool, is enabled and configured to send you an alert if it detects anything. Smartmontools on one of our systems was configured to send email to the root account, which was local to the system and rarely read. Make sure the alert is sent somewhere you will see it.
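As a rough illustration of a SMART check you could script yourself, here is a minimal Python sketch. It assumes smartmontools is installed and that the drive reports an ATA-style "overall-health self-assessment" line from `smartctl -H` (SCSI/SAS drives word this differently). The function names `smart_health_passed` and `check_drive` are our own, made up for this example, and a missing result line is deliberately treated as a failure.

```python
import re
import subprocess


def smart_health_passed(smartctl_output: str) -> bool:
    """Return True only if the text contains an ATA-style PASSED health result.

    Assumption: the output includes a line such as
    'SMART overall-health self-assessment test result: PASSED'.
    Anything else, including no result line at all, counts as not passed.
    """
    match = re.search(
        r"overall-health self-assessment test result:\s*(\S+)",
        smartctl_output,
    )
    return bool(match) and match.group(1) == "PASSED"


def check_drive(device: str) -> bool:
    """Run `smartctl -H` on a device (usually needs root) and parse the result."""
    # smartctl uses non-zero exit codes as status bits, so do not raise on them.
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return smart_health_passed(result.stdout)
```

Calling something like `check_drive("/dev/sda")` from a cron job and mailing yourself when it returns False would cover the basics, though the smartd daemon that ships with smartmontools is the more complete route.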

For temperature and fans, the standard is lm_sensors, which may take some tweaking. Check your estimated CPU temperature in BIOS, then boot and compare it to the one in lm_sensors. If they don’t match, you may need to tweak your settings. Make sure this also generates an alert when it reaches a threshold so you can take action.
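To sketch what a threshold alert might look like, the Python below parses the raw output of `sensors -u` and reports any temperature reading at or above a limit you choose. It assumes the usual raw layout, where an unindented label line ending in a colon is followed by indented `tempN_input:` values. The names `hot_readings` and `read_sensors` are hypothetical, and the right limit depends entirely on your hardware.

```python
import re
import subprocess


def hot_readings(sensors_raw: str, limit_c: float) -> dict:
    """Return {label: temperature} for each temp*_input at or above limit_c."""
    hot = {}
    label = None
    for line in sensors_raw.splitlines():
        # Unindented lines ending in ':' are sensor labels, e.g. 'Core 0:'.
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            label = line.rstrip()[:-1]
            continue
        # Indented 'tempN_input: 45.000' lines hold the current reading.
        m = re.match(r"\s+temp\d+_input:\s*(-?\d+(?:\.\d+)?)", line)
        if m and label is not None:
            value = float(m.group(1))
            if value >= limit_c:
                hot[label] = value
    return hot


def read_sensors() -> str:
    """Return raw lm_sensors output (assumes the `sensors` command is installed)."""
    result = subprocess.run(
        ["sensors", "-u"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A cron job could call `hot_readings(read_sensors(), 70.0)` and send mail whenever the result is not empty. The 70.0 here is only a placeholder, not a recommendation.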

We’re a bit mystified as to why features like this aren’t built into Windows, but many manufacturers do offer their own monitoring utilities you can install to monitor vitals. Either way, by setting up your computer early on to monitor for these things, you can head off some catastrophic failures. Barring that…back up often.