Assorted notes

by Juliet Kemp

I saw this article today about rebooting a frozen system without having to hit the power button, which looks like it might be useful. We do have this happen once in a while - usually for no readily apparent reason, and since I tend to go by the "once is an anomaly, twice in quick succession is time to investigate" rule, I confess I don't look too hard.
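I haven't copied the article's steps here, but the usual Linux trick is the magic SysRq key; a quick sketch, assuming your kernel was built with CONFIG_MAGIC_SYSRQ and you can still get keystrokes or a shell in:

    # Make sure SysRq is enabled (persist it in /etc/sysctl.conf)
    echo 1 > /proc/sys/kernel/sysrq
    # At the console: Alt+SysRq+R, E, I, S, U, B in turn
    # (unRaw keyboard, tErminate processes, kIll processes, Sync, remount read-only, reBoot)
    # Or, if a shell still responds, trigger the same actions directly:
    echo s > /proc/sysrq-trigger   # sync discs
    echo u > /proc/sysrq-trigger   # remount filesystems read-only
    echo b > /proc/sysrq-trigger   # reboot immediately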

Also very useful today was a step-by-step guide to installing IRAF on Debian (it literally tells you exactly what to type). IRAF is an astronomy program, so this information is probably of use to a very small subset of people, but still. I have found it a PITA to install, partly because it insists on belonging to its own user rather than root. This install (of the latest beta) went very smoothly, although I haven't heard back from the requesting user yet.
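The guide has the exact commands, so I won't repeat them here, but the own-user quirk boils down to something like this (user details and paths are illustrative, not the guide's actual steps):

    # IRAF wants to be owned and run by its own account, not root
    adduser --disabled-password --gecos "IRAF software owner" iraf
    mkdir -p /iraf/iraf
    chown -R iraf:iraf /iraf
    # then unpack the distribution and run its installer as that user
    su - iraf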

Finally, a note on RAID5. A fortnight ago two drives on my RAID5 array failed (I was out of the office at the time & can't find anything in the logs to indicate why). Much beeping ensued, and the array appeared to be dead (the suspicion was that the hotspare had failed partway through the rebuild, i.e. all data would be lost). The very competent support engineer managed to fix it after several hours (the old drives resurrected themselves; neither of us has any idea why), and I've now replaced both faulty drives and done the required rebuilds.

However: it reminded me of concerns I've had in the past about the reliability of SATA drives, and the deceptive appearance of redundancy that RAID5 has. Whilst in theory (with a hotspare) you can lose two drives in succession and still not lose your data, in practice a) drives tend to fail around the same time, and b) the thrashing caused by a rebuild may be enough to trigger failure in another drive (in which case you do lose data).
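A related risk is unrecoverable read errors: during a rebuild every surviving drive has to be read end to end, so a single bad sector found at that point is enough to spoil it. A back-of-envelope sketch, assuming a consumer-class error rate of one bit in 10^14 and five surviving 500 GB drives (adjust for your own kit):

    # Rough chance of hitting at least one unrecoverable read error while
    # reading 5 x 500 GB in full during a rebuild, at 1 error per 1e14 bits
    awk 'BEGIN { bits = 5 * 500e9 * 8; printf "%.0f%%\n", (1 - exp(-bits / 1e14)) * 100 }'
    # prints 18% or so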

I'm not sure if there are any much better alternatives, though; I guess the lesson is ALWAYS BACK UP. (Yes, I did have backups, even if I didn't need them in the end!)


8 Comments

Shawn Wheatley
2007-09-13 10:47:34
Your concerns are certainly well-founded:
http://labs.google.com/papers/disk_failures.pdf


The best summary I could find on the paper was here:
http://andirog.blogspot.com/2007/02/google-findings-of-disk-failure-rates.html


"...the higher possibility of failure of new drive during RAID rebuild..."

Mark Wyatt
2007-09-14 04:28:08
Hmmm, let me ask you a question. Imagine you have a system with a spare drive. What happens if the spare drive fails before any of the in-use drives fails? Do you get any notification?
If not, you carry on believing the spare drive to be good until you need it. Then, when you do need it because one of the in-use drives has failed, everything really does go pear-shaped in a way that gets your attention.
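For what it's worth, on Linux md RAID you can arrange to be told; a sketch, assuming mdadm and smartmontools are installed:

    # Have mdadm watch all arrays and mail root about failed/degraded devices, spares included
    mdadm --monitor --scan --daemonise --mail=root
    # Pair it with smartd (smartmontools) so idle spares get regular SMART self-tests too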


I have to say, the more I think about it, the more I find the idea of building RAID arrays out of consumer-spec hard drives (not that I know that this is what you did) to be worthy of scepticism. There is a case for using enterprise-spec SATA drives, which have longer warranty periods, or maybe the Raptor series of hard drives, which seem to be mechanically low-end (10k RPM) SCSI drives but with a SATA interface.


I'm also sceptical about some of the 'get free RAID with this motherboard' chips. I have had a play with some of the early ones, and they didn't have the sophistication of, say, SCSI RAID solutions. Maybe modern ones are better, I don't know.


Sure, consumer SATA drives are cheap, offering bargain levels of storage per unit cost, but that won't seem like the most important thing when you get failures later.

Duncan
2007-09-14 05:26:39
Are you talking RAID-5 or RAID-6? RAID-5, you can lose a single disk from the array without data loss, while RAID-6 makes it two disks. I know, as that's why I chose RAID-6 for the main portions of my home system.
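For the Linux md case, the difference is just the level you ask for at creation time; a sketch with made-up device names:

    # RAID-5 over four disks: survives one failure
    mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[bcde]1
    # ...or RAID-6 over the same disks: two parity blocks, survives any two failures
    mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[bcde]1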


(FWIW, that was after losing two drives in two years... I've seen others observe that quality was bad for drives between ~80 gig and 300 gig or so; that's about when warranty coverage shrank to about a year for most drives, and my experience backs it up: I'm used to having drives last through two upgrade cycles and five years, and /then/ often throwing them out still in running condition, only because they are simply too small and slow to be practical any longer.)


Anyway, with RAID-5 you'd expect to have to restore from backup if you lost two drives and couldn't bring one back online at least long enough to sync up a spare; with RAID-6 you could lose two drives and survive, and it'd be the third one that brought you down.


As for the guy mentioning on-board RAID, that's a known and generally agreed problem. Onboard RAID is generally "firmware" RAID, which amounts to software RAID (done in the drivers), with the config and perhaps just enough smarts in the BIOS to read far enough into the array to boot from it.
You are right to be skeptical of it, as it isn't true hardware RAID by a long shot.


FWIW, Linux kernel RAID is generally just as fast (as long as you don't use enough port multipliers to exceed the bus bandwidth), MUCH more flexible and reliable than firmware/onboard RAID, and reasonably comparable with hardware RAID on modern CPUs, although each has its benefits and drawbacks.


A big benefit of kernel md/raid is that it's truly hardware agnostic: stick the drives in any computer with the appropriate standard drive interface and enough ports, no RAID controller required, and the drives can be read and the RAID reassembled and brought online. Another is that you can split the same set of drives into multiple differing RAID types: I'm running RAID-1 for boot, RAID-6 for decent redundancy on my main system, and RAID-0 for speed and space where redundancy isn't necessary, on different partitions of the same set of four drives.
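Sketched out (partition numbers made up), that sort of layout is just separate arrays built over matching partitions:

    # four drives partitioned identically; one array per partition "row"
    mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[abcd]1   # /boot, 4-way mirror
    mdadm --create /dev/md1 --level=6 --raid-devices=4 /dev/sd[abcd]2   # main system
    mdadm --create /dev/md2 --level=0 --raid-devices=4 /dev/sd[abcd]3   # speed/space, no redundancy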


A drawback of md/kernel RAID is that /boot must be on either non-RAID or RAID-1/mirrored (though linear may work as well; not RAID-0/4/5/6, however), as grub and LILO don't really understand RAID and can only work with RAID-1 by reading a single copy of the mirror as if it were a single drive, not as RAID. (That's the only reason I'm using RAID-1; otherwise it'd just be RAID-6 for the data I want reasonably protected, and RAID-0 for temp data and things such as the distribution's local repository cache, which I can trivially redownload/rebuild.)
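The usual belt-and-braces addition there is to install the bootloader onto every disk in the /boot mirror, so the box boots from whichever member survives; roughly (disk names made up):

    # /boot is a RAID-1 across sd[abcd]1; put grub onto each disk's MBR
    for disk in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        grub-install "$disk"
    done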


Duncan

Tom
2007-09-18 05:21:17
On-motherboard RAID is almost always a bad idea. Most of the time it's really software RAID with a special driver. What advantage does that have over plain software RAID? Sometimes it gets you easier admin, shuffling, etc. But it also ties you to that card/firmware. Using plain software RAID means you can use any controller or motherboard when you need to recover.
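With md that recovery really is about as simple as it sounds; a sketch, assuming the superblocks are intact and /dev/sdb1 stands in for one of the moved drives:

    # Scan all drives for md superblocks and reassemble whatever arrays are found
    mdadm --assemble --scan
    # Or look at one member first to see which array it thinks it belongs to
    mdadm --examine /dev/sdb1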


Of course this assumes you have enough CPU, but today's multi-gigahertz CPUs have cycles to spare.
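One quick way to see how cheap the parity maths is on a modern box: the md driver benchmarks its RAID-6 algorithms when it loads and logs the results, assuming your kernel includes the raid456 code at all:

    # The kernel times several RAID-6 syndrome implementations at load and picks the fastest
    dmesg | grep -i raid6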

Juliet Kemp
2007-09-20 06:24:15
Shawn - thanks for the links!


Mark - our RAID array is enterprise-level, but I'm still a little sceptical about drive lifetimes. In my ideal world, I'd buy SCSI, but unfortunately in my actual world I have a budget...


Duncan - RAID5, but with hotspare (so in theory 2 disks can go without loss, as long as this doesn't happen during the rebuild). If buying again I'd consider RAID6; at the time it wasn't IIRC offered on this kit. It is hardware RAID, though - I share your scepticism about software RAID! (it's also IME a PITA with Linux).

Brad Silva
2007-10-09 21:08:21
(This is a little late)
One thing that commercial RAID systems sometimes use is drive scrubbing, i.e. something reads through all of the physical drives (including the spare) looking for bad sectors before they turn up during a normal read. There are a couple of advantages to this. In the case of a spare, it can detect a failure even though the spare may not yet be in use. Another bad situation is when one drive fails and bad sectors are then found on one of the functional drives during the reconstruction; when this happens, the RAID goes offline. Scrubbing can detect this and fail out the drive with the bad sectors early.
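Linux md grew the same facility for the active members (a SMART daemon can cover idle spares); a sketch for an array called md0:

    # Read and verify every active member of md0 in the background
    echo check > /sys/block/md0/md/sync_action
    # Watch progress
    cat /proc/mdstat
    # (Debian's mdadm package ships a checkarray script to run this from cron)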


Bottom line of course is that RAID cannot prevent bad things happening, only make them less likely.


Slightly different topic; on the RAID-on-motherboard thing: every time I've done a performance test between motherboard or cheap RAID cards and Linux software RAID, the Linux RAID has won. It's amazingly efficient. The only downside is that a drive failure in software RAID can sometimes cause the system to crash. No data's lost, but the system needs a reboot. Still, I use Linux SW RAID quite a bit and love it.
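For a crude version of that comparison, a couple of sequential-read numbers already say a lot; a sketch, with /dev/md0 standing in for whichever array you're testing:

    # Quick sequential-read figures for the array as a whole
    hdparm -tT /dev/md0
    # Or a plain 4 GB streaming read
    dd if=/dev/md0 of=/dev/null bs=1M count=4096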


Brad

Hans
2007-10-12 12:05:07
Unfortunately, I have been there, and spent many nights recovering customer data from broken arrays.


On hardware RAID5: I've seen many cases of data loss due to failing power supplies, temperature issues, or SCSI errors. In all cases the RAID controller suddenly decided that two or more disks had failed, when only one (or none) was actually broken. This had nothing to do with consumer-level drives or hot spares, and everything to do with lousy (but expensive) controllers. A lousy PSU may corrupt your data even with RAID6, so you may want to mirror your data across controllers or even servers. It's also important to perform regular surface scans to check for bad blocks, and parity checks to make sure the parity matches the data.
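On the surface-scan side, SMART self-tests cover the bad-block half of that and can be scheduled rather than run by hand; a sketch, with /dev/sda standing in for each member drive:

    # Start an extended (long) self-test; the drive reads its whole surface itself
    smartctl -t long /dev/sda
    # Check the result and error counters once it's done
    smartctl -a /dev/sda
    # smartd (from smartmontools) can schedule these regularly and mail you on failure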


On-board "RAID": PLEASE don't! When your motherboard fails after a few years of service, your data will be lost because a replacement isn't available. Use software RAID instead or buy lots of spares.


Linux software RAID hasn't given me any problems yet, but be aware that most RAID implementations can only detect problems with your drives, not with your data. For that reason, I really value NetApp's (expensive) filers and Sun's Solaris ZFS filesystem (freely available). Both provide a higher level of protection than 'normal' RAID can offer, because data integrity is checked on every read.
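For anyone curious what that looks like in practice on Solaris, a double-parity pool plus a scrub is only a couple of commands; a sketch with made-up pool and disk names:

    # raidz2 is roughly RAID-6, but every block carries a checksum verified on read
    zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0
    # Walk the whole pool, verifying checksums and repairing from redundancy
    zpool scrub tank
    zpool status -v tank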

ila
2007-12-19 00:38:55
I had a server (OS: Win2k3, Intel SCSI, RAID5). I wanted to add another SATA hard drive, so I shut the system down and plugged the new SATA drive in, but the server didn't come up properly: it booted and said it didn't recognise my hard disk. I unplugged the SATA drive, but now everything is missing and I can't detect my SCSI hard disks.