Discussion:
Mirror Drive Failure
TheScullster
2011-06-22 08:53:18 UTC
Hi all

We have an HP ML370 G6 domain controller, about a year old, running
Windows Server 2008 R2.
The OS is on a mirror, but it recently dropped one of the drives.
Basically the drive showed the orange failure light, but when the server was
rebooted, it rebuilt the array and operated fine for approximately a month
(until today :().

Again the same drive is showing failure.

OK so this time I will suggest that our IT support company (I don't do
server work) replace the failed drive rather than allowing a rebuild.
But what else can cause this, and how can I build resilience against this
type of failure?
Fail-over server perhaps?


Thanks

Phil
unknown
2011-06-29 04:25:28 UTC
Not good, hopefully you were fully backed up?
--
wert
TheScullster
2011-07-04 10:30:46 UTC
Post by unknown
Not good, hopefully you were fully backed up?
Yes, backups of backups!
I expected the server to fail over onto the hot spare drive and continue
operating as normal.
What actually happened was that the server stopped performing its DHCP
role and the network ground to a halt.

Once the server was rebooted and a replacement drive introduced, the server
rebuilt the OS array in the background and functioned normally.
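
For what it's worth, a cheap extra layer of protection against this particular
symptom is to watch the DHCP Server service from a second machine and raise an
alarm the moment it stops, rather than finding out when the network grinds to a
halt. Below is a rough Python sketch of that idea; the host name, mail server
and addresses are made-up placeholders, and it assumes the standard Windows
service name DHCPServer and that the sc tool is available where the script runs.

"""Poll the DHCP Server service on a remote Windows host and email an alert
if it is not running.  Host and mail settings below are hypothetical."""
import subprocess
import smtplib
import time
from email.message import EmailMessage

DC_HOST = r"\\DC01"            # hypothetical domain controller name
SERVICE = "DHCPServer"         # standard service name for the DHCP Server role
MAIL_FROM = "monitor@example.local"
MAIL_TO = "admin@example.local"
SMTP_HOST = "mail.example.local"
CHECK_INTERVAL = 300           # seconds between checks


def service_running() -> bool:
    """Return True if `sc query` reports the service as RUNNING."""
    result = subprocess.run(
        ["sc", DC_HOST, "query", SERVICE],
        capture_output=True, text=True
    )
    return "RUNNING" in result.stdout


def send_alert(body: str) -> None:
    """Send a plain-text alert email via the local relay."""
    msg = EmailMessage()
    msg["Subject"] = "DHCP Server service alert on " + DC_HOST
    msg["From"] = MAIL_FROM
    msg["To"] = MAIL_TO
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    while True:
        if not service_running():
            send_alert("DHCPServer is not reporting RUNNING; check the server.")
        time.sleep(CHECK_INTERVAL)

Run from a standby box, that at least turns "the network has ground to a halt"
into an email a few minutes after the service dies.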

Phil
Dave Warren
2011-06-29 05:22:14 UTC
Post by TheScullster
We have an HP ML370 G6 domain controller, about a year old, running
Windows Server 2008 R2.
The OS is on a mirror, but it recently dropped one of the drives.
Basically the drive showed the orange failure light, but when the server was
rebooted, it rebuilt the array and operated fine for approximately a month
(until today :().
Again the same drive is showing failure.
Almost like the drive is bad...
Post by TheScullster
OK so this time I will suggest that our IT support company (I don't do
server work) replace the failed drive rather than allowing a rebuild.
But what else can cause this, and how can I build resilience against this
type of failure?
To be entirely honest, I'd be a little... annoyed... if a drive reported
as failing wasn't immediately removed from service and replaced in the
first place.

Maybe that's just me.
Post by TheScullster
Fail-over server perhaps?
A fail-over server is certainly an option. RAID-6 would protect you from
any two simultaneous drive failures, and RAID-10 from two as long as they
aren't in the same mirrored pair, at the cost of needing at least four
drives. If you're in a situation where possibly-failing drives might be
put back into service, this might be worth the minor up-front cost.
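
To put rough numbers on that, here is a back-of-the-envelope Python sketch
comparing the chance of losing the array outright under a plain mirror, RAID-6
and RAID-10, treating drive failures within a rebuild window as independent.
The per-drive failure probability is an illustrative guess, not a measured
figure.

"""Back-of-the-envelope comparison of array-loss probability for RAID-1,
RAID-6 and RAID-10, assuming independent drive failures within a rebuild
window.  The failure probability below is purely illustrative."""
from math import comb

p = 0.03          # assumed chance a given drive fails during the window
n_raid6 = 4       # drives in the RAID-6 set

# RAID-1 (two-drive mirror): lost only if both drives fail.
raid1_loss = p ** 2

# RAID-6: survives any two failures, lost on three or more.
raid6_loss = sum(comb(n_raid6, k) * p**k * (1 - p)**(n_raid6 - k)
                 for k in range(3, n_raid6 + 1))

# RAID-10 on four drives (two mirrored pairs): lost if any pair loses both.
raid10_loss = 1 - (1 - p**2) ** 2

for name, prob in [("RAID-1", raid1_loss), ("RAID-6", raid6_loss),
                   ("RAID-10", raid10_loss)]:
    print(f"{name:8s} chance of array loss: {prob:.6%}")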

It really depends on how sensitive to downtime you are; the odds of two
simultaneous failures are (IMO) fairly low. However, if the downtime of
restoring from backups or rebuilding will cost you more than a couple of
extra drives, the math is easy.
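
And since the "easy math" is worth writing down at least once, here is a toy
Python version of it, with purely hypothetical figures:

"""Toy downtime-versus-hardware cost comparison.  All figures are hypothetical."""
downtime_hours = 12            # assumed time to restore/rebuild after a double failure
cost_per_hour = 500.0          # assumed cost of the outage per hour
extra_drive_cost = 300.0       # assumed price of one additional drive
extra_drives_needed = 2        # e.g. going from RAID-1 to RAID-6/RAID-10

downtime_cost = downtime_hours * cost_per_hour
hardware_cost = extra_drive_cost * extra_drives_needed

print(f"Estimated downtime cost : {downtime_cost:10.2f}")
print(f"Cost of extra drives    : {hardware_cost:10.2f}")
if downtime_cost > hardware_cost:
    print("Extra drives pay for themselves on the first incident.")
else:
    print("Downtime is cheaper than the extra hardware in this scenario.")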
TheScullster
2011-07-04 10:39:15 UTC
Post by Dave Warren
Post by TheScullster
We have an HP ML370 G6 domain controller, about a year old, running
Windows Server 2008 R2.
The OS is on a mirror, but it recently dropped one of the drives.
Basically the drive showed the orange failure light, but when the server was
rebooted, it rebuilt the array and operated fine for approximately a month
(until today :().
Again the same drive is showing failure.
Almost like the drive is bad...
The fact that it failed twice suggested this, so I had the support company
check it out after the second failure to prove (hopefully) that it was the
drive, rather than say a controller issue, that was causing the grief.
Post by Dave Warren
Post by TheScullster
OK so this time I will suggest that our IT support company (I don't do
server work) replace the failed drive rather than allowing a rebuild.
But what else can cause this, and how can I build resilience against this
type of failure?
To be entirely honest, I'd be a little... annoyed... if a drive reported
as failing wasn't immediately removed from service and replaced in the
first place.
Maybe that's just me.
No, I agree wholeheartedly.
The problem was that as soon as the server was rebooted (after the first
failure), the controller started a rebuild back onto the dodgy drive.
It rebuilt without errors, and so it was considered OK to proceed.
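
One way to stop a controller quietly rebuilding onto a suspect drive and
everyone shrugging is to have something read the Smart Array status regularly
and complain about anything that isn't OK. A rough Python sketch is below; it
assumes HP's hpacucli utility is installed and on the PATH, and the warning
keywords are a guess at the strings that appear in degraded states rather than
an exhaustive list.

"""Scan HP Smart Array status via hpacucli and flag anything not reporting OK.
Assumes hpacucli is installed and on PATH; the keyword list is a guess at the
strings shown in degraded states, not an exhaustive catalogue."""
import subprocess
import sys

WARNING_KEYWORDS = ("Failed", "Failure", "Rebuilding", "Degraded",
                    "Interim Recovery")


def check_arrays() -> int:
    """Return 0 if no worrying keywords appear in the controller report."""
    result = subprocess.run(
        ["hpacucli", "ctrl", "all", "show", "config"],
        capture_output=True, text=True
    )
    problems = [line.strip() for line in result.stdout.splitlines()
                if any(keyword in line for keyword in WARNING_KEYWORDS)]
    if problems:
        print("Array problems detected:")
        for line in problems:
            print("  " + line)
        return 1
    print("All arrays report OK.")
    return 0


if __name__ == "__main__":
    sys.exit(check_arrays())

Scheduled hourly, a non-zero exit code can feed whatever alerting is already in
place, so a failed or rebuilding drive gets looked at before the next reboot.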
Post by Dave Warren
Post by TheScullster
Fail-over server perhaps?
A fail-over server is certainly an option. RAID-6 would protect you from
any two simultaneous drive failures, and RAID-10 from two as long as they
aren't in the same mirrored pair, at the cost of needing at least four
drives. If you're in a situation where possibly-failing drives might be
put back into service, this might be worth the minor up-front cost.
It really depends on how sensitive to downtime you are; the odds of two
simultaneous failures are (IMO) fairly low. However, if the downtime of
restoring from backups or rebuilding will cost you more than a couple of
extra drives, the math is easy.
The migration to the new servers (we swapped the domain controller and
Exchange box at the same time) was troublesome, to say the least.
Since the bugs from that operation were resolved, downtime has been minimal.
Very occasional downtime is probably tolerable in our case, but repeated
system loss lasting half a day certainly isn't.

Thanks for your thoughts Dave


Phil
Dave Warren
2011-07-05 00:38:03 UTC
Post by TheScullster
Post by Dave Warren
It really depends on how sensitive to downtime you are; the odds of two
simultaneous failures are (IMO) fairly low. However, if the downtime of
restoring from backups or rebuilding will cost you more than a couple of
extra drives, the math is easy.
The migration to the new servers (we swapped the domain controller and
Exchange box at the same time) was troublesome, to say the least.
Since the bugs from that operation were resolved, downtime has been minimal.
Very occasional downtime is probably tolerable in our case, but repeated
system loss lasting half a day certainly isn't.
When I consider something like how "sensitive to downtime" you might be,
I'd treat 3-4 days of downtime as the likely consequence of a hardware
failure unless you have replacement parts for absolutely everything on
hand in some fashion (online, hot spares, or spare parts).

At least from my point of view, if you can't handle 3-4 days of downtime
you should probably have a completely redundant architecture of some sort
in place. If you don't have the budget for that, you should be prepared
for a hardware failure to take you down for 3-4 days, assuming it can take
1-2 days to get replacement hardware (and to get it working -- assume your
replacement hardware will fail too) plus 1-2 days to cover troubleshooting
and rebuilding time once the hardware issues are resolved.

I realize a lot of server administrators expect that hardware failures
can be resolved in 1-2 hours. They usually can. However, I prefer to
be prepared for the worst, either technically (redundancy) or
politically (user expectations).

No user that expects a 2-4 day recovery time screams when you fix it in
12 hours. Try it the other way around and see what happens, regardless
of how busy you look while you're fixing it.

That's my perspective anyway.
TheScullster
2011-07-06 08:57:40 UTC
Post by Dave Warren
When I consider something like how "sensitive to downtime" you might be,
I'd treat 3-4 days of downtime as the likely consequence of a hardware
failure unless you have replacement parts for absolutely everything on
hand in some fashion (online, hot spares, or spare parts).
At least from my point of view, if you can't handle 3-4 days of downtime
you should probably have a completely redundant architecture of some sort
in place. If you don't have the budget for that, you should be prepared
for a hardware failure to take you down for 3-4 days, assuming it can take
1-2 days to get replacement hardware (and to get it working -- assume your
replacement hardware will fail too) plus 1-2 days to cover troubleshooting
and rebuilding time once the hardware issues are resolved.
I realize a lot of server administrators expect that hardware failures
can be resolved in 1-2 hours. They usually can. However, I prefer to
be prepared for the worst, either technically (redundancy) or
politically (user expectations).
No user that expects a 2-4 day recovery time screams when you fix it in
12 hours. Try it the other way around and see what happens, regardless
of how busy you look while you're fixing it.
That's my perspective anyway.
Thanks Dave

I guess I've been lucky so far, in that whole-system outages have been
restricted to one day at most.
I do try to limit servers to a four-year life, but with the (lack of)
reliability of the HP servers recently installed, having new equipment
clearly isn't any guarantee of a stress-free life.

Phil
