Discussion:
Mirror Drive Failure
TheScullster
2011-06-22 08:53:18 UTC
Hi all

We have an HP ML370 G6 domain controller, about a year old, running
Windows Server 2008 R2.
The OS is on a mirror, but it recently dropped one of the drives.
Basically the drive showed the orange failure light, but when the server was
rebooted, it rebuilt the array and operated fine for approximately a month
(until today :().

Again the same drive is showing failure.

OK so this time I will suggest that our IT support company (I don't do
server work) replace the failed drive rather than allowing a rebuild.
But what else can cause this, and how can I build resilience against this
type of failure?
Fail-over server perhaps?


Thanks

Phil
unknown
2011-06-29 04:25:28 UTC
Not good, hopefully you were fully backed up?
--
wert
TheScullster
2011-07-04 10:30:46 UTC
Post by unknown
Not good, hopefully you were fully backed up?
Yes, backups of backups!
I expected the server to fail over onto the hot spare drive and continue
operating as normal.
What actually happened was that the server stopped performing its DHCP
role and the network ground to a halt.

Once the server was rebooted and a replacement drive introduced, the server
rebuilt the OS array in the background and functioned normally.
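
For what it's worth, a cheap extra layer of protection against this particular
symptom is to watch the DHCP Server service from a second machine and raise an
alarm the moment it stops, rather than finding out when the network grinds to a
halt. Below is a rough Python sketch of that idea; the host name, mail server
and addresses are made-up placeholders, and it assumes the standard Windows
service name DHCPServer and that the sc tool is available where the script runs.

"""Poll the DHCP Server service on a remote Windows host and email an alert
if it is not running.  Host and mail settings below are hypothetical."""
import subprocess
import smtplib
import time
from email.message import EmailMessage

DC_HOST = r"\\DC01"            # hypothetical domain controller name
SERVICE = "DHCPServer"         # standard service name for the DHCP Server role
MAIL_FROM = "monitor@example.local"
MAIL_TO = "admin@example.local"
SMTP_HOST = "mail.example.local"
CHECK_INTERVAL = 300           # seconds between checks


def service_running() -> bool:
    """Return True if `sc query` reports the service as RUNNING."""
    result = subprocess.run(
        ["sc", DC_HOST, "query", SERVICE],
        capture_output=True, text=True
    )
    return "RUNNING" in result.stdout


def send_alert(body: str) -> None:
    """Send a plain-text alert email via the local relay."""
    msg = EmailMessage()
    msg["Subject"] = "DHCP Server service alert on " + DC_HOST
    msg["From"] = MAIL_FROM
    msg["To"] = MAIL_TO
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    while True:
        if not service_running():
            send_alert("DHCPServer is not reporting RUNNING; check the server.")
        time.sleep(CHECK_INTERVAL)

Run from a standby box, that at least turns "the network has ground to a halt"
into an email a few minutes after the service dies.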

Phil
Dave Warren
2011-06-29 05:22:14 UTC
Post by TheScullster
We have an HP ML370 G6 domain controller, about a year old, running
Windows Server 2008 R2.
The OS is on a mirror, but it recently dropped one of the drives.
Basically the drive showed the orange failure light, but when the server was
rebooted, it rebuilt the array and operated fine for approximately a month
(until today :().
Again the same drive is showing failure.
Almost like the drive is bad...
Post by TheScullster
OK so this time I will suggest that our IT support company (I don't do
server work) replace the failed drive rather than allowing a rebuild.
But what else can cause this, and how can I build resilience against this
type of failure?
To be entirely honest, I'd be a little... annoyed... if a drive reported
as failing wasn't immediately removed from service and replaced in the
first place.

Maybe that's just me.
Post by TheScullster
Fail-over server perhaps?
A fail-over server is certainly an option. RAID-6 would protect you from
any two simultaneous drive failures, and RAID-10 from two as long as they
aren't in the same mirrored pair, at the cost of needing at least four
drives. If you're in a situation where possibly-failing drives might be
put back into service, this might be worth the minor up-front cost.
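
To put rough numbers on that, here is a back-of-the-envelope Python sketch
comparing the chance of losing the array outright under a plain mirror, RAID-6
and RAID-10, treating drive failures within a rebuild window as independent.
The per-drive failure probability is an illustrative guess, not a measured
figure.

"""Back-of-the-envelope comparison of array-loss probability for RAID-1,
RAID-6 and RAID-10, assuming independent drive failures within a rebuild
window.  The failure probability below is purely illustrative."""
from math import comb

p = 0.03          # assumed chance a given drive fails during the window
n_raid6 = 4       # drives in the RAID-6 set

# RAID-1 (two-drive mirror): lost only if both drives fail.
raid1_loss = p ** 2

# RAID-6: survives any two failures, lost on three or more.
raid6_loss = sum(comb(n_raid6, k) * p**k * (1 - p)**(n_raid6 - k)
                 for k in range(3, n_raid6 + 1))

# RAID-10 on four drives (two mirrored pairs): lost if any pair loses both.
raid10_loss = 1 - (1 - p**2) ** 2

for name, prob in [("RAID-1", raid1_loss), ("RAID-6", raid6_loss),
                   ("RAID-10", raid10_loss)]:
    print(f"{name:8s} chance of array loss: {prob:.6%}")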

It really depends on how sensitive to downtime you are; the odds of two
simultaneous failures are (IMO) fairly low. However, if the downtime of
restoring from backups or rebuilding will cost you more than a couple of
extra drives, the math is easy.
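
And since the "easy math" is worth writing down at least once, here is a toy
Python version of it, with purely hypothetical figures:

"""Toy downtime-versus-hardware cost comparison.  All figures are hypothetical."""
downtime_hours = 12            # assumed time to restore/rebuild after a double failure
cost_per_hour = 500.0          # assumed cost of the outage per hour
extra_drive_cost = 300.0       # assumed price of one additional drive
extra_drives_needed = 2        # e.g. going from RAID-1 to RAID-6/RAID-10

downtime_cost = downtime_hours * cost_per_hour
hardware_cost = extra_drive_cost * extra_drives_needed

print(f"Estimated downtime cost : {downtime_cost:10.2f}")
print(f"Cost of extra drives    : {hardware_cost:10.2f}")
if downtime_cost > hardware_cost:
    print("Extra drives pay for themselves on the first incident.")
else:
    print("Downtime is cheaper than the extra hardware in this scenario.")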
TheScullster
2011-07-04 10:39:15 UTC
Post by Dave Warren
Post by TheScullster
We have an HP ML370 G6 domain controller, about a year old, running
Windows Server 2008 R2.
The OS is on a mirror, but it recently dropped one of the drives.
Basically the drive showed the orange failure light, but when the server was
rebooted, it rebuilt the array and operated fine for approximately a month
(until today :().
Again the same drive is showing failure.
Almost like the drive is bad...
The fact that it failed twice suggested this, so I had the support company
check it out after the second failure to prove (hopefully) that it was the
drive, rather than say a controller issue, that was causing the grief.
Post by Dave Warren
Post by TheScullster
OK so this time I will suggest that our IT support company (I don't do
server work) replace the failed drive rather than allowing a rebuild.
But what else can cause this, and how can I build resilience against this
type of failure?
To be entirely honest, I'd be a little... annoyed... if a drive reported
as failing wasn't immediately removed from service and replaced in the
first place.
Maybe that's just me.
No, I agree wholeheartedly.
The problem was that as soon as the server was rebooted (after the first
failure), the controller started a rebuild back onto the dodgy drive.
It rebuilt without errors, and so it was considered OK to proceed.
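
One way to stop a controller quietly rebuilding onto a suspect drive and
everyone shrugging is to have something read the Smart Array status regularly
and complain about anything that isn't OK. A rough Python sketch is below; it
assumes HP's hpacucli utility is installed and on the PATH, and the warning
keywords are a guess at the strings that appear in degraded states rather than
an exhaustive list.

"""Scan HP Smart Array status via hpacucli and flag anything not reporting OK.
Assumes hpacucli is installed and on PATH; the keyword list is a guess at the
strings shown in degraded states, not an exhaustive catalogue."""
import subprocess
import sys

WARNING_KEYWORDS = ("Failed", "Failure", "Rebuilding", "Degraded",
                    "Interim Recovery")


def check_arrays() -> int:
    """Return 0 if no worrying keywords appear in the controller report."""
    result = subprocess.run(
        ["hpacucli", "ctrl", "all", "show", "config"],
        capture_output=True, text=True
    )
    problems = [line.strip() for line in result.stdout.splitlines()
                if any(keyword in line for keyword in WARNING_KEYWORDS)]
    if problems:
        print("Array problems detected:")
        for line in problems:
            print("  " + line)
        return 1
    print("All arrays report OK.")
    return 0


if __name__ == "__main__":
    sys.exit(check_arrays())

Scheduled hourly, a non-zero exit code can feed whatever alerting is already in
place, so a failed or rebuilding drive gets looked at before the next reboot.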
Post by Dave Warren
Post by TheScullster
Fail-over server perhaps?
A fail-over server is certainly an option. RAID-6 would protect you from
any two simultaneous drive failures, and RAID-10 from two as long as they
aren't in the same mirrored pair, at the cost of needing at least four
drives. If you're in a situation where possibly-failing drives might be
put back into service, this might be worth the minor up-front cost.
It really depends on how sensitive to downtime you are; the odds of two
simultaneous failures are (IMO) fairly low. However, if the downtime of
restoring from backups or rebuilding will cost you more than a couple of
extra drives, the math is easy.
The migration to the new servers (we swapped the domain controller and
Exchange box at the same time) was troublesome, to say the least.
Since the bugs from that operation were resolved, downtime has been minimal.
Very occasional downtime is probably tolerable in our case, but repeated
system loss lasting half a day certainly isn't.

Thanks for your thoughts Dave


Phil
Dave Warren
2011-07-05 00:38:03 UTC
Post by TheScullster
Post by Dave Warren
It really depends on how sensitive to downtime you are; the odds of two
simultaneous failures are (IMO) fairly low. However, if the downtime of
restoring from backups or rebuilding will cost you more than a couple of
extra drives, the math is easy.
The migration to the new servers (we swapped the domain controller and
Exchange box at the same time) was troublesome, to say the least.
Since the bugs from that operation were resolved, downtime has been minimal.
Very occasional downtime is probably tolerable in our case, but repeated
system loss lasting half a day certainly isn't.
When I consider something like how "sensitive to downtime" you might be,
I'd treat 3-4 days of downtime as the likely consequence of a hardware
failure unless you have replacement parts for absolutely everything on
hand in some fashion (online, hot spares, or spare parts).

At least from my point of view, if you can't handle 3-4 days of downtime
you should probably have a completely redundant architecture of some sort
in place. If you don't have the budget for that, you should be prepared
for a hardware failure to take you down for 3-4 days, assuming it can take
1-2 days to get replacement hardware (and to get it working -- assume your
replacement hardware will fail too) plus 1-2 days to cover troubleshooting
and rebuilding time once the hardware issues are resolved.

I realize a lot of server administrators expect that hardware failures
can be resolved in 1-2 hours. They usually can. However, I prefer to
be prepared for the worst, either technically (redundancy) or
politically (user expectations).

No user that expects a 2-4 day recovery time screams when you fix it in
12 hours. Try it the other way around and see what happens, regardless
of how busy you look while you're fixing it.

That's my perspective anyway.
TheScullster
2011-07-06 08:57:40 UTC
Post by Dave Warren
When I consider something like how "sensitive to downtime" you might be,
I'd treat 3-4 days of downtime as the likely consequence of a hardware
failure unless you have replacement parts for absolutely everything on
hand in some fashion (online, hot spares, or spare parts).
At least from my point of view, if you can't handle 3-4 days of downtime
you should probably have a completely redundant architecture of some sort
in place. If you don't have the budget for that, you should be prepared
for a hardware failure to take you down for 3-4 days, assuming it can take
1-2 days to get replacement hardware (and to get it working -- assume your
replacement hardware will fail too) plus 1-2 days to cover troubleshooting
and rebuilding time once the hardware issues are resolved.
I realize a lot of server administrators expect that hardware failures
can be resolved in 1-2 hours. They usually can. However, I prefer to
be prepared for the worst, either technically (redundancy) or
politically (user expectations).
No user that expects a 2-4 day recovery time screams when you fix it in
12 hours. Try it the other way around and see what happens, regardless
of how busy you look while you're fixing it.
That's my perspective anyway.
Thanks Dave

I guess I've been lucky so far, in that whole-system outages have been
restricted to one day at most.
I do try to limit servers to a four-year life, but with the (lack of)
reliability of the HP servers recently installed, having new equipment
clearly isn't any guarantee of a stress-free life.

Phil
