Problem with soft media errors on SSD drives and FastCache

4/25/2012 Update:  EMC has released a fix for this issue.  Call your account service representative and say you need to upgrade your NS-960 dart to 6.0.55.300 and flare to 4.30.000.5.524 plus drive firmware upgrade on all SSD drives to TC3Q.

Do you have FastCache enabled on your array?  Keep a close eye on your SP event logs for soft media errors on your SSD drives.  I just noticed over 2000 soft media errors on one of my FastCache enabled arrays, and found a technical advisory from EMC (emc282741) that desribes this as a potentially critical problem.  I just opened a case with EMC for my array to be reviewed for a possible disk replacement.  In the event a second disk drive in the same FastCache RAID group encounters soft media errors before the system automatically retires the first drive a dual faulted RAID Group could occur.  This can result in storage pools going offline and becoming completely inaccessible to the attached hosts.  That’s basically a total SAN outage, not good.

Look for errors like the following in your SP event logs:

“Date Stamp”  “Time Stamp” Bus1 Enc1 Dsk0  820 Soft Media Error [Bad block]

EMC states in emc282741 that enhancements are targeted for Q1 2012 to address SSD media errors and dual hardware faults, but in the meantime, make sure you review the SP logs if you have CLARiiON or VNX arrays that are configured with SSD disk drives or are using FAST Cache.  If any instance of the “Soft Media Error” listed above is associated with any one of the solid state disk drives in your arrays, the array should be upgraded to at least FLARE Release 04.30.000.5.522 (for CX4 Series arrays) or Release 05.31.000.5.509 (for VNX Series arrays) and then start a Proactive Copy (PACO) to a hot spare and replace the drive as soon as possible.

In order to quickly review this on each of my arrays, I wrote the following script to update my intranet site with a report every morning:

naviseccli -h clariion1a getlog >clariion1a.txt
naviseccli -h clariion1b getlog >clariion1b.txt  
cat clariion1a.txt | grep -i ‘soft media’ >clariion1_softmedia_errors.csv
cat clariion1b.txt | grep -i ‘soft media’ >>clariion1_softmedia_errors.csv
./csv2htm.pl -e -T -i /home/scripts/clariion1_softmedia_errors.csv -o /<intranet_web_server>/clariion1_softmedia_errors.html
 

The script dumps the entire SP log from each SP into a text file, greps for only soft media errors in each file, then converts the output to HTML and writes it to my intranet web server.

 
Advertisements

2 thoughts on “Problem with soft media errors on SSD drives and FastCache”

Leave a Reply