Tag Archives: soft

Problem with soft media errors on SSD drives and FastCache

4/25/2012 Update:  EMC has released a fix for this issue.  Call your account service representative and say you need to upgrade your NS-960 dart to 6.0.55.300 and flare to 4.30.000.5.524 plus drive firmware upgrade on all SSD drives to TC3Q.

Do you have FastCache enabled on your array?  Keep a close eye on your SP event logs for soft media errors on your SSD drives.  I just noticed over 2000 soft media errors on one of my FastCache enabled arrays, and found a technical advisory from EMC (emc282741) that desribes this as a potentially critical problem.  I just opened a case with EMC for my array to be reviewed for a possible disk replacement.  In the event a second disk drive in the same FastCache RAID group encounters soft media errors before the system automatically retires the first drive a dual faulted RAID Group could occur.  This can result in storage pools going offline and becoming completely inaccessible to the attached hosts.  That’s basically a total SAN outage, not good.

Look for errors like the following in your SP event logs:

“Date Stamp”  “Time Stamp” Bus1 Enc1 Dsk0  820 Soft Media Error [Bad block]

EMC states in emc282741 that enhancements are targeted for Q1 2012 to address SSD media errors and dual hardware faults, but in the meantime, make sure you review the SP logs if you have CLARiiON or VNX arrays that are configured with SSD disk drives or are using FAST Cache.  If any instance of the “Soft Media Error” listed above is associated with any one of the solid state disk drives in your arrays, the array should be upgraded to at least FLARE Release 04.30.000.5.522 (for CX4 Series arrays) or Release 05.31.000.5.509 (for VNX Series arrays) and then start a Proactive Copy (PACO) to a hot spare and replace the drive as soon as possible.

In order to quickly review this on each of my arrays, I wrote the following script to update my intranet site with a report every morning:

naviseccli -h clariion1a getlog >clariion1a.txt
naviseccli -h clariion1b getlog >clariion1b.txt  
cat clariion1a.txt | grep -i ‘soft media’ >clariion1_softmedia_errors.csv
cat clariion1b.txt | grep -i ‘soft media’ >>clariion1_softmedia_errors.csv
./csv2htm.pl -e -T -i /home/scripts/clariion1_softmedia_errors.csv -o /<intranet_web_server>/clariion1_softmedia_errors.html
 

The script dumps the entire SP log from each SP into a text file, greps for only soft media errors in each file, then converts the output to HTML and writes it to my intranet web server.

 
Advertisements

Reporting on Soft media errors

 

Ah, soft media errors.  The silent killer.  We had an issue with one of our Clariion LUNs that had many uncorrectable sector errors.  Prior to the LUN failure, there were hundreds of soft media errors reported in the navisphere logs.  Why weren’t we alerted about them?  Beats me.  I created my own script to pull and parse the alert logs so I can manually check for these type of errors.

What exactly is a soft media error?  Soft Media errors indicate that the SAN has identified a bad sector on the disk and is reconstructing the data from RAID parity data  in order to fulfill the read request.   It can indicate a failing disk.

To run a report that pulls only soft media errors from the SP log, put the following in a windows batch file:

naviseccli -h <SP IP Address> getlog >textfile.txt

for /f "tokens=1,2,3,4,5,6,7,8,9,10,11,12,13,14" %%i in ('findstr Soft textfile.txt') do (echo %%i %%j %%k %%l %%m %%n %%o %%p %%q %%r %%s %%t %%u %%v)  >>textfile_mediaerrors.txt

The text file output looks like this:

10/25/2010 19:40:17 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:22 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:22 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:27 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:27 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:33 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:33 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:38 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:38 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:44 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:44 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:49 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:49 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5

If you see lots of soft media errors, do yourself a favor and open a case with EMC.  Too many can lead to the failure of one of your LUNs.

The script can be automated to run and send an email with daily alerts, if you so choose.  I just run it manually about once a week for review.