Reporting on Soft media errors

 

Ah, soft media errors.  The silent killer.  We had an issue with one of our Clariion LUNs that had many uncorrectable sector errors.  Prior to the LUN failure, there were hundreds of soft media errors reported in the navisphere logs.  Why weren’t we alerted about them?  Beats me.  I created my own script to pull and parse the alert logs so I can manually check for these type of errors.

What exactly is a soft media error?  Soft Media errors indicate that the SAN has identified a bad sector on the disk and is reconstructing the data from RAID parity data  in order to fulfill the read request.   It can indicate a failing disk.

To run a report that pulls only soft media errors from the SP log, put the following in a windows batch file:

naviseccli -h <SP IP Address> getlog >textfile.txt

for /f "tokens=1,2,3,4,5,6,7,8,9,10,11,12,13,14" %%i in ('findstr Soft textfile.txt') do (echo %%i %%j %%k %%l %%m %%n %%o %%p %%q %%r %%s %%t %%u %%v)  >>textfile_mediaerrors.txt

The text file output looks like this:

10/25/2010 19:40:17 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:22 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:22 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:27 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:27 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:33 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:33 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:38 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:38 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:44 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:44 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:49 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
 10/25/2010 19:40:49 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5

If you see lots of soft media errors, do yourself a favor and open a case with EMC.  Too many can lead to the failure of one of your LUNs.

The script can be automated to run and send an email with daily alerts, if you so choose.  I just run it manually about once a week for review.

Advertisements

Leave a Reply