Frequent 0x622 and 0x606 errors in the SP Event Logs

During a routine check of the SP event logs on our NS-40, I noticed a large number of alerts. Every few seconds, these three alerts appeared:

0x60a Internal Information Only. A logical unit has been enabled
0x622 Background Verify Aborted
0x606 Unit Shutdown for trespass
 

After a bit of investigation, I narrowed down the cause to several large LUNs that had just been added to a new ESX host.  It turns out that the LUNs were still running the background zeroing process, and that’s what was causing all the alerts in the SP Log. When you create a new LUN and the disks have been previously used for other LUNs, the new LUN needs to be “zeroed” (filled with all zeros to clear data). This takes place in the background and it is part of the LUN initialization.  Once this background zeroing process completed on my new LUNs the alert messages stopped.  I was unaware of that process, so I did a bit of research on it.

LUNs are immediately available for use after a bind (using “Fastbind”); however, the operations associated with a bind can take a long time to finish. The duration of a LUN bind depends on the following:

  • LUN’s bind time background verify priority (rate)
  • Size of the LUN being bound
  • Type of drives in the LUN’s RAID Group
  • Potential disabling of initial verify on bind
  • State of the Storage System (Idle or Load)
  • Position of the LUN on the hard disks of the RAID Group

From that list, priority, LUN size, drive type and verification selection all have the greatest effect on duration.  You can calculate the approximate duration of the bind process with this formula:

Time = Bound LUN Capacity / Bind Rate

Here are the Average Bind Rates for FC and SATA disks:

Disk Type   ASAP Bind Rate   High Bind Rate   Medium (default) Bind Rate   Low Bind Rate
FC          83 MB/s          7.54 MB/s        5.02 MB/s                    4.02 MB/s
SATA        61.7 MB/s        7.47 MB/s        5.09 MB/s                    3.78 MB/s

To calculate how many hours it would take to bind a 2000 GB LUN on a five-disk RAID 5 group composed of SATA drives set to the medium (default) bind rate, divide the capacity by the bind rate and convert the units:

Time = 2000 GB * (1024 MB / 1 GB) * (1 / 5.09 MB/s) * (1 hr / 3600 sec) = 111.77 hours
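
The same arithmetic can be wrapped in a small helper for trying other LUN sizes and priorities. This is only a sketch: the rates are copied from the table above, the function and dictionary names are my own, and real bind times will vary with system load and the other factors listed earlier.

```python
# Average bind rates in MB/s, copied from the table above.
BIND_RATES_MB_PER_S = {
    "FC": {"asap": 83.0, "high": 7.54, "medium": 5.02, "low": 4.02},
    "SATA": {"asap": 61.7, "high": 7.47, "medium": 5.09, "low": 3.78},
}


def estimate_bind_hours(capacity_gb, disk_type="SATA", priority="medium"):
    """Return the approximate bind duration in hours.

    Time = capacity / bind rate, with GB converted to MB (x 1024)
    and seconds converted to hours (/ 3600).
    """
    rate_mb_per_s = BIND_RATES_MB_PER_S[disk_type][priority]
    capacity_mb = capacity_gb * 1024
    seconds = capacity_mb / rate_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # The 2000 GB SATA example from above, at each priority.
    for priority in ("asap", "high", "medium", "low"):
        hours = estimate_bind_hours(2000, "SATA", priority)
        print(f"{priority:>6}: {hours:7.1f} hours")
```

At the medium rate this comes out to roughly 111.8 hours for the 2000 GB example, while the ASAP rate brings the same LUN down to under ten hours.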

EMC covers this topic in a detailed white paper called “The Effect of Priorities on LUN Management Operations”, which you can view here:  http://www.emc.com/collateral/hardware/white-papers/h4153-influence-priorities-emc-clariion-lun-wp.pdf.  That’s where I gathered the information above.

PowerPath commands in AIX causing unexpected errors / initialization errors

We recently had a problem with one of our AIX VIO servers not being able to run any PowerPath commands. Any attempt to run a command would result in an unexpected error or an initialization error. After speaking to EMC about it, we learned that the root cause is usually either running out of space on the root filesystem or having the data and stack ulimit parameters set too low after adding a large number of new LUNs. We are running AIX 6.1 on an IBM pSeries 550 with PowerPath 5.3 HF1.

Here are the errors that were popping up:

root@vioserver1:/script # powermt config
Unexpected error occured.

root@vioserver1:/script # powermt display dev=all
Initialization error.

root@vioserver1:/script # naviseccli -h <san_dns_name> lun -list -all
evp_enc.c(282): OpenSSL internal error, assertion failed: inl > 0
ksh: 503926 IOT/Abort trap(coredump)

In our case, having too many LUNs caused the issue: we had recently added 35 more, for a total of 70. Increasing the data and stack parameters to ‘unlimited’ resolved the problem.

root@vioserver1:/script # ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        unlimited
memory(kbytes)       unlimited
coredump(blocks)     2097151
nofiles(descriptors) 2000
threads(per process) unlimited
processes(per user)  unlimited
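
If you would rather check these two limits from a script than read through ulimit -a, something like the sketch below could work. It assumes Python is available on the host (it was not part of my original troubleshooting), it only reports the limits inherited by the Python process itself, and the function names are my own.

```python
import resource


def finite_limits(limits):
    """Return the names of limits whose soft value is not 'unlimited'.

    `limits` maps a name to a (soft, hard) tuple as returned by
    resource.getrlimit().
    """
    return [
        name
        for name, (soft, _hard) in limits.items()
        if soft != resource.RLIM_INFINITY
    ]


def current_limits():
    """Read the data and stack limits of the current process."""
    return {
        "data": resource.getrlimit(resource.RLIMIT_DATA),
        "stack": resource.getrlimit(resource.RLIMIT_STACK),
    }


if __name__ == "__main__":
    capped = finite_limits(current_limits())
    if capped:
        print("Not unlimited:", ", ".join(capped))
    else:
        print("data and stack are both unlimited")
```

Run it from the same shell and user that runs the powermt commands, since limits are inherited per session.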

Reporting on Soft media errors

 

Ah, soft media errors.  The silent killer.  We had an issue with one of our Clariion LUNs that had many uncorrectable sector errors.  Prior to the LUN failure, there were hundreds of soft media errors reported in the Navisphere logs.  Why weren’t we alerted about them?  Beats me.  I created my own script to pull and parse the alert logs so I can manually check for these types of errors.

What exactly is a soft media error?  A soft media error indicates that the SAN has identified a bad sector on a disk and is reconstructing the data from RAID parity in order to fulfill the read request.  It can be a sign of a failing disk.

To run a report that pulls only soft media errors from the SP log, put the following in a Windows batch file:

naviseccli -h <SP IP Address> getlog >textfile.txt

for /f "tokens=1,2,3,4,5,6,7,8,9,10,11,12,13,14" %%i in ('findstr Soft textfile.txt') do (echo %%i %%j %%k %%l %%m %%n %%o %%p %%q %%r %%s %%t %%u %%v)  >>textfile_mediaerrors.txt

The text file output looks like this:

10/25/2010 19:40:17 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:22 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:22 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:27 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:27 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:33 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:33 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:38 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:38 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:44 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:44 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:49 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5
10/25/2010 19:40:49 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5

If you see lots of soft media errors, do yourself a favor and open a case with EMC.  Too many can lead to the failure of one of your LUNs.
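
To get a feel for which disks are producing the errors, the filtered output can be tallied per enclosure and disk. The sketch below is not part of my batch script; it assumes the line layout shown in the sample output above (other FLARE versions or log sources may format the line differently), and the function name is my own.

```python
import re
from collections import Counter

# Matches lines shaped like the sample output, e.g.
# "10/25/2010 19:40:17 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5"
LINE_RE = re.compile(
    r"Enclosure\s+(\d+)\s+Disk\s+(\d+)\s+\(\w+\)\s+Soft Media Error"
)


def count_soft_media_errors(lines):
    """Count soft media errors per (enclosure, disk) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            enclosure, disk = int(match.group(1)), int(match.group(2))
            counts[(enclosure, disk)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "10/25/2010 19:40:17 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5",
        "10/25/2010 19:40:22 Enclosure 6 Disk 7 (820) Soft Media Error [0x00] 0 5",
    ]
    # Print the noisiest disks first.
    for (enclosure, disk), n in count_soft_media_errors(sample).most_common():
        print(f"Enclosure {enclosure} Disk {disk}: {n} soft media errors")
```

To run it against the real report, pass the open file instead of the sample list, for example count_soft_media_errors(open("textfile_mediaerrors.txt")).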

The script can be automated to run and send an email with daily alerts, if you so choose.  I just run it manually about once a week for review.