Frequent 0x622 and 0x606 errors in the SP Event Logs

While routinely checking the SP Event Logs on our NS-40, I noticed a large number of alerts. Every few seconds, these three alerts appeared:

0x60a Internal Information Only. A logical unit has been enabled
0x622 Background Verify Aborted
0x606 Unit Shutdown for trespass
 

After a bit of investigation, I narrowed down the cause to several large LUNs that had just been added to a new ESX host.  It turns out that the LUNs were still running the background zeroing process, and that’s what was causing all the alerts in the SP Log. When you create a new LUN and the disks have been previously used for other LUNs, the new LUN needs to be “zeroed” (filled with all zeros to clear data). This takes place in the background and it is part of the LUN initialization.  Once this background zeroing process completed on my new LUNs the alert messages stopped.  I was unaware of that process, so I did a bit of research on it.

LUNs are immediately available for use after a bind (using “Fastbind”); however, all the operations associated with a bind can take a long time to finish.  The duration of a LUN bind depends on these factors:

  • LUN’s bind time background verify priority (rate)
  • Size of the LUN being bound
  • Type of drives in the LUN’s RAID Group
  • Potential disabling of initial verify on bind
  • State of the Storage System (Idle or Load)
  • Position of the LUN on the hard disks of the RAID Group

From that list, priority, LUN size, drive type and verification selection have the greatest effect on duration.  You can calculate the approximate duration of the bind process with this formula:

Time = Bound LUN Capacity / Bind Rate

Here are the Average Bind Rates for FC and SATA disks:

Disk Type | ASAP Bind Rate | High Bind Rate | Medium (default) Bind Rate | Low Bind Rate
FC        | 83 MB/s        | 7.54 MB/s      | 5.02 MB/s                  | 4.02 MB/s
SATA      | 61.7 MB/s      | 7.47 MB/s      | 5.09 MB/s                  | 3.78 MB/s

If we were to calculate how many hours it would take to bind a 2000GB LUN on a five disk RAID5 group composed of SATA drives set to a medium (default) bind rate, here’s what the formula would look like:

Time = 2000 GB * (1024 MB / 1 GB) * (1 sec / 5.09 MB) * (1 hr / 3600 sec) = 111.77 hours.
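The same arithmetic can be wrapped in a small bash helper. This is only a sketch: bind_hours is a made-up function name rather than part of any EMC tool, and the rate you pass in is whichever average from the table applies to your drive type and priority.

```shell
#!/bin/bash
# bind_hours is a hypothetical helper name, not an EMC command.
# Hours = capacity (GB) * 1024 (MB per GB) / bind rate (MB/s) / 3600 (sec per hour)
bind_hours() {
    local capacity_gb=$1   # bound LUN capacity in GB
    local rate_mbs=$2      # average bind rate in MB/s, from the table
    # bash has no floating point arithmetic, so awk does the math
    awk -v gb="$capacity_gb" -v rate="$rate_mbs" \
        'BEGIN { printf "%.2f\n", gb * 1024 / rate / 3600 }'
}

# The 2000 GB SATA LUN at the medium (default) rate
bind_hours 2000 5.09
# The same LUN at the ASAP rate, for comparison
bind_hours 2000 61.7
```

Comparing the two results shows how much the bind priority alone changes the estimate for the same LUN.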

There is a detailed white paper that covers this topic from EMC called “The Effect of Priorities on LUN Management Operations” that you can view here:  http://www.emc.com/collateral/hardware/white-papers/h4153-influence-priorities-emc-clariion-lun-wp.pdf.  That’s where I gathered the information above.

Automating VNX Storage Processor Percent Utilization Alerts

Note:  The original post describes a method that requires EMC Control Center and Performance Manager.  EMC has since deprecated that tool in favor of ViPR SRM.  There is still a way to gather CPU information for use in bash scripts. I don’t have script examples that use this command, but if anyone needs help, leave me a comment. The Navisphere CLI command to get busy/idle ticks for the storage processors is naviseccli -h <SP_IP_address> getcontrol -cbt.

The output looks like this:

Controller busy ticks: 1639432
Controller idle ticks: 1773844

The SP utilization statistics in the output are an average of the utilization across all the cores of the SP’s processors since the last reset. Getting the actual point-in-time SP CPU utilization from this output requires a calculation. You need to poll twice, compute a delta for each counter by subtracting the earlier value from the later, and apply this formula to the deltas:

Utilization = Busy Ticks / (Busy Ticks + Idle Ticks)
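A minimal sketch of that calculation in bash follows. The function names get_ticks and sp_utilization are invented for illustration, and the second poll is made-up sample data. The parsing assumes the two-line output format shown earlier.

```shell
#!/bin/bash
# get_ticks <poll_output> <busy|idle>: pull one counter out of a getcontrol -cbt poll.
# (Hypothetical helper; assumes lines like "Controller busy ticks: 1639432".)
get_ticks() {
    printf '%s\n' "$1" |
        awk -F': *' -v kind="$2" '$1 == ("Controller " kind " ticks") { print $2 }'
}

# sp_utilization <earlier_poll> <later_poll>: percent busy between the two polls.
sp_utilization() {
    local busy=$(( $(get_ticks "$2" busy) - $(get_ticks "$1" busy) ))
    local idle=$(( $(get_ticks "$2" idle) - $(get_ticks "$1" idle) ))
    awk -v b="$busy" -v i="$idle" \
        'BEGIN { if (b + i > 0) printf "%.1f\n", 100 * b / (b + i); else print "n/a" }'
}

# First poll is the output shown earlier; the second poll is invented sample data.
poll1='Controller busy ticks: 1639432
Controller idle ticks: 1773844'
poll2='Controller busy ticks: 1639732
Controller idle ticks: 1774544'

sp_utilization "$poll1" "$poll2"
```

Against a live array, each poll would instead be captured from the naviseccli getcontrol -cbt command with a sleep between the two calls. The polling interval is your choice, and a longer interval smooths the result over more time.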

What follows is the original method I posted that requires EMC Control Center.

I was tasked with coming up with a way to get email alerts whenever our SP utilization breaks a certain threshold.  Since none of the monitoring tools that we own will do that right now, I had to come up with a way using custom scripts.  This is my second post on the same subject; I removed my post from yesterday because it didn’t work as I intended.  This time I used EMC’s Performance Manager rather than pulling data from the SP with the Navisphere CLI.

First, I’m running all of my bash scripts on a Windows server using cygwin.  They should run fine on any Linux box as well.  Because I don’t have a native sendmail configuration set up on the Windows server, I’m using the control station on the Celerra to compare the utilization numbers in the text files and email out an alert.  The Celerra control station automatically pulls the files via FTP from the Windows server every 30 minutes and sends out an email alert if the numbers cross the threshold.  A description of each script and the schedule is below.

Windows Server:

Export.cmd:

This first Windows batch script runs an export (with pmcli) from EMC Performance Manager that dumps all the performance stats for the current day.

For /f "tokens=2-4 delims=/ " %%a in ('date /t') do (set date=%%c%%a%%b)

C:\ECC\Client.610\PerformanceManager\pmcli.exe -export -out c:\cygwin\home\scripts\sputil\0999_interval.csv -type interval -class clariion -date %date% -id APM00400500999

Data.cmd:

This cygwin/bash script manipulates the file exported above and ultimately creates two text files (one for SPA and one for SPB), each containing a single numerical value: the most recent SP utilization.  A few extra steps at the beginning of the script are irrelevant to the SP utilization; they’re there for other purposes.

#This will pull only the timestamp line from the top

grep -m 1 "/" /home/scripts/sputil/0999_interval.csv > /home/scripts/sputil/timestamp.csv

# This will pull out only the "disk utilization" line.

grep -i "^% Utilization" /home/scripts/sputil/0999_interval.csv >> /home/scripts/sputil/stats.csv

# This will pull out the disk/LUN title info for the first column

grep -i "Data Collected for DiskStats -" /home/scripts/sputil/0999_interval.csv > /home/scripts/sputil/diskstats.csv

grep -i "Data Collected for LUNStats -" /home/scripts/sputil/0999_interval.csv > /home/scripts/sputil/lunstats.csv

# This will create a column with the disk/LUN number

cat /home/scripts/sputil/diskstats.csv /home/scripts/sputil/lunstats.csv > /home/scripts/sputil/data.csv

# This combines the disk/LUN column with the data column

paste /home/scripts/sputil/data.csv /home/scripts/sputil/stats.csv > /home/scripts/sputil/combined.csv

cp /home/scripts/sputil/combined.csv /home/scripts/sputil/utilstats.csv
 

#  This removes all the temporary files
rm /home/scripts/sputil/timestamp.csv
rm /home/scripts/sputil/stats.csv
rm /home/scripts/sputil/diskstats.csv
rm /home/scripts/sputil/lunstats.csv
rm /home/scripts/sputil/data.csv
rm /home/scripts/sputil/combined.csv

# This next line strips the file of all but the last two rows, which are SP Utilization.

# The 1 looks at the first character in the row, the D specifies "starts with D", then deletes rows meeting those conditions.

awk -v FS="" -v OFS="" '$1 != "D"' < /home/scripts/sputil/utilstats.csv > /home/scripts/sputil/sputil.csv

# This pulls the most recent value from each row, which is the next-to-last comma-separated field, $(NF-1).

awk -F, '{print $(NF-1)}' < /home/scripts/sputil/sputil.csv > /home/scripts/sputil/sp_util.csv

#pull 1st line (SPA) into separate file

sed -n 1,1p < /home/scripts/sputil/sp_util.csv > /home/scripts/sputil/spAutil.txt

#pull 2nd line (SPB) into separate file

sed -n 2,2p < /home/scripts/sputil/sp_util.csv > /home/scripts/sputil/spButil.txt

#The spAutil.txt/spButil.txt files now contain only a single numerical value, which would be the most recent %utilization from the Control Center/Performance Manager dump file.

#Copy files to web server root directory

cp /home/scripts/sputil/*.txt /cygdrive/c/inetpub/wwwroot
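
To make the last few steps of the script easier to follow, here is a sketch that runs them against made-up rows. The sample layout is an assumption based on how the script builds utilstats.csv; the real pmcli export may differ. It uses grep -v '^D' in place of the awk filter, since awk's behavior with an empty FS is not defined by POSIX and varies between implementations.

```shell
#!/bin/bash
# Made-up stand-in for utilstats.csv (assumed layout, not real pmcli output).
# Disk/LUN rows start with "Data Collected"; the last two rows (SPA, SPB)
# have no title column, so paste leaves them starting with a tab.
sample=$(printf '%s\t%s\n' \
    'Data Collected for DiskStats - Disk 0' '% Utilization,10.1,12.3,' \
    '' '% Utilization,41.2,47.8,' \
    '' '% Utilization,38.0,52.6,')

# Drop rows that start with "D", then print the next-to-last comma-separated
# field of each remaining row (the last field is empty because of the trailing comma).
latest=$(printf '%s\n' "$sample" | grep -v '^D' | awk -F, '{ print $(NF-1) }')

spa=$(printf '%s\n' "$latest" | sed -n 1p)   # first remaining row: SPA
spb=$(printf '%s\n' "$latest" | sed -n 2p)   # second remaining row: SPB
echo "SPA=$spa SPB=$spb"
```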

Celerra Control Station:

CelerraArray:/home/nasadmin/sputil/ftpsp.sh

The script below connects to the Windows server and grabs the current SP utilization text files via FTP every 30 minutes (via a cron job).

#!/bin/bash
cd /home/nasadmin/sputil
ftp windows_server.domain.net <<SCRIPT
get spAutil.txt
get spButil.txt
quit
SCRIPT

CelerraArray:/home/nasadmin/sputil/spcheck.sh:

This script checks whether the SP utilization is over our threshold. If it is, it sends an email alert that includes the % utilization number in the subject line. To change the threshold, edit the THRESHOLD=<XX> line in the script.  The printf line rounds the floating point value to an integer, because bash’s test operators only compare integers.  As written, the script checks SPB only.

#!/bin/bash

# Read the most recent SPB utilization and round it to an integer,
# since bash's test operators only compare integers.
SPB=$(printf "%.0f" "$(cat /home/nasadmin/sputil/spButil.txt)")

echo "$SPB"
THRESHOLD=50
if [ "$SPB" -eq 0 ] && [ "$THRESHOLD" -eq 0 ]
then
        echo "Both are zero"
elif [ "$SPB" -eq "$THRESHOLD" ]
then
        echo "Both values are equal"
elif [ "$SPB" -gt "$THRESHOLD" ]
then
        echo "SPB is greater than the threshold.  Sending alert"

        # uuencode takes the input file, then the name to give the attachment
        uuencode /home/nasadmin/sputil/spButil.txt spButil.txt | mail -s "<array_name> SPB Utilization Alert: $SPB % above threshold of $THRESHOLD %" notify@domain.com
else
        echo "$SPB is less than $THRESHOLD"
fi
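
If you want to check both SPs with the same logic, the comparison can be pulled into a function. The sketch below is an illustration rather than the script I run: over_threshold and check_sp are hypothetical names, the values are passed in directly instead of being read from spAutil.txt and spButil.txt, and it only prints what it would do instead of sending mail.

```shell
#!/bin/bash
THRESHOLD=50

# over_threshold <value>: succeed when the rounded value is above THRESHOLD.
# (Hypothetical helper, not part of the original script.)
over_threshold() {
    local rounded
    rounded=$(printf "%.0f" "$1")   # round the float so [ ] can compare it
    [ "$rounded" -gt "$THRESHOLD" ]
}

# check_sp <name> <value>: report whether an alert would be sent for one SP.
check_sp() {
    if over_threshold "$2"; then
        echo "$1 ALERT"
    else
        echo "$1 ok"
    fi
}

# Sample values; in practice these would come from the two text files.
check_sp SPA 47.8
check_sp SPB 52.6
```

Replacing the echo in the alert branch with the uuencode and mail line would turn this into a working alert for both SPs.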

CelerraArray Crontab schedule:

Run “crontab -e” to edit the schedule.  I’ve got the alert script set to run at the top of the hour and at half past the hour, and the FTP script pulls the updated SP files from the web server two minutes before each run.

[nasadmin@CelerraArray sputil]$ crontab -l
58,28 * * * * /home/nasadmin/sputil/ftpsp.sh
0,30 * * * * /home/nasadmin/sputil/spcheck.sh

Overall Scheduling:

Windows Server:

Performance Manager Dump runs 15 minutes past the hour (exports data)
Data script runs at 20 minutes past the hour (processes data to get SP Utilization)

Celerra Server:

FTP script pulls new SP utilization text files at 28 minutes past the hour
Alert script runs at 30 minutes past the hour

The cycle then repeats at minute 45, minute 50, minute 58, and minute 0.