Scripting automatic reports for Isilon from the CLI

We recently set up a virtual demo of an Isilon system on our network as we are evaluating Isilon for a possible purchase.  You can obtain a virtual node that runs on ESX from your local EMC Isilon representative, along with temporary licenses to test everything out.  As part of the test I wanted to see if it was possible to create custom CLI scripts in the same way that I create them on the Celerra or VNX File.  On the Celerra, I run daily scripts that output file pool sizes, file systems & disk space, failover status, logs, health check info, checkpoint info, etc. to my web report page.  Can you do the same thing on an Isilon?

Well, to start with, Isilon’s commands are completely different.  The first step of course was to look at what was available to me.  All of the Isilon administration commands appear to begin with ‘isi’.  If you type isi by itself it will show you the basic list of commands and what they do:

isilon01-1% isi
Description:
OneFS cluster administration.
Usage:
isi  <subcommand>
[--timeout <integer>]
[{--help | -h}]
Subcommands:
Cluster Monitoring:
alert*           An alias for "isi events".
audit            Manage audit configuration.
events*          Manage cluster events.
perfstat*        View cluster performance statistics.
stat*            An alias for "isi status".
statistics*      View general cluster statistics.
status*          View cluster status.
Cluster Configuration:
config*          Manage general cluster settings.
email*           Configure email settings.
job              Isilon job management commands.
license*         Manage software licenses.
networks*        Manage network settings.
services*        Manage cluster services.
update*          Update OneFS system software.
pkg*             Manage OneFS system software patches.
version*         View system version information.
remotesupport    Manage remote support settings.
Hardware & Devices:
batterystatus*   View battery status.
devices*         Manage cluster hardware devices.
fc*              Manage Fibre Channel settings.
firmware*        Manage system firmware.
lun*             Manage iSCSI logical units (LUNs).
target*          Manage iSCSI targets.
readonly*        Toggle node read-write state.
servicelight*    Toggle node service light.
tape*            Manage tape and media changer devices.
File System Configuration:
get*             View file system object properties.
set*             Manage file system object settings.
quota            Manage SmartQuotas, notifications and reports.
smartlock*       Manage SmartLock settings.
domain*          Manage file system domains.
worm*            Manage SmartLock WORM settings.
dedupe           Manage Dedupe settings.
Access Management:
auth             Manage authentication, identities and role-based access.
zone             Manage access zones.
Data Protection:
avscan*          Manage antivirus service settings.
ndmp*            Manage backup (NDMP) settings.
snapshot         Manage SnapshotIQ file system snapshots and settings.
sync             SyncIQ management interface.
Protocols:
ftp*             Manage FTP settings.
hdfs*            Manage HDFS settings.
iscsi*           Manage iSCSI settings.
nfs              Manage NFS exports and protocol settings.
smb              Manage SMB shares and protocol settings.
snmp*            Manage SNMP settings.
Utilities:
exttools*        External tools.
Other Subcommands:
filepool         Manage filepools on the cluster.
storagepool      Configure and monitor storage pools
Options:
Display Options:
--timeout <integer>
Number of seconds for a command timeout.
--help | -h
Display help for this command.

Note that subcommands or actions that are marked with an asterisk (*) require root login.  If you log in with a normal admin account you’ll get the following error message when you run the command:

isilon01-1% isi status
Commands not enabled for role-based administration require root user access.

Isilon’s OneFS operating system is based on FreeBSD, whereas the Celerra/VNX File Control Station runs Linux.  Since it’s a Unix-based OS with console access, you can create shell scripts and cron job schedules just like on the Celerra/VNX File.

Because all of the commands I want to run require root access, I had to create the scripts while logged in as root.  Be careful doing this!  This is only a test for me; I would likely look for a workaround on a production system.  Knowing that I’d be FTPing the output files to my web report server, I started by creating the .netrc file in the /root folder.  This is where I store the default login and password for the FTP server.  Permissions must be changed for it to work, so run chmod 600 on the file after you create it.  It didn’t work for me at first as the syntax is different on FreeBSD than on Linux, so looking at my Celerra notes didn’t help.  (For Celerra/VNX File I used “machine <ftp_server_name> login <ftp_login_id> password <ftp_password>”.)

For Isilon, the correct syntax is this:

default login <ftp_username> password <ftp_password>
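A minimal sketch of creating that file with the right permissions from the start. The helper name write_netrc and its arguments are my own illustration, and the username and password you pass in are placeholders for your FTP account:

```shell
# Write a catch-all .netrc entry into the given directory (hypothetical helper).
# Usage: write_netrc <directory> <ftp_username> <ftp_password>
write_netrc() {
    netrc_file="$1/.netrc"
    # Use a restrictive umask in a subshell so the file is never
    # readable by other users, not even for a moment.
    (
        umask 077
        printf 'default login %s password %s\n' "$2" "$3" > "$netrc_file"
    )
    # ftp will not use a password from a .netrc that other users can read,
    # so enforce mode 600 even if the file already existed.
    chmod 600 "$netrc_file"
}
```

On the Isilon this would be called as `write_netrc /root <ftp_username> <ftp_password>`.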
 

The script that FTP’s the files would then look like this for Isilon:

HOST="10.1.1.1"
ftp $HOST <<SCRIPT
put /root/isilon_stats.txt
put /root/isilon_repl.txt
put /root/isilon_df.txt
put /root/isilon_dedupe.txt
put /root/isilon_perf.txt
SCRIPT
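One possible refinement, sketched here under my own assumptions, is to upload only the reports that actually exist and are not empty, so a failed report run doesn't overwrite a good file on the web server. The helper name build_put_commands is hypothetical:

```shell
# Print one ftp "put" command for each report file that exists and is not empty.
# (build_put_commands is a hypothetical helper, not part of the original script.)
build_put_commands() {
    for report in "$@"; do
        if [ -s "$report" ]; then
            echo "put $report"
        else
            # Warn on stderr so the message never ends up in the ftp command list.
            echo "Skipping missing or empty report: $report" >&2
        fi
    done
}
```

The output could then be piped into the same ftp client, for example `build_put_commands /root/isilon_*.txt | ftp $HOST`, which still relies on the .netrc file for the login.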
 

For this demo, I created a script that generates reports for file system utilization, deduplication status, performance stats, replication stats, and array status.  For the file system utilization report, I used two different methods as I wasn’t sure which I’d like better.  Using ‘df -h -a /ifs’ will get you similar information to ‘isi storagepool list’, but the output format is different.  I used cron to schedule the job directly on the Isilon.

Here is the reporting script:

TODAY=$(date)
HOST=$(hostname)
sleep 15
# Start a new report file with a date/host banner (overwrites the old file).
write_header() {
    echo "---------------------------------------------------------------------------------" > "$1"
    echo "Date: $TODAY  Host:  $HOST" >> "$1"
    echo "---------------------------------------------------------------------------------" >> "$1"
}
# Append a titled section with the output of one 'isi statistics' subcommand.
perf_section() {
    sleep 1
    echo "  " >> /root/isilon_perf.txt
    echo "--$1--" >> /root/isilon_perf.txt
    echo "  " >> /root/isilon_perf.txt
    /usr/bin/isi statistics "$2" >> /root/isilon_perf.txt
}
# Array status
write_header /root/isilon_stats.txt
/usr/bin/isi status >> /root/isilon_stats.txt
# Replication stats
write_header /root/isilon_repl.txt
/usr/bin/isi sync reports list >> /root/isilon_repl.txt
# File system utilization (both methods)
write_header /root/isilon_df.txt
df -h -a /ifs >> /root/isilon_df.txt
echo " " >> /root/isilon_df.txt
/usr/bin/isi storagepool list >> /root/isilon_df.txt
# Deduplication status
write_header /root/isilon_dedupe.txt
/usr/bin/isi dedupe stats >> /root/isilon_dedupe.txt
# Performance stats
write_header /root/isilon_perf.txt
perf_section "System Stats" system
perf_section "Client Stats" client
perf_section "Protocol Stats" protocol
perf_section "Protocol Data" pstat
perf_section "Drive Stats" drive
 

Once the output files are FTP’d to the web server, I have a basic HTML page that uses iframes to show the text files.  The web page is then automatically updated as soon as the new text files are FTP’d.  Below is a screenshot of my demo report page.  It doesn’t show the entire page, but you’ll get the idea.

[Screenshot: Isilon demo report page]
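A page like that could be generated with a few lines of shell. This is only a sketch: the function name make_report_page, the page title, and the iframe sizes are illustrative choices, not my actual report page.

```shell
# Print a simple HTML page with one iframe per report file name.
# (make_report_page and the layout are illustrative assumptions.)
make_report_page() {
    echo '<html><head><title>Isilon Reports</title></head><body>'
    for report in "$@"; do
        echo "<h3>$report</h3>"
        echo "<iframe src=\"$report\" width=\"100%\" height=\"400\"></iframe>"
    done
    echo '</body></html>'
}

# The file names match the reports uploaded by the FTP script above.
make_report_page isilon_stats.txt isilon_repl.txt isilon_df.txt isilon_dedupe.txt isilon_perf.txt
```

Because each iframe simply points at a text file, the page itself never needs to be regenerated when new reports arrive.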


Easy reporting on data domain using the autosupport log


I was looking for an easy (and free) way to do some daily reporting on our data domain hardware.  I was most interested in reporting on disk space, but decided to gather some other data as well.  The easiest method I found is to use the info collected in the autosupport log.  First, I’ll explain how to automatically gather the autosupport log from your data domain, and then move on to how to pull only the information you want from it for reporting purposes.

First, you need to enable ftp on the data domain and add the IP address of the FTP server you’re going to use to pull the file.  Here are the commands you need to run on your data domain:

adminaccess enable ftp
adminaccess add ftp 10.1.1.1 
 

Next, you’ll want to pull the files from the data domain.  I am using a windows FTP server to pull the autosupport files.  The windows batch script I use is scheduled to run once a day and is listed below.  I use the ftp command and call a text file with the specific IP and login info for each data domain box.

ftp -s:10.1.1.2.txt
ftp -s:10.1.1.3.txt
ftp -s:10.2.1.2.txt
ftp -s:10.2.1.3.txt
 

The text files that the ftp script calls (that are named <ipaddress.txt>) look like this:

open 10.1.1.2
sysadmin
<password>
cd support
lcd e:\reports\dd\SITE1_DD_01
get autosupport
quit
 

Once you’ve downloaded the autosupport file, you can run scripts against it to pull out only the info you’re interested in.  I prefer to use unix shell scripts for parsing files because of the extra functionality over standard windows batch scripts.  I have Cygwin installed on my web server to run these shell scripts.

If you take a look at the autosupport log, you’ll notice that it’s very long and repeats the same strings in multiple sections, which makes using grep, awk, and sed a bit more challenging.  It took a bit of trial and error, but the scripts below will pull only the specific areas I wanted:  System Alerts, File system compression, Locked files, Replication info, File distribution, and Disk space information.

The first script below will gather disk space information using grep.  Spacing inside the quotes is very important in this case, as I was looking for specific unique strings within the autosupport log, and using grep for these unique strings will produce the correct output.  In this example, I am gathering info for four data domain boxes.  For those unfamiliar with Cygwin, you can access windows drive letters by navigating to /cygdrive/<drive letter>.  To view the root of the C drive, you would type ‘cd /cygdrive/c’ from the CLI.  In this case, my windows batch script that pulls the autosupport files places them in the e:\reports\dd folder, which would be the /cygdrive/e/reports/dd folder when using the Cygwin shell.
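The path mapping can be illustrated with a small function. This is just a sketch of the rule described above (the name win_to_cygdrive is made up); on a real Cygwin install the bundled `cygpath -u` command does the same job.

```shell
# Convert a Windows path such as e:\reports\dd to its /cygdrive equivalent.
# (win_to_cygdrive is a hypothetical helper; it assumes a "X:\..." style path.)
win_to_cygdrive() {
    # First character is the drive letter; Cygwin uses it in lower case.
    drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
    # Everything after the colon, with backslashes turned into forward slashes.
    rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
    printf '/cygdrive/%s%s\n' "$drive" "$rest"
}

win_to_cygdrive 'e:\reports\dd'
```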

Here is the script:

# Extract the disk space lines from each Data Domain's autosupport file.
for DD in SITE1_DD_01 SITE1_DD_02 Site2_DD_01 Site2_DD_02; do
    SRC=/cygdrive/e/reports/dd/$DD/autosupport
    OUT=/cygdrive/e/reports/dd/$DD/${DD}_diskspace.txt
    grep "=  SERVE" "$SRC" > "$OUT"
    grep "Active Tier:" "$SRC" >> "$OUT"
    grep "Resource           Size" "$SRC" >> "$OUT"
    grep "/data:" "$SRC" >> "$OUT"
    grep "/ddvar     " "$SRC" >> "$OUT"
    # Copy the report to the web server directory.
    cp "$OUT" /cygdrive/c/inetpub/wwwroot/
done
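To show why the spacing matters, here is a small self-contained sketch that runs the same search strings against a made-up excerpt. It uses a single `grep -F` pass instead of five separate greps, which is my own variation: the matching lines come out in file order rather than grouped by pattern. The function name extract_diskspace and the sample text are illustrative only.

```shell
# Pull the disk space lines out of an autosupport file in a single pass.
# -F treats each pattern as a literal string, so the spacing must match exactly.
extract_diskspace() {
    grep -F \
        -e "=  SERVE" \
        -e "Active Tier:" \
        -e "Resource           Size" \
        -e "/data:" \
        -e "/ddvar     " \
        "$1"
}

# Made-up excerpt for demonstration; a real autosupport file is far longer.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Current Alerts
774   CRITICAL   Core dump capability is now disabled due to lack of space in /ddvar.
==========  SERVER USAGE   ==========
Active Tier:
Resource           Size GiB   Used GiB   Avail GiB   Use%   Cleanable GiB
/data: pre-comp           -   245211.8           -      -               -
/data: post-comp    51484.9    31811.0     19673.9    62%            35.4
/ddvar                 78.7        7.9        66.8    11%               -
EOF

extract_diskspace "$sample"
rm -f "$sample"
```

The alert line mentions /ddvar but is not printed, because the pattern requires five spaces after the name. That is what keeps the report limited to the usage table.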
 

For the remaining reports, I used the ‘sed’ command rather than ‘grep’.  As an example, I’ll explain how I stripped out the information for the System Alerts section.  In order to strip it out, I use sed to start printing text when it reaches ‘Current Alerts’, and stop when it reaches ‘There are’.  This same process is repeated to pull the information for the other reports as well (compression, locked files, replication, distribution).

This is what the Current Alerts section looks like in the autosupport log:

Current Alerts
--------------
Alert Id   Time                       Severity   Class               Object          Message                                                             
--------   ------------------------   --------   -----------------   -------------   ---------------------------------------------------------------------
774        Wed Dec 19 11:32:19 2012   CRITICAL   SystemMaintenance                   Core dump capability is now disabled due to lack of space in /ddvar.
807        Sun Jan 20 09:07:18 2013   WARNING    Replication         context=3       repl ctx 3: Sync-as-of time is more than 461 hours ago.             
808        Sun Jan 20 09:07:18 2013   WARNING    Replication         context=4       repl ctx 4: Sync-as-of time is more than 461 hours ago.             
809        Sun Jan 20 09:07:19 2013   WARNING    Replication         context=5       repl ctx 5: Sync-as-of time is more than 461 hours ago.              
814        Mon Jan 28 23:20:45 2013   CRITICAL   Filesystem          FilesysType=2   Space usage in Data Collection has exceeded 95% threshold.          
820        Wed Jan 30 15:52:40 2013   WARNING    Replication         context=2       repl ctx 2: Sync-as-of time is more than 148 hours ago.             
824        Mon Feb  4 12:36:39 2013   WARNING    Replication         context=10      repl ctx 10: Sync-as-of time is more than 24 hours ago.             
825        Wed Feb  6 02:13:51 2013   WARNING    Replication         context=19      repl ctx 19: Sync-as-of time is more than 24 hours ago.             
--------   ------------------------   --------   -----------------   -------------   ---------------------------------------------------------------------
There are 8 active alerts.
 

In this case, sed searches for ‘Current Alerts’ and copies all of the text until it reaches the string ‘There are’, at which point it stops and outputs the file.  The output files are written directly to the wwwroot folder on my internal reporting web page.  Each page encapsulates the output files in an iframe, so the web page is automatically updated every morning when the script runs.
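The range technique can be tried on its own with a tiny sample. The wrapper name extract_section and the shortened log excerpt below are made up for the demonstration:

```shell
# Print each block of lines that starts at a line matching START and ends at
# the next line matching END.
# (extract_section is a hypothetical wrapper; START and END are sed regular
# expressions and must not contain a "/" character.)
extract_section() {
    sed -n "/$1/,/$2/p" "$3"
}

# Shortened, made-up excerpt; the real log has many more columns and sections.
sample=$(mktemp)
cat > "$sample" <<'EOF'
some earlier part of the log
Current Alerts
--------------
1157   WARNING    Sync-as-of time is more than 24 hours ago.
1161   CRITICAL   Space usage in Data Collection has exceeded 95% threshold.
There are 2 active alerts.
some later part of the log
EOF

extract_section 'Current Alerts' 'There are' "$sample"
rm -f "$sample"
```

Only the five lines from ‘Current Alerts’ through ‘There are 2 active alerts.’ are printed; the lines before and after the section are dropped.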

Here is the second script that captures the remaining data:

# Extract each report section from every Data Domain's autosupport file
# and write it straight to the web server directory.
WWW=/cygdrive/c/inetpub/wwwroot
for DD in SITE1_DD_01 SITE1_DD_02 Site2_DD_01 Site2_DD_02; do
    SRC=/cygdrive/e/reports/dd/$DD/autosupport
    #For Alerts
    sed -n '/Current Alerts/,/There are/p' "$SRC" > "$WWW/${DD}_alerts.txt"
    #For Filesystem Compression
    sed -n '/Filesys Compression/,/((Pre-/p' "$SRC" > "$WWW/${DD}_filecomp.txt"
    #For Locked Files
    sed -n '/Locked files/,/Active/p' "$SRC" > "$WWW/${DD}_lockedfiles.txt"
    #Repl Transferred over 24 hrs
    sed -n '/Replication Data Transferred/,/(sum)/p' "$SRC" > "$WWW/${DD}_repl.txt"
    #File Distribution
    sed -n '/File Distribution/,/> 500 GiB/p' "$SRC" > "$WWW/${DD}_filedist.txt"
done
 

When all is said and done, the report output looks like this:

Alerts:

Current Alerts
--------------
Alert Id   Time                       Severity   Class         Object          Message                                                   
--------   ------------------------   --------   -----------   -------------   ----------------------------------------------------------
1157       Wed Feb  5 12:23:19 2014   WARNING    Replication   context=3       Sync-as-of time is more than 24 hours ago.                
1161       Sun Feb  9 08:33:48 2014   CRITICAL   Filesystem    FilesysType=2   Space usage in Data Collection has exceeded 95% threshold.
--------   ------------------------   --------   -----------   -------------   ----------------------------------------------------------
There are 2 active alerts.

Filesystem Compression:

Filesys Compression
--------------                  
From: 2014-03-06 05:00 To: 2014-03-13 06:00

                  Pre-Comp   Post-Comp   Global-Comp   Local-Comp      Total-Comp
                     (GiB)       (GiB)        Factor       Factor          Factor
                                                                    (Reduction %)
---------------   --------   ---------   -----------   ----------   -------------
Currently Used:   495122.5    122948.5             -            -     4.0x (75.2)
Written:*                                                                        
  Last 7 days       9204.2      5656.6          1.1x         1.5x     1.6x (38.5)
  Last 24 hrs         74.1        36.4          1.0x         2.0x     2.0x (50.9)
---------------   --------   ---------   -----------   ----------   -------------
 * Does not include the effects of pre-comp file deletes/truncates
   since the last cleaning on 2014/03/11 02:07:50.

Replications:

Replication Data Transferred over 24hr
--------------------------------------
Directory/MTree Replication:
Date         Time         CTX   Pre-Comp (KB)   Pre-Comp (KB)           Replicated (KB)   Low-bw-   Sync-as-of      
                                      Written       Remaining    Pre-Comp       Network     optim   Time            
----------   --------   -----   -------------   -------------   -----------------------   -------   ----------------
2014/03/13   06:36:18       2     104,186,675      86,628,349   55,522,200   37,756,047      1.00   Thu Mar 13 05:53
                            3               0               0            0            0      0.00   Tue Feb  4 12:00
                            4               0               0            0        4,882      0.00   Thu Mar 13 06:00
                            8               0               0            0        5,120      0.00   Thu Mar 13 06:00
                           29               0               0            0        5,098      0.00   Thu Mar 13 06:00
                        (sum)     104,186,675                   55,522,200   37,771,148      1.00  

File Distribution:

File Distribution
-----------------
169,432 files in 8,691 directories

                          Count                         Space
               -----------------------------   --------------------------
         Age         Files       %    cumul%        GiB       %    cumul%
   ---------   -----------   -----   -------   --------   -----   -------
       1 day            13     0.0       0.0       74.4     0.0       0.0
      1 week           223     0.1       0.1     2133.7     0.4       0.4
     2 weeks         5,103     3.0       3.2    39121.2     7.8       8.3
     1 month        12,819     7.6      10.7    79336.5    15.9      24.1
    2 months        21,853    12.9      23.6   153873.8    30.8      54.9
    3 months         5,743     3.4      27.0    45154.6     9.0      64.0
    6 months         3,402     2.0      29.0    13674.3     2.7      66.7
      1 year        17,035    10.1      39.1    72937.7    14.6      81.3
    > 1 year       103,241    60.9     100.0    93461.7    18.7     100.0
   ---------   -----------   -----   -------   --------   -----   -------

                          Count                         Space
               -----------------------------   --------------------------
        Size         Files       %    cumul%        GiB       %    cumul%
   ---------   -----------   -----   -------   --------   -----   -------
       1 KiB        11,257     6.6       6.6        0.0     0.0       0.0
      10 KiB        44,396    26.2      32.8        0.3     0.0       0.0
     100 KiB        38,488    22.7      55.6        1.3     0.0       0.0
     500 KiB         9,652     5.7      61.3        2.1     0.0       0.0
       1 MiB        27,460    16.2      77.5       70.4     0.0       0.0
       5 MiB        12,136     7.2      84.6       27.3     0.0       0.0
      10 MiB         3,861     2.3      86.9       26.9     0.0       0.0
      50 MiB         4,367     2.6      89.5       96.6     0.0       0.0
     100 MiB           853     0.5      90.0       58.2     0.0       0.1
     500 MiB           861     0.5      90.5      201.0     0.0       0.1
       1 GiB           495     0.3      90.8      309.7     0.1       0.2
       5 GiB           567     0.3      91.1     1460.2     0.3       0.5
      10 GiB           336     0.2      91.3     2574.1     0.5       1.0
      50 GiB        14,691     8.7     100.0   494083.5    98.9      99.8
     100 GiB            11     0.0     100.0      683.3     0.1     100.0
     500 GiB             1     0.0     100.0      173.0     0.0     100.0
   > 500 GiB             0     0.0     100.0        0.0     0.0     100.0

Disk Space:

==========  SERVER USAGE   ==========
Active Tier:
Resource           Size GiB   Used GiB   Avail GiB   Use%   Cleanable GiB
/data: pre-comp           -   245211.8           -      -               -
/data: post-comp    51484.9    31811.0     19673.9    62%            35.4
/ddvar                 78.7        7.9        66.8    11%               -

Comparing Dot Hill and EMC Auto Tiering

The problem with auto tiering in general is that a large share of hot data is “hot” only for a brief moment in time.  Workloads are not uniform across an entire data set, and if the data isn’t moved in real time it’s very likely that hot data will be served from capacity storage.  Ideally a storage system would be able to react to these changes in demand in real time, but the cost in overhead to the storage processors is generally too great.  The problem is somewhat mitigated by having a large cache (like EMC’s FAST Cache), but the ability to automatically tier data in real time would be ideal.

I recently did a comparison of Dot Hill’s auto tiering strategy to EMC’s, and found that Dot Hill takes a distinctive approach to auto tiering that looks promising.  Here’s a brief comparison between the two.

EMC

EMC greatly improved FAST VP on their VNX2.  While the VNX uses 1GB data slices, the VNX2 uses more granular 256MB data slices.  This greatly improves efficiency.  EMC is also shipping MLC SSD instead of SLC SSD.  This makes using SSDs much more affordable in FAST VP pools.  (Note that SLC is still required for FAST Cache.)

How does the smaller data slice size improve efficiency?  As an example, assume that a 500MB contiguous range of hot data residing on a SAS tier needs to be moved to EFD.  On the VNX1, an entire 1GB slice would be moved, so only about half of the relocated data is actually hot.  Scale that up to a 100GB EFD drive filled with slices like this one, and we would have 100GB of data sitting on the EFD drive but only 50GB of it is hot.  This is obviously inefficient, as 50% of the data is cold.  With the VNX2’s 256MB data slice size, only about 500MB would be moved (two 256MB slices if the range is slice-aligned), resulting in close to 100% efficiency for that block of data.  The VNX2 makes much more efficient use of the extreme performance and performance tiers.
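The arithmetic behind that example can be sketched in a few lines of shell. The function name slice_efficiency is made up, and the calculation assumes the hot range starts on a slice boundary and that 1GB is 1024MB:

```shell
# Percentage of relocated data that is actually hot, assuming the hot range
# starts on a slice boundary (an assumption made to keep the arithmetic simple).
# Usage: slice_efficiency <hot_mb> <slice_mb>
slice_efficiency() {
    hot_mb=$1
    slice_mb=$2
    # Whole slices are moved, so round the hot range up to a slice multiple.
    moved_mb=$(( (hot_mb + slice_mb - 1) / slice_mb * slice_mb ))
    # Integer percentage, rounded down.
    echo $(( hot_mb * 100 / moved_mb ))
}

slice_efficiency 500 1024   # VNX1, 1GB slices
slice_efficiency 500 256    # VNX2, 256MB slices
```

This prints 48 and 97, in line with the rough 50% and 100% figures used in the example.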

EMC’s FAST VP auto tiering on the VNX1 is configured by either setting it to manual or creating a schedule.  The schedule can be set to run 24 hours a day to move data in real time, but that isn’t practical.  The overhead on the storage processors is simply too great, so we’ve configured it to run during off peak hours.  On our busiest VNX1 we see the storage processors jump from ~50% utilization to ~75% utilization when the relocation job is running.  This may improve with the VNX2, but it’s been a problem on the CX series and the VNX1.

[Figure: VNX data slice explanation]

Dot Hill

According to Dot Hill, their auto tiering doesn’t look at every single IO like the VNX or VNX2; it looks for trends in how data is accessed.  Their rep told me to think of it as examining every 10th IO rather than every single IO.  The idea is to allow the array to move data in real time without overloading the storage processors.  Dot Hill also moves data in 4MB data slices (which is very efficient, as I explained earlier when discussing the VNX2), and will not move more than 80MB in a 5 second time span (or 960MB/minute maximum) to keep the CPU load down.
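To put that cap in perspective, here is a rough calculation of what it implies. The function names are my own, and the 100GB figure is just an example workload, not something Dot Hill quoted:

```shell
# Dot Hill's stated cap: at most 80MB moved per 5 second window.
CAP_MB=80
WINDOW_SEC=5

# Maximum data moved per minute under the cap.
max_mb_per_minute() {
    echo $(( CAP_MB * 60 / WINDOW_SEC ))
}

# Minimum number of seconds needed to relocate a given amount of data.
# Usage: min_relocation_seconds <data_mb>
min_relocation_seconds() {
    windows=$(( ($1 + CAP_MB - 1) / CAP_MB ))   # round up to whole windows
    echo $(( windows * WINDOW_SEC ))
}

max_mb_per_minute               # the 960MB/minute figure
min_relocation_seconds 102400   # 100GB of newly hot data (example value)
```

The second call prints 6400, so promoting 100GB of suddenly hot data would take at least about 107 minutes at the capped rate.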

So, how does the Dot Hill auto tiering actually work?  They use scoring, scanning, and sorting, each of which is a separate process that works in real time.  Scoring is used to maintain a current page ranking on every I/O using a process that adds less than one microsecond of overhead.  The algorithm looks at how frequently and how recently the data was accessed.  Higher scores are given to data that is more frequently and recently accessed.  Scanning for the high scoring pages happens every 5 seconds.  The scanning process uses less than 1.0% of the CPU.  The pages with the highest scores become candidates for promotion to SSD.  Sorting is the process that actually moves or migrates the pages up or down based on their score.  As I mentioned earlier, no more than 80MB of data is moved during any 5 second sort to minimize the overall performance impact.
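As a toy illustration of the scan and sort steps only (this is not Dot Hill's actual algorithm, and the scores are assumed to come from somewhere else), the sketch below picks the highest scoring pages for one 5 second window. The 20 page limit is simply the 80MB budget divided by the 4MB page size; the function and variable names are made up.

```shell
# Toy model of one 5 second scan: read "page_id score" pairs on stdin and print
# the highest scoring page ids, limited to the per-window move budget.
# 80MB per window / 4MB per page = 20 pages (figures quoted in the post).
PAGE_MB=4
WINDOW_BUDGET_MB=80

pick_promotion_candidates() {
    max_pages=$(( WINDOW_BUDGET_MB / PAGE_MB ))
    # Sort by score, highest first, keep the budgeted number, print the ids.
    sort -k2,2nr | head -n "$max_pages" | awk '{ print $1 }'
}

printf 'p1 5\np2 9\np3 1\n' | pick_promotion_candidates
```

With only three pages in the input, all of them fit in the budget and are listed in score order: p2, p1, p3.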

Summary

I haven’t used EMC’s new VNX2 or Dot Hill’s AssuredSAN, so I can’t offer any real world experience with either.  I think Dot Hill’s implementation looks very promising on paper, and I look forward to reading more customer experiences about it in the future.  They’ve been around a long time but they only recently started offering their products directly to customers, as they’ve primarily been an OEM storage manufacturer.  As I mentioned earlier, my experience with EMC’s FAST VP on the CX series and VNX1 shows that FAST VP consumes too many CPU cycles to be used in real time, so we have always run it as an off-business hours process.  That’s exactly what Dot Hill’s implementation is trying to address.  We’ve made adjustments to the FAST VP relocation schedule based on monitoring our workload.  We also use FAST Cache, which at least partially solves the problem of suddenly hot data needing extra IO.  FAST Cache and FAST VP work very well together.  Overall I’ve been happy with EMC’s implementation, but it’s good to see another company taking a different approach that could be very competitive with EMC.

You can read more about Dot Hill’s auto tiering here:

http://www.dothill.com/wp-content/uploads/2012/08/RealStorWhitePaper8.14.12.pdf

You can read more about EMC’s VNX1 FAST VP here:

https://www.emc.com/collateral/software/white-papers/h8058-fast-vp-unified-storage-wp.pdf

You can read more about EMC’s VNX2 FAST VP here:

https://www.emc.com/collateral/white-papers/h12208-vnx-multicore-fast-cache-wp.pdf