
EMC World 2015

I’m at EMC World in Las Vegas this week and it’s been fantastic so far.  I’m excited about the new 40TB XtremIO X-Bricks and how we might leverage them for our largest and most important 80TB Oracle database, about possible use cases for the Virtual VNX in our small branch locations, and about all the other exciting futures that I can’t publicly share because I’m under an NDA with EMC.  Truly exciting and innovative technology is coming from them.  VxBlock was also really impressive, although that’s not likely something my company will implement anytime soon.

I found out for the first time today that the excellent VNX Monitoring and Reporting application is now free for the VNX1 platform as well as VNX2.  If you would like to get a license for any of your VNX1 arrays, simply ask your local sales representative to submit a zero dollar sales order for a license.  We’re currently evaluating ViPR SRM as a replacement for our soon to be “end of life” Control Center install, but until then VNX MR is a fantastic tool that provides nearly the same performance data at no cost at all.  SRM adds much more functionality beyond just VNX monitoring and reporting (e.g., monitoring SAN switches) and I’d highly recommend doing a demo if you’re also still using Control Center.

We also implemented a VPLEX last year and it’s truly been a lifesaver and an amazing platform.  We currently have a VPLEX Local implementation in our primary data center, and it’s allowed us to migrate workloads from one array to another seamlessly with no disruption to applications.  I’m excited about the possibilities with RecoverPoint as well, though I’m still learning about it.

If anyone else who’s at EMC World happens to read this, comment!  I’d love to hear your experiences and what you’re most excited about with EMC’s latest technology.


Comparing Dot Hill and EMC Auto Tiering

The problem with auto tiering in general is that data which is “hot” is often hot for only a brief moment in time. Workloads are not uniform across an entire data set, and if the data isn’t moved in real time it’s very likely that hot data will be accessed from capacity storage.  Ideally a storage system would be able to react to these workload shifts in real time, but the overhead cost to the storage processors is generally too great.  The problem is somewhat mitigated by having a large cache (like EMC’s FAST Cache), but the ability to automatically tier data in real time would be ideal.

I recently did a comparison of Dot Hill’s auto tiering strategy to EMC’s, and found that Dot Hill takes a unique approach to auto tiering that looks promising.  Here’s a brief comparison between the two.

EMC

EMC greatly improved FAST VP on the VNX2.  While the VNX uses 1GB data slices, the VNX2 uses more granular 256MB data slices, which greatly improves efficiency.  EMC is also shipping MLC SSD instead of SLC SSD, making SSDs much more affordable in FAST VP pools (note that SLC is still required for FAST Cache).

How does the smaller data slice size improve efficiency?  As an example, assume that a 500MB contiguous range of hot data residing on a SAS tier needs to be moved to EFD.  On the VNX1, an entire 1GB slice would be moved.  If that “hot” 1GB slice was relocated to an EFD drive, we’d have 1GB of data sitting on EFD but only 500MB of it is actually hot, so half of the relocated data is cold.  With the VNX2’s 256MB data slice size, only two slices (512MB) would be moved, so nearly all of the relocated data is hot.  The VNX2 makes much more efficient use of the extreme performance and performance tiers.
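
To put some numbers on that, here’s a quick back-of-the-envelope illustration of the arithmetic above (just a simple Python sketch of the slice math, nothing EMC-specific):

    # Rough illustration of slice-size efficiency from the example above.
    def slices_moved(hot_mb, slice_mb):
        # Number of slices (and total MB) relocated to cover hot_mb of hot data.
        slices = -(-hot_mb // slice_mb)      # ceiling division
        return slices, slices * slice_mb

    hot_mb = 500                             # 500MB contiguous hot range
    for slice_mb in (1024, 256):             # VNX1 vs. VNX2 slice sizes
        n, moved = slices_moved(hot_mb, slice_mb)
        print(f"{slice_mb}MB slices: move {n} slice(s) = {moved}MB, "
              f"efficiency {hot_mb / moved:.0%}")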

EMC’s FAST VP auto tiering on the VNX1 is configured either manually or with a relocation schedule.  The schedule can be set to run 24 hours a day to move data in near real time, but in practice that isn’t workable: the overhead on the storage processors is simply too great, so we’ve configured it to run during off-peak hours.  On our busiest VNX1 we see the storage processors jump from ~50% utilization to ~75% utilization when the relocation job is running.  This may improve with the VNX2, but it’s been a problem on the CX series and the VNX1.

[Figure: VNX data slice explanation]

Dot Hill

According to Dot Hill, their auto tiering doesn’t look at every single IO like the VNX or VNX2 does; it looks for trends in how data is accessed.  Their rep told me to think of it as examining every 10th IO rather than every single IO.  The idea is to allow the array to move data in real time without overloading the storage processors.  Dot Hill also moves data in 4MB data slices (which is very efficient, as I explained earlier when discussing the VNX2), and will not move more than 80MB in a 5 second time span (or 960MB/minute maximum) to keep the CPU load down.

So, how does Dot Hill’s auto tiering actually work?  They use scoring, scanning, and sorting, and each is a separate process that works in real time.  Scoring is used to maintain a current page ranking on every I/O using a process that adds less than one microsecond of overhead.  The algorithm looks at how frequently and how recently the data was accessed; higher scores are given to data that is more frequently and recently accessed.  Scanning for the high scoring pages happens every 5 seconds, and the scanning process uses less than 1% of the CPU.  The pages with the highest scores become candidates for promotion to SSD.  Sorting is the process that actually moves or migrates the pages up or down based on their score.  As I mentioned earlier, no more than 80MB of data is moved during any 5 second sort to minimize the overall performance impact.
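
Here’s how I picture those three stages fitting together.  This is purely my own illustrative sketch of the description above: the scoring weights, decay factor, and toy I/O stream are made up, and only the 4MB page size, the 5 second scan interval, and the 80MB-per-sort cap come from Dot Hill’s description.

    from collections import defaultdict

    PAGE_MB = 4          # Dot Hill relocates data in 4MB pages
    SORT_CAP_MB = 80     # no more than 80MB moved per 5-second sort cycle

    scores = defaultdict(float)

    def score(page_id):
        # Scoring: every sampled I/O bumps the page's rank; older activity
        # decays, so both recency and frequency influence the score.
        scores[page_id] = scores[page_id] * 0.9 + 1.0

    def scan_and_sort(promote):
        # Scanning + sorting: find the highest-scoring pages and move at most
        # 80MB worth of them (20 pages) up to SSD.  In the real array this
        # cycle repeats every 5 seconds.
        hottest = sorted(scores, key=scores.get, reverse=True)
        for page_id in hottest[: SORT_CAP_MB // PAGE_MB]:
            promote(page_id)

    # Toy driver: simulate a handful of sampled I/Os, then run one cycle.
    for page in (7, 7, 7, 3, 7, 9, 3, 7):
        score(page)
    scan_and_sort(lambda p: print(f"promote page {p} to SSD"))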

Summary

I haven’t used EMC’s new VNX2 or Dot Hill’s AssuredSAN myself, so I can’t offer any real world experience with either.  I think Dot Hill’s implementation looks very promising on paper, and I look forward to reading more customer experiences about it in the future.  They’ve been around a long time, but they only recently started offering their products directly to customers, as they’ve primarily been an OEM storage manufacturer.  As I mentioned earlier, my experience with EMC’s FAST VP on the CX series and VNX1 shows that running relocations continuously consumes too many CPU cycles; we have always run it as an off-business-hours process.  That’s exactly what Dot Hill’s implementation is trying to address.  We’ve made adjustments to the FAST VP relocation schedule based on monitoring our workload.  We also use FAST Cache, which at least partially solves the problem of suddenly hot data needing extra IO; FAST Cache and FAST VP work very well together.  Overall I’ve been happy with EMC’s implementation, but it’s good to see another company taking a different approach that could be very competitive with EMC.

You can read more about Dot Hill’s auto tiering here:

http://www.dothill.com/wp-content/uploads/2012/08/RealStorWhitePaper8.14.12.pdf

You can read more about EMC’s VNX1 FAST VP here:

https://www.emc.com/collateral/software/white-papers/h8058-fast-vp-unified-storage-wp.pdf

You can read more about EMC’s VNX2 FAST VP here:

https://www.emc.com/collateral/white-papers/h12208-vnx-multicore-fast-cache-wp.pdf

Troubleshooting NAS Discovery issues on EMC Ionix Control Center (ECC)

We recently converted two of our existing VNX arrays to unified systems, and I was attempting to add the new NAS components to Control Center.  I went through the normal assisted discovery using the ‘NAS Container’ option.  Unfortunately, I got an error in the discovery results window.  Here’s the error I saw:

SESSION_ACTION: Discover  [4]  MO Type = NasContainer
Container_IP=10.10.10.4 | Container_Port=443 | Container_Username=root | Container_Password=****** | Container_Type=Celerra
  command status = finished, errors
  objects found = 6  agents responding = 2
  completed in 231 seconds
  action begins at: Wed Nov 06 09:48:05 CST 2013
  action ends at: Wed Nov 06 09:51:56 CST 2013
 
Reported objects:
[1]  NasContainer=10.10.10.4
 
Reported agent errors:
[1]  ADAResult: (9) Celerra@10.10.10.4 : SSH communication failed – Please verify emcplink settings
    Responding agent: NAS Agent @ eccagtserver.rgare.net
[2]  ADAResult: (9) Celerra@10.10.10.4: nas_version returned invalid response
    Responding agent: NAS Agent @ eccinfserver.rgare.net
Responding agents:
[1]  NAS Agent @ eccagtserver.rgare.net
[2]  NAS Agent @ eccinfserver.rgare.net

 

It looks like there’s an ssh setting that’s incorrect.  Being unfamiliar with the emcplink utility, I did a bit of research on how to configure it properly, and I will go through what needs to be done.

Before diving in to using and configuring emcplink, here are some simple troubleshooting steps you should run through first:

   – Verify that the NAS agent is installed and active.  You can view all of the running agents by clicking on the gear icon on the lower right hand side (in the status bar).  Scroll through the agents and make sure the agent is active.

   – Verify that the Java Process is running on the Control Station.

         * Log in to the Control Station and type the following command:

                      ps -aex |grep java

          * If it’s running, you will see lines similar to the following:

                      21927 ? S 0:15 /usr/java/bin/java -server …..

                      22200 ? S 0:00 /usr/java/bin/java -server …..

 – Make sure that the ssh daemon is running (I’m assuming you’re using ssh for remote connectivity):

           * Log in to the Control Station and type the following command:

                       ps -aex |grep sshd

           * If it’s running, you will see a line similar to the following:

                       882 ?      Ss     0:00 /usr/sbin/sshd

 – Verify that the Celerra (or VNX) data mover is connected to an array

                       nas_storage -list

 – Verify that the user name and password you’re using during the assisted discovery work. Try logging on with that ID/password directly.
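
If you find yourself repeating these checks on more than one Control Station, a small script along the lines of the sketch below can run them all at once.  It simply wraps the same ps and nas_storage commands listed above and is meant to be run on the Control Station itself; it’s an illustration, not an official utility.

    import subprocess

    def running(pattern):
        # True if a process matching pattern shows up in ps output
        out = subprocess.run(["ps", "-aex"], capture_output=True, text=True).stdout
        return any(pattern in line for line in out.splitlines())

    print("java process:", "OK" if running("/usr/java/bin/java") else "NOT RUNNING")
    print("sshd daemon :", "OK" if running("/usr/sbin/sshd") else "NOT RUNNING")

    # nas_storage -list confirms the data mover is connected to an array
    result = subprocess.run(["nas_storage", "-list"], capture_output=True, text=True)
    print(result.stdout or result.stderr)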

Here are the troubleshooting steps I took, and some more info about emcplink:

What is emcplink? It’s a utility that allows you to specify security policies for secure shell (SSH) client authentication, which is required for the Storage Agent for NAS to discover NAS containers.

The highest ssh security level (full security) requires that users manually run emcplink to provide a username and password for ssh authentication and to manually accept an ssh key returned from emcplink in order to discover the NAS container.  After it’s accepted, the key is stored on the NAS Agent host. If the key changes, you must manually run emcplink again and accept the changed key to rediscover the NAS container. If your environment does not require full ssh security, use emcplink to set a lower security level that will automatically accept new or changed keys without requiring manual entry of ssh usernames, passwords, and keys.

The emcplink command is a command line utility. To run emcplink, first open a command prompt window.  Change to the <install_root>/exec/CNN610 directory on the host where Storage Agent for NAS resides, where <install_root> is the ControlCenter infrastructure install directory.

If your installation uses SSH version 2, update your agent configuration so emcplink uses SSH version 2 when handling SSH keys; the default is version 1. Note that SSH version 2 is not backward compatible with version 1. If you switch to SSH version 2, you must run emcplink again to rediscover all NAS containers that were previously discovered with SSH version 1.

If you want to update your install to version 2, follow these steps (I did this during my troubleshooting):

1. Stop Storage Agent for NAS using the ControlCenter Console.

2. Edit the following file:

      <install_root>/exec/CNN610/cnn.ini

3. In the [ssh] section of cnn.ini, change version = 1 to version = 2 (the relevant lines are shown after these steps).

4. Save and exit cnn.ini.

5. Restart Storage Agent for NAS.
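
After making the edit in step 3, the relevant portion of cnn.ini should look something like this (only the [ssh] section is shown; the rest of the file varies by installation, so treat this as an approximate sketch rather than a full listing):

       [ssh]
       version = 2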

The next step is to enable the policy that you need for your environment. The default policy is EMC_SSH_KEY_SECURITY_FULL.  You can add more than one policy.  If added policies contradict one another, the most recently added policy takes effect.

Enter the following command to add or remove an SSH security policy:

       emcplink -setpolicy [+|-]policy_name

       Example:  emcplink -setpolicy +EMC_SSH_KEY_SECURITY_FULL

In my case, I first disabled the default policy, then enabled the policy that I wanted. Here are the commands I ran:

       emcplink -setpolicy -EMC_SSH_KEY_SECURITY_FULL

       emcplink -setpolicy +EMC_SSH_KEY_SECURITY_ALLOW_NEW

After running it, I used the ‘getpolicy’ option to verify what the current active policy was:

       emcplink -getpolicy

The output looks like this:

          Policy Is:
              EMC_SSH_KEY_SECURITY_ALLOW_NEW
 

Here are the policy options you can choose from:

EMC_SSH_KEY_SECURITY_FULL (default)

Do not automatically accept any new or changed NAS container keys. They must be accepted manually, using emcplink (refer to emcplink -interactive). Provides the same functionality that plink (no longer valid) provided in previous ControlCenter versions, when manual user name/password/key entry was required.

EMC_SSH_KEY_SECURITY_ALLOW_NEW

Accept new keys, but not changed keys. SSH authentication occurs automatically for initial discovery, and also for subsequent discoveries as long as the NAS Container key does not change. If a key is changed, discovery is attempted via telnet.

EMC_SSH_KEY_SECURITY_ALLOW_CHANGE

Accept changed keys, but not new keys. When a Celerra is initially discovered, SSH authentication occurs manually. If the key is changed for subsequent discoveries of that NAS Container, SSH authentication occurs automatically.

EMC_SSH_KEY_SECURITY_NONE

Accept both new and changed keys. SSH authentication occurs automatically at initial discovery and all subsequent discoveries, regardless of whether NAS container keys are changed.

After I verified the policy I wanted was running, I then manually entered SSH security information for each array I wanted to add in order to accept the NAS container keys. Run the following command for each array to add the key to the server cache (the -2 optionally tells emcplink to use SSH version 2):

         emcplink -ssh -interactive -2 -pw password username@Array_IP_address
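
Since I had more than one array to add, a small wrapper script like the one below can save a little typing.  This is purely a hypothetical convenience sketch: the array list and credentials are placeholders, it just calls the documented emcplink command for each Control Station, and you’ll still be prompted to accept each key interactively.

    import subprocess

    ARRAYS = ["10.10.10.4", "10.10.10.5"]   # Control Station IPs (placeholders)
    USERNAME = "root"                        # placeholder credentials
    PASSWORD = "password"

    for ip in ARRAYS:
        # Same emcplink syntax as shown above; -2 forces SSH version 2
        subprocess.run(
            ["emcplink", "-ssh", "-interactive", "-2", "-pw", PASSWORD, f"{USERNAME}@{ip}"],
            check=True,
        )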

As noted earlier, SSH version 2 is not backward compatible with version 1, so any NAS containers previously discovered with SSH version 1 must be rediscovered with emcplink after the switch.

That’s it! Once I accepted all of the ssh keys and re-ran the discoveries, the new arrays were discovered just fine.

 

Gartner’s Market Share Analysis for Storage Vendors

Here’s an interesting market share analysis by Gartner that was published a few months ago:  http://www.gartner.com/technology/reprints.do?id=1-1GUZA31&ct=130703&st=sb.  It looks like EMC and NetApp rule the market, with EMC on top.  Below are the key findings copied from the linked article.

  • EMC and NetApp retained almost 80% of the market. They were separated by more than $2 billion in revenue from IBM and HP, their next two largest competitors.
  • No. 1 EMC grew its overall network-attached storage (NAS)/unified storage share to 47.9% (up from 41.7% in 2011), while No. 2 NetApp’s overall NAS/unified storage share dropped to 30.3% (down from 36% in 2011).
  • In the overall NAS/unified storage share ranking, the positions of the nine named vendors remained unchanged in 2012 (in order of share rank: EMC, NetApp, IBM, HP, Oracle, Netgear, Dell, Hitachi/Hitachi Data Systems and Fujitsu).
  • For the fifth consecutive year, iSCSI storage area network (SAN) revenue and Fibre Channel SAN revenue continued to gain proportionate share in the overall NAS/unified market.
  • The “pure NAS” market continues to grow at a much faster rate (15.9%) than the overall external controller-based (ECB) block-access market (2.3%), in large part due to the expanding NAS support of growing vertical applications and virtualization.

VNX vs. V7000

Like my previous post about the XIV, I did a little bit of research on how the IBM V7000 compares as an alternative to EMC’s VNX series (which I use now).  It’s only a 240 disk array, but it allows for 4 nodes in a cluster, so it comes close to matching the VNX 7500’s 1000 disk capability.  It’s an impressive array overall but does have a few shortcomings.  Here’s a brief overview of some interesting facts I’ve read, along with some of my opinions.  It’s by no means a comprehensive list.

Good:

·         Additional processing power.  Block IO can be spread out among 8 storage processors in a four node cluster.  You need to buy four frames to get 8 SPs and up to 960 disks.  Storage pools can span clustered nodes.

·         NAS Archiving Support. NAS has a built in archiving option to easily move files to a different array or different disks based on date.  EMC does not have this feature built in to DART.

·         File Agent. There is a management agent for both block and file.  EMC does not have a NAS/file host agent; you have to open a session and log in to the Linux-based DART OS.

·         SDD multipathing agent is free.  IBM’s SDD multi-pathing agent (comparable to PowerPath) is free of charge.  EMC’s is not.

·         Nice Management GUI.  The GUI is extremely impressive and easy to use.

·         Granularity.  The EasyTier data chunk size is configurable;  EMC’s FAST VP is stuck at 1GB.

·         Virtualization.  The V7000 can virtualize other vendors’ storage, making it appear to be disks on the actual V7000.  Migrations from other arrays would be simple.

·         Real time Compression.  Real time compression that, according to IBM, actually increases performance when enabled.  EMC utilizes post process compression. 

Bad: 

·         No backend bus between nodes. Clustered nodes have no backend bus and must be zoned together, so inter-node IO traverses the same fabric as hosts.  I don’t see this configuration as optimal; the IO between clustered nodes must travel across the core fabric.  All nodes plug in to the core switches and are zoned together to facilitate communication between them.  According to IBM this introduces 0.06ms of latency, but in my opinion that latency could increase with IO contention from hosts.  You may see an improvement in performance and response time due to the extra cache and striping across more disks, but you would see that on the VNX as well by adding additional disks to a pool and increasing the size of FAST Cache, and all of the disk IO would remain on the VNX’s faster backend bus.  The fact that the clustered configuration introduces any latency at all is a negative in my mind.  Yes, this is another “FUD” argument.

·         Lots of additional connectivity required.  Each node would use 4 FC ports on the core switches (16 ports on each fabric in a four node cluster).  That’s significantly more port utilization on the core switches than a VNX would use.

·         No 3 Tier support in storage pools.  IBM’s EasyTier supports only two tiers of disk.  For performance reasons I would want to use only SSD and SAS.  Eliminating NL-SAS from the pools would significantly increase the number of SAS drives I’d have to purchase to make up the difference in capacity.  EMC’s FAST VP of course supports 3 tiers in a pool.

·         NAS IO limitation.  On unified systems, NAS file systems can only be serviced by IOgroup0, which means they can only use the two SPs in the first node of the cluster.  File systems can be spread out among disks on other nodes, however.

 

VNX vs. XIV

I recently started researching and learning more about IBM’s XIV as an alternative to EMC’s VNX.  We already have many VNXs installed and I’m a very happy EMC customer, but checking out EMC’s competition is always a good thing.  I’m looking at other vendors for comparison as well, but I’m just focusing on the XIV for this post.  On the surface the XIV looks like a very impressive system.  It’s got a fully virtualized grid design that distributes data over all of the disks in the array, which maximizes performance by eliminating hotspots.  It also offers industry leading rebuild times for disk failures.  It’s almost like a drop-in storage appliance: you choose the size of the NL-SAS disks you want, choose a capacity, drop it in and go.  There’s no configuration to be done and an array can be up and running in just hours.  They are also very competitive on price.  While it really is an impressive array, there are some important things to note when you compare it to a VNX.  Here’s a brief overview of some interesting facts I’ve read, along with some of my opinions.

Good:

1.       More CPU power.  The XIV has more CPUs and more overall processing power than the VNX.  Its performance scales very well with capacity, as each grid module adds an additional CPU and cache.

2.       Remote Monitoring.  There’s an app for that!  IBM has a very nice monitoring app available for iOS.  I was told, however, that IBM does not provide the iPad to run it. 🙂

3.       Replication/VMWare Integration.  VNX and XIV have similar capabilities with regard to block replication and VMWare integration.

4.       Granularity & Efficiency.  The XIV stores data in 1MB chunks evenly across all disks, so reading and writing is spread evenly on all disks (eliminating hotspots).  The VNX stores data in 1GB chunks across all disks in a given storage pool.

5.       Cloning and Rebuild Speed.  Cloning and disk rebuilds are lightning fast on the XIV because of the RAID-X implementation.  All of the disks are used to rebuild one.

6.       Easy upgrades.  The XIV has a very fast, non-disruptive upgrade process for hardware and software.  The VNX has a non-disruptive upgrade process for FLARE (block), but not so much on DART (file).

Bad:

1.       No Data Integrity Checking.  The XIV is the only array that IBM offers that doesn’t have T10-DIF to protect against Data Corruption and has no persistent CRC written to the drives.  EMC has it across the board.

2.       It’s Block Only.  The XIV is not a unified system so you’d have to use a NAS Gateway if you want to use CIFS or NFS.

3.       It’s tierless storage with no real configuration choices.  The XIV doesn’t tier.  It has a large SSD cache (similar to the VNX’s FAST Cache) in front of up to 180 NL-SAS drives.  You have no choice on disk type, no choice on RAID type, and you must stay with the same drive size that you choose from the beginning for any expansions later on.  It eliminates the ability of storage administrators to manage or separate workloads based on application or business priority; you’d need to buy multiple arrays to do that.  The XIV is not a good choice if you have an application that requires extensive tuning or very aggressive/low latency response times.

4.       It’s an entirely NL-SAS array.  In my opinion you’d need a very high cache hit ratio to get the IO numbers that IBM claims on paper.  It feels to me like they’ve come up with a decent method to use NL-SAS disks for something they weren’t designed to do, but I’m not convinced it’s the best thing to do.  There is still a very good use case for having SAS and SSD drives used for persistent storage.

5.       There’s an increased risk of data loss on the XIV vs. the VNX.  All LUNs are spread across all of the drives in an XIV array, so a part of every LUN is on every single drive in the array.  When a drive is lost the remaining drives have to regenerate mirror copies of any data that was on the failed drive.  The probability of a second drive failure during the rebuild of the first is not zero, although it’s very close to zero due to the very fast rebuild times.  What happens if a second drive fails before the XIV array has completed rebuilding the first failed drive?  You lose ALL of the LUNs on that XIV array.  It’s a “FUD” argument, yes, but the possibility is real.  Note: the first comment on this blog post states that this is not true; you can read the reply below.

6.       Limited usable capacity on one array.  In order to get up to the maximum capacity of 243TB on the XIV you’d need to fill it with 180 3TB drives.  Also, once you choose a drive size in the initial config you’re stuck with that size.  The maximum raw capacity (using 3TB drives) for the VNX 5700 is 1485TB, and for the VNX 7500 it is 2970TB.

Upgrading EMC Ionix Control Center from Update Bundle 12 (UB12) to Update Bundle 14 (UB14)

Below are the steps I just took to upgrade our ECC environment from UB12 to UB14. There are quite a few steps and EMC has this process documented very well. Every install file and patch has its own readme file with detailed installation instructions along with additional info about special considerations depending on your environment. If you’re about to begin this upgrade, you can use these steps as a general guide, but I highly recommend reading all of EMC’s documentation as well. There are general notes, warnings, and information in their documentation that did not apply to me but may apply to you. From beginning to end the upgrade process took me about seven hours to complete.

Prerequisites for the repository server:

    • Oracle 10g OCPU Update must be installed
    • Operating system locale settings must be set to English
    • Minimum 13GB of free disk space
    • Disable IOPath (ECC_IOPath.bat -i to check, ECC_IOPath.bat -d to disable)
    • You must back up the repository for ECC and StorageScope
    • ControlCenter Environment Checker must be run
    • All infrastructure services must be stopped
    • The following additional services must be stopped: AntiVirus, COM+, Server management services, vmware tools, Distributed Transaction Coordinator, Terminal Services, and WMI.

1. Download all of the required files. I’m running ECC on two 32-bit Windows servers (one repository server and one application server), so all of the download links I provided below are for Windows.

a. Download the UB14 Update Bundle (32 bit is CC_4979.zip, 64 bit is CC_5013.zip):

https://download.emc.com/downloads/DL44408_ControlCenter_Update_Bundle_14_(64-bit)_software_download.zip

https://download.emc.com/downloads/DL44407_ControlCenter_Update_Bundle_14_(32-bit)_software_download.zip

b. Download the latest cumulative Patch (currently CC_5046.zip):

https://download.emc.com/downloads/DL47699_Cumulative_6.1_Update_Bundle_14_Storage_Agent_for_HDS_Patch_5046_Download.zip

c. Download the latest Ionix ControlCenter 6.1 Backup and Restore Utility:

https://download.emc.com/downloads/DL37276_Ionix_ControlCenter_6.1_Backup_and_Restore_Utility_.zip

d. Download the latest Ionix ControlCenter Environment Checker (Currently version 1.0.5.3):

https://download.emc.com/downloads/DL37274_Ionix_ControlCenter_Environment_Checker_1.0.5.3.exe

e. Download the latest Oracle 10g OCPU Patch here (CC_5030.zip):

https://download.emc.com/downloads/DL46584_ControlCenter_6.1_OCPU_(5030)_Download.zip

f. Download the latest Unisphere Host Client/Agent:

https://download.emc.com/downloads/DL30840_Unisphere__Client_(Windows)_1.2.26.1.0094.exe

g. Download the latest Navisphere CLI (Current release is 7.32.25.1.63):

https://download.emc.com/downloads/DL30859_Navisphere_CLI_(Windows_-_all_supported_32_&_64-bit_versions)_7.32.25.1.63.exe

h. I’d recommend reviewing the UB14 Read Me file to identify any specific warnings that may be relevant to your specific installation. The Read Me file for Update Bundle 14 can be downloaded here:

https://support.emc.com/docu44826_Ionix_ControlCenter_6.1_Update_Bundle_14_Read_Me.pdf?language=en_US

2. Log in to the repository server. Run the ECC Backup and Restore Utility. Because our ECC servers are virtualized, I also took a snapshot from the vSphere console so I could revert back if needed. Below is an overview of the steps I took; review the readme file from EMC for more detail. Note that I shut down our ECC application server for the duration of the upgrade on the repository server.

a. Run the patch to install the files

b. Go to the ECC install directory, HF4938 folder, note timestamp on regutil610.exe

c. Check that the ECC/tools/util/regutil610.exe file has the same timestamp

d. Run ECC/HF498/backup.bat (that sets all ECC services to manual as well)

e. Reboot hosts, verify ECC services are not running

f. Manually back up the ECC folder to another location

3. Run Environment checker on the repository server.

a. Make sure that all checks have passed before proceeding with the upgrade.

4. Check ECC 6.1 compatibility matrix, make sure everything is up to date and compatible

a. Here’s a link to the compatibility matrix: https://support.emc.com/docu31657_Ionix-ControlCenter-6.1-Support-Matrix.pdf?language=en_US

b. Note that UB14 requires Solutions Enabler 7.5, 7.5.1 or 7.6.

c. Host Agent must be at 1.2.x and CLI must be at 7.32.x if you’re running the latest VNX hardware.

5. Run Oracle 10g OCPU Patch (CC_5030) on the repository server if required. This step took quite a bit of time to complete. I’d expect 60-90 minutes for this step.

a. Make sure all ECC application services are stopped.

b. Run %ECC_INSTALL_ROOT%\Repository\admin\Ramb_scripts\ramb_hotback.bat

c. Run %ECC_INSTALL_ROOT%\Repository\admin\Ramb_scripts\ram_export_db.bat

d. Run %emcstsoh%\admin\emcsts_scripts\emcsts_coldback.bat

e. Run %emcstsoh%\admin\emcsts_scripts\emcsts_export_db.bat

f. Copy the contents of the %ECC_INSTALL_ROOT% to a backup location of your choice

g. Extract the SQL_PATCH_5030.zip to a local drive on the ECC Server.

h. Verify repository by running the following: %ECC_INSTALL_ROOT%\Repository\admin\Ramb_scripts\ramb_recomp_invalid.bat

i. Verify the storage scope repository by running the following: %ECC_INSTALL_ROOT%\Repository\admin\emcsts_scripts\emcsts_recomp_invalid.bat

j. If the prior two scripts ran successfully, proceed, if not call EMC.

k. Run the prereboot.bat script from the SQL_PATCH_5030 directory.

l. Reboot

m. Run the postreboot.bat script from the SQL_PATCH_5030 directory.

n. At the end of the postreboot script, it will ask if you want to revert all the services back to their original state. Respond with Y.

o. Verify successful installation by running %ECC_INSTALL_ROOT%\Repository\admin\emcsts_scripts\emcsts_hotfixstatus.bat

6. Update the Repository server with the latest Host agent and Navisphere CLI:

a. Unisphere Host Agent 1.2.25.1.0163: https://download.emc.com/downloads/DL30839_Unisphere_Host_Agent_(Windows_-_all_supported_32_&_64-bit_versions)__1.2.25.1.0163.exe

b. Navisphere CLI 7.32.25.1.63: https://download.emc.com/downloads/DL30859_Navisphere_CLI_(Windows_-_all_supported_32_&_64-bit_versions)_7.32.25.1.63.exe

c. A reboot is not required after installing the Agent and CLI. Make sure to preserve your security settings when asked in the installation.

7. Install a newer version of Solutions Enabler, if required. As mentioned earlier, UB14 requires version 7.5, 7.5.1 or 7.6.

a. Link to the MRLK Control Center supported version of SE (v7.5.0 – CC_5014): https://download.emc.com/downloads/DL44373_Solutions_Enabler_7.5_MRLK.zip

b. There is a readme pdf file included in the download with details on installation.

c. Stop the EMC storapid service.

d. Stop the EMC Control Center Server and Store Services.

e. Do NOT stop the OracleECCREP_HOMETNSListener or OracleServiceRAMBDB services.

f. Run the install.

g. Restart the services.

8. Prepare for UB14 Installation. Now all the prerequisite installs are complete on my system. The next step is to verify that the correct services are stopped and begin the UB14 upgrade. We have everything running on one repository server, your environment could be different.

a. On the repository server, stop the following services, set them to Manual, and reboot:

i. EMC ControlCenter API Server

ii. EMC ControlCenter Key Management Server

iii. EMC ControlCenter Master Agent

iv. EMC ControlCenter Repository

v. EMC ControlCenter Server

vi. EMC ControlCenter Store

vii. EMC ControlCenter STS MSA Web Server

viii. EMC ControlCenter Web Server

ix. EMC StorageScope Repository

x. EMC StorageScope Server

b. The following services must be stopped: AntiVirus, COM+, Server management services, vmware tools, Distributed Transaction Coordinator, and WMI.

c. Verify the following services are running before beginning the UB14 installation:

i. OracleECCREP_HOMETNSListener

ii. OracleServiceRAMBDB

iii. OracleServiceEMCSTSDB

9. Begin the UB14 Installation (CC_4979 for 32-bit, CC_5013 for 64-bit)

a. Execute the Patch61014383_4979_x86.exe patch (or the 64 bit patch if you’re on a 64bit server).

b. I recommend reading the UB14 readme file. If you encounter any errors, a few common install issues are listed.

c. This step takes a very long time, count on at least 3-4 hours. I also have StorageScope which added additional time.

d. After the patch install completes, change all the services back to their original ‘automatic’ state.

e. Reboot the repository server

10. Install the latest cumulative Update Bundle 14 patch next (The most current right now is CC_5046).

a. The current version of the UB14 patch can be downloaded here:

https://download.emc.com/downloads/DL47699_Cumulative_6.1_Update_Bundle_14_Storage_Agent_for_HDS_Patch_5046_Download.zip

b. Stop the EMC Control Center Store and Server Services

c. Stop the following Services:

i. EMC ControlCenter API Server

ii. EMC ControlCenter Key Management Server

iii. EMC ControlCenter Master Agent

iv. EMC ControlCenter Repository

v. EMC ControlCenter Server

vi. EMC ControlCenter Store

vii. EMC ControlCenter STS MSA Web Server

viii. EMC ControlCenter Web Server

ix. EMC StorageScope Repository

x. EMC StorageScope Server

d. Run Patch 61014615_5046_hds.exe

e. Restart all Services that were stopped earlier in steps B and C.

f. Apply patch to agents:

i. Start the ECC Console

ii. Right click on the repository server object

iii. Select Agents > Apply Patch

iv. If the task fails, restart the master agent from the console

11. Update the application server

a. Install the latest Unisphere Host Agent 1.2.25.1.0163: https://download.emc.com/downloads/DL30839_Unisphere_Host_Agent_(Windows_-_all_supported_32_&_64-bit_versions)__1.2.25.1.0163.exe

b. Install the latest Navisphere CLI 7.32.25.1.63: https://download.emc.com/downloads/DL30859_Navisphere_CLI_(Windows_-_all_supported_32_&_64-bit_versions)_7.32.25.1.63.exe

c. Update the ECC Console by navigating to https://<repository_server>:30002/webinstall

i. Click Installation

ii. Click Console Patch 6.1.0.14.383

d. Install Solutions Enabler on the application server (not the MRLK version)

i. Windows 32 bit download

https://download.emc.com/downloads/DL43505_se7500-WINDOWS-x86.exe.exe

ii. Windows 64 bit download:

https://download.emc.com/downloads/DL43504_se7500-WINDOWS-x64.exe.exe

iii. Stop existing ECC and Solutions Enabler services

iv. Launch Install

e. Apply Agent patches from within ECC Console

i. Start the ECC Console

ii. Right click on the application server object

iii. Select Agents > Apply Patch

iv. If the task fails, restart the master agent from the console

12. Verify that WLA Archive collection (performance data) is collecting properly.

a. Go to your WLAArchives folder

b. Go to the Clariion\<serial number>\interval subfolder

c. Sort the files by date. If new files are being written then it is working.
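
Rather than eyeballing the date sort, a quick check like the sketch below will tell you whether new interval files have shown up recently.  The WLAArchives path, the array serial number, and the one-hour threshold are just placeholder examples; adjust them for your own environment.

    import os, time

    # adjust the path to your WLAArchives location and array serial number
    interval_dir = r"C:\ECC\WLAArchives\Clariion\APM00123456789\interval"

    newest = max(os.path.getmtime(os.path.join(interval_dir, f))
                 for f in os.listdir(interval_dir))
    age_minutes = (time.time() - newest) / 60
    print(f"Newest interval file is {age_minutes:.0f} minutes old")
    print("Collection looks OK" if age_minutes < 60 else "Check WLA collection!")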

13. Verify that all agents are running and are at the correct patch level

a. Launch the Control Center console

b. Click on the small gear icon on the lower right side of the window, which will launch the agents view on the right side of the screen

c. Verify that all agents are running and are patched to 6.1.0.14.383. Any that require updates can be updated from this screen: right click, choose Agents, and install the patch.

14. Remove backup directories (optional). The list is on pages 16 and 17 of the UB14 readme file.

That’s it! You’re done.

Note: 24 hours after the completion of the upgrade I noticed that WLA archive data collection wasn’t working for some of the arrays. I deleted the arrays that weren’t working and rediscovered them, which resolved the problem. Be aware that deleting the arrays removes all historical data from StorageScope.

What’s new in EMC Control Center Update Bundle 14 (UB14)?

I’m in the process right now of upgrading our ECC installation from UB12 to UB14.  I’ll be making another post soon with the steps I took to complete the upgrade.  Below is a list of all the new features in UB14.

  • Added support for Oracle 11g.
  • Added Oracle 64-bit support in ControlCenter for 64-bit operating systems.
  • Added support for VMAX 1.5 and CKD devices.
  • Added support for Solutions Enabler 7.5.
  • Added support to discover and display new Federated Tiered Storage (FTS).
  • Added support for EMC CentraStar® 4.2.2.
  • Added support for Java 1.7 update 7.
  • Added support for HDS WLA by using JRE 1.7.
  • Added support for the availability of pool compression information in STS query builder while generating custom reports.
  • Added support for Mixed RAID types in VNX storage pool.
  • Added a new memory check during installation to verify if 4GB of RAM is available to install/upgrade UB14 in an all-in-one setup.
  • Added a new memory check during installation to verify if 3GB of RAM is available to install/upgrade UB14 in a distributed setup.
  • Added a new disk space check during installation to verify if minimum disk space is available for UB14 upgrade in both all-in-one and distributed setups and notify users if the requirement is not met.
  • Added support to export/import database to/from an external drive.
  • Added new free disk space availability alert.
  • Removed support for Symmetrix, SDM, and CLARiiON agents on HP-UX and AIX operating systems.
  • Removed support for Microsoft Internet Explorer version 6.0.
  • Added support to discover or rediscover Managed Objects through NAS, CLARiiON, Symmetrix, and Host Agents using DefaultHiddenDCP feature.
  • Added support to change the fully qualified domain name of non-infrastructure hosts or platforms.
  • Added support to remove virtual machines marked as Deleted on the console automatically using a scheduled job that runs every day.
  • Added support to reserve connections for WLA communications.
  • Added support to optimize the mechanism of sending MOLIST for Brocade switches.
  • Added support to optimize performance statistics collection for WLA.
  • Added support to optimize provider calls for switch and fabric discovery.
  • Added support to optimize logging and error handling in agent.
  • Added support to start processing opcode only after the agent INITX initialization.
  • Added virtual provisioning support for CLARiiON FLARE 28.
  • Added support to edit MessageReplyTimeout value in Gateway agent ini file.
  • Improved error messages in console for action.
  • Improved Store log to simplify root cause determination of issues.
  • Reduced Integration Gateway agent crashes.

First day at EMC World 2013

It’s been a great first day at EMC World 2013 so far.  I’ve been to three breakout sessions, two of which were very informative and useful. I focused my day today on learning more about best practices for the EMC technologies that I already use.  I’m not going to go into great detail now as I need to get to the next session. 🙂 The notes I made are specific to a Brocade switch implementation as that’s what I use. Here are some useful bullet points from a few of my sessions today.

Storage Area Networking Best Practices:

1.  Single Initiator Zoning.  Only put one initiator and the targets it will access in one zone.  I’ve already practiced this for many years, but it was interesting to hear the reasons for doing it that way.  It drastically reduces the number of server queries to the switch.

2.  Dynamic Interface Management.  Use Brocade port fencing.  It can prevent a SAN outage by giving you the ability to shut down a single host or port.

3.  Monitor for Congestion.  Congestion from one host can spread and cause problems in other operating environments.  Definitely enable Brocade bottleneck detection.

4.  Periodic SAN health checks.  Use the EMC SANity or Brocade SAN Health tools regularly.  If you’re reading this, you’re very likely already doing that. 🙂

5.  Monitor for bit errors.  These can lead to bb_credit loss.  The Brocade parameter portcfglongdistance should be set (the bb_credit recovery option), which will prevent the problem.  Bit errors could still be a problem on F Ports, however.

6.  Cable Hygiene.  Always clean your cables!  They are a major contributor to physical connectivity problems.

7.  Target Releases.  Always run a target release of Firmware.

VNX NAS Configuration Best Practices:

1. For transactional NAS, always use the correct transfer size for your workload.  The latest VNX OE release supports up to 1MB.

2. Use Jumbo frames end to end.

3. 10G is available now. Don’t use 1G anymore!

4. Use link aggregation for redundancy and load balancing, failsafe for high availability.

5. Use AVM for typical NAS configurations, MVM for transactional NAS where you have high concurrency from a single NFS stream to a single NFS export. 

6. Use Direct Writes for apps that use concurrent IO or are asynchronous (Oracle, VMWare, etc.).

7. For NAS Pool LUNs, use thick LUNs that are all the same size, balance the SP designation for each one, use the same tiering policy for each one, use approximately 1 LUN for every 4 drives, and use thin enabled file systems to maximize the consumption of the highest tier (see the sketch after this list).

8. Always enable FASTCache for NAS LUNs.  They benefit from it even more than block LUNs.
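
As a rough illustration of the pool LUN guideline in item 7 (the drive count and pool capacity below are made-up examples, not a recommendation):

    # Rough sketch of the "1 LUN per 4 drives, all the same size, balanced SPs"
    # guideline above.  Drive count and pool capacity are placeholder values.
    pool_drives = 20
    pool_capacity_gb = 10000

    lun_count = max(2, pool_drives // 4)          # roughly 1 LUN per 4 drives
    lun_size_gb = pool_capacity_gb // lun_count   # all LUNs the same size

    for i in range(lun_count):
        sp = "A" if i % 2 == 0 else "B"           # alternate default SP ownership
        print(f"LUN {i}: {lun_size_gb}GB, default owner SP {sp}, same tiering policy")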

 

Magic Quadrant for Storage Vendors

I was reading over the Gartner report today that dives into specifics on the strengths and weaknesses of the big players in the storage market.  The original Gartner article can be viewed here: http://www.gartner.com/technology/reprints.do?id=1-1EL3WXN&ct=130321&st=sb.

It came as no surprise to me that EMC was ranked as the market leader, with NetApp as #2, Hitachi #3, and IBM #4.  I’ve been very pleased with the performance and reliability of EMC products in my environment and their global support is top notch.  Below is the magic quadrant from the article.

[Figure: Gartner Magic Quadrant chart showing the market leaders]

What’s new in FLARE 32?

We recently installed a new VNX5700 at the end of July and EMC was on-site to install it the day after the general release of FLARE 32. We had planned our pool configuration around this release, so the timing couldn’t have been more perfect. After running the new release for about a month now it’s proven to be rock solid, and to date no critical security patches have been released that we needed to apply.

The most notable new feature for us was the addition of mixed RAID types within the same pool.  We can finally use RAID6 for the large NL-SAS drives in the pool and not have to make the entire pool RAID6.  There also are several new performance enhancements that should make a big impact, including load balancing within a tier and pool rebalancing.

Below is an overview of the new features in the initial release (05.32.000.5.006).

· VNX Block to Unified Services Plug-n-play: This feature allows customers to perform Block to Unified upgrades.

· Support for In-Family upgrades: This feature allows for the In-family Data In Place (DIP) Conversion kits that are to be available for the VNX OE for File 7.1 and VNX OE for Block 05.32 release. In-family DIP conversions across the VNX family will re-use the installed SAS DAEs with associated drives and SLICs.

· Windows 2008 R2: Branch Cache Support: This feature provides support for the branch cache feature in Windows 2008 R2. This feature allows a Windows client to cache a file locally in one of the branch office servers and then use the cached copy whenever the same file is being requested. The VNX for File CIFS Server operates as the Central Office Content Server in support of the Branch Cache feature.

· VAAI for NFS: Snaps of snaps: Supports snaps of snaps of VMDK files on NFS file systems to at least 1 level of depth (source to snap to snap). Though the functionality will initially only be available through the VAAI for NFS interface, it may in the future be exposed through the VSI plug-in as well. This feature allows VMware View 5.0 to use VAAI for NFS on a VNX system. This feature also requires that file systems be enabled during creation for “VMware VAAI nested clone support”, and the installation of the EMCNasPlugin-1-0.10.zip on the ESX servers.

· Load balance within a tier: This feature allows for redistribution of slices across all drives in a given tier of a pool to improve performance. This also includes proactive load balancing, such that slices are relocated from one RAID group in a tier to another, based on activity level, to balance the load within the tier. These relocations will occur within the user-defined tiering relocation window.

· Improve block compression performance: This feature provides for improved block compression performance. This includes increasing speed of compression operations, decreasing storage system impact, and decreasing host response time impact.

· Deeper File Compression Algorithm: This feature provides an option to utilize a deeper compression algorithm for greater capacity savings when using file-level compression. This option can be leveraged by 3rd party application servers such as the FileMover Appliance Server, based on data types (i.e. per metadata definitions) that are best suited for the deeper compression algorithm.

· Rebalance when adding drives to Pools: This feature provides for the redistribution of slices across all drives in a newly expanded pool to improve performance.

· Conversion of DLU to TLU and back, when enabling Block Compression: This feature provides an internal mechanism (not user-invoked), when enabling Compression on a thick pool LUN, that would result in an in-place conversion from Thick (“direct”) pool LUNs to Thin Pool LUNs, rather than performing a migration. Additionally, for LUNs that were originally Thick and then converted to Thin, it provides an internal mechanism, upon disabling compression, to convert the LUNs back to Thick, without requiring a user-invoked migration.

· Mixed RAID Types in Pools: This feature allows a user to define RAID types per storage tier in pool, rather than requiring a single RAID type across all drives in a single pool.

· Improved TLU performance, no worse than 115% of a FLU: This feature provides improved TLU performance. This includes decreasing host response time impact and potentially decreasing storage system impact.

· Distinguished Compression Capacity Savings: This feature provides a display of the capacity savings due to compression. This display will inform the user of the savings due to compression, distinct from the savings due to thin provisioning. The benefit for the user is that he can determine the incremental benefit of using compression. This is necessary because there is currently a performance impact when using Compression, so users need to be able to make a cost/benefit analysis.

· Additional Tiering Policies: This feature provides additional tiering options in pools: namely, “Start High, then Auto-tier”. When the user selects this policy for a given LUN, the initial allocation is on highest tier, and subsequent tiering is based on activity.

· Additional RAID Options in Pools: This feature provides 2 additional RAID options in pools, for better efficiency: 8+1 for RAID 5, and 14+2 for RAID 6. These options will be available for new pools. These options will be available in both the GUI and the CLI.

· E-Trace Enhancements: Top files per fs and other stats: This feature allows the customer to identify the top files in a file system or quota tree. The files can be identified by pathnames instead of ids.

· Support VNX Snapshots in Unisphere Quality of Service Manager: This feature provides Unisphere Quality of Service Manager (UQM) support for both the source LUN and the snapshot LUN introduced by the VNX Snapshots feature.

· Support new VNX Features in Unisphere Analyzer: This feature provides support for all new VNX features in Unisphere Analyzer, including but not limited to VNX Snapshots and 64-bit counters.

· Unified Network Services: This feature provides several enhancements that will improve user experience. The various enhancements delivered by UNS in VNX OE for File 7.1 and VNX OE for Block 05.32 release are as follows:

· Continuous Monitoring: This feature provides the ability to monitor the vital statistics of a system continuously (out-of-box) and then take appropriate action when specific conditions are detected. The user can specify multiple statistical counters to be monitored – the default counters that will be monitored are CPU utilization, memory utilization and NFS IO latency on all file systems. The conditions when an event would be raised can also be specified by the user in terms of a threshold value and time interval during which the threshold will need to be exceeded for each statistical counter being monitored. When an event is raised, the system can perform any number of actions – possible choices are log the event, start detailed correlated statistics collection for a specified time period, send email or send a SNMP trap.

· Unisphere customization for VNX: This feature provides the addition of custom references within Unisphere Software (VNX) via editable source files for product documentation and packaging, custom badges and nameplates.

· VNX Snapshots: This feature provides VNX Snapshots (a.k.a write-in-place pointer-based snapshots) that in their initial release will support Block LUNs only and require pool-based LUNs. VNX Snapshots will support File Systems in a later release. The LUNs referred to below are pool-based LUNs (thick LUNs and Thin LUNs.)

· NDMP V4 IPv6 extension: This feature provides support for the Symantec, NetApp, and EMC authored and approved NDMP v4 IPv6 extension in order to back up using NDMP 2-way and 3-way in IPv6 networked environments.

· NDMP Access Time: This feature provides the last access time (atime). In prior releases, this was not retained during an NDMPCopy and was thus set to the time of migration. So, after a migration, the customer lost the ability to archive “cold” or “inactive” data. This feature adds an optional NDMP variable (RETAIN_ATIME=y/n, the default being ‘n’) which if set includes the atime in the NDMP data stream so that it can be restored properly on the destination.

· SRDF Interoperability for Control Station: This feature provides SRDF (Symmetrix Remote Data Facility) with the ability to manage the failover process between local and remote VNX Gateways. VNX Gateway needs a way to give up control of the failover and failback to an external entity and to suppress the initiation of these processes from within the Gateway.

 

ProSphere 1.6 Updates

ProSphere 1.6 was released this week, and it looks like EMC was listening!  Several of the updates are features that I specifically requested when I gave my feedback to EMC at EMC World.  I’m sure it’s just a coincidence, but it’s good to finally see some valuable improvements that make this product that much closer to being useful in my company’s environment.  The most important items I wanted to see were the ability to export performance data to a csv file and improved documentation on the REST API.  Both of those things were included with this release.  I haven’t looked yet to see if the performance exports can be run from a command line (a requirement for it to be useful to me for scripting).  The REST API documentation was created in the form of a help file.  It can be downloaded and run from an internal web server as well, which is what I did.

Here are the new features in v1.6:

Alerting

ProSphere can now receive Brocade alerts for monitoring and analysis. These alerts can be forwarded through SNMP traps.

Consolidation of alerts from external sources is now extended to include:

• Brocade alerts (BNA and CMCNE element managers)

• The following additional Symmetrix Management Console (SMC) alerts:
– Device Status
– Device Pool Status
– Thin Device Allocation
– Director Status
– Port Status
– Disk Status
– SMC Environmental Alert

Capacity

– Support for Federated Tiered Storage (FTS) has been added, allowing ProSphere to identify LUNs that have been presented from external storage and logically positioned behind the VMAX 10K, 20K and 40K.

– Service Levels are now based on the Fully Automated Storage Tier (FAST) policies defined in Symmetrix arrays. ProSphere reports on how much capacity is available for each Service Level, and how much is being consumed by each host in the environment.

Serviceability

– Users can now export ProSphere reports for performance and capacity statistics in CSV format.

Unisphere for VMAX 1.0 compatibility

– ProSphere now supports the new Unisphere for VMAX as well as Single Sign On and Launch-in-Context to the management console of the Unisphere for VMAX element manager. ProSphere, in conjunction with Unisphere for VMAX, will have the same capabilities as Symmetrix Management Console and Symmetrix Performance Analyzer.

Unisphere support

– In this release, you can launch Unisphere (CLARiiON, VNX, and Celerra) from ProSphere, but without the benefits of Single Sign On and Launch-in-Context.

EMC World 2012 – Thoughts on preparing for a VDI deployment

We’re in the process of testing the deployment of VDI now so I attended a session about preparing for a deployment.   As I mentioned in my previous post, I’m not an expert on any subject after attending a one hour session, but I thought I’d share my thoughts.

The most important takeaway from the session was mentioned near the beginning – for a truly successful deployment you must do a detailed desktop IO analysis.  It’s important to have a firm grasp on the amount of desktop IO that your company requires, and a detailed analysis is the only way to gather that info.  There are rules of thumb that can be followed (5 IOPS for light users, 10 IOPS for medium users, and 20 IOPS for heavy users), but you could easily end up either over-allocating or under-allocating without knowing the actual numbers.  Lakeside Software and Liquidware Labs were mentioned as vendors who provide software that does such an analysis, however I’ve never heard of either of them and can provide no information or feedback on their services. There’s a good free VDI calculator available at http://myvirtualcloud.net/?page_id=1076.  Once you have a good grasp of the amount of IO you’ll need to support, scaling for the 95th percentile should be your target.
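
As a rough illustration of those rules of thumb (the desktop counts below are made up; only the 5/10/20 IOPS per-user figures come from the session):

    # Back-of-the-envelope VDI sizing using the rule-of-thumb figures above.
    iops_per_user = {"light": 5, "medium": 10, "heavy": 20}
    desktops = {"light": 500, "medium": 300, "heavy": 100}   # made-up counts

    estimate = sum(iops_per_user[t] * n for t, n in desktops.items())
    print(f"Rough steady-state estimate: {estimate} IOPS")
    # A real design should be based on measured data and sized to the
    # 95th percentile, not on averages like this.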

What’s the best way to prepare for VDI on a VNX array?  If you conclude from your analysis that your desktop environment is more read intensive, consider hosting it on storage pools that utilize FAST VP with EFD and FastCache.  Using VMWare’s View Storage Accelerator and EMC’s VFCache would also be of great benefit.  If your environment is more write intensive, focus on increasing the spindle count (even at the expense of wasted capacity).  All of the other items mentioned for read intensive environments still apply and would be beneficial.

Final thoughts:

– If you’re using Windows XP make sure disk alignment is set correctly.  If it’s not, you could see up to a 50% penalty in performance.

– Image optimization is very important.  Remove all unneeded services.

– Schedule your normal maintenance operations off hours.  Running patches and updates during business hours could cause the help desk phones to light up.

– Using NFS vs. block is a wash performance wise.  NFS is currently better for scaling up, however, as it allows for up to a 32 node cluster vs. only 8 for block (on linked clones).

– Desktops tend to be write heavy.  FAST VP is great!

EMC World 2012 – Thoughts on Isilon

I’m at EMC World in Las Vegas and I finished up my first full day at EMC World 2012 today.  There are about 15,000 attendees this year, which is much more than last year and it’s obvious. The crowds are huge and the Venetian is packed full.  Joe Tucci’s Keynote was amazing, the video screen behind Joe was longer than a football field and he took the time to point that out. 🙂  He went into detail about the past, present and future of IT and it was very interesting.

Many of the sessions I'm signed up for have non-disclosure agreements, so I can't speak about some of the new things being announced or the sessions I've attended.  I'm trying to focus on learning about (and attending breakout sessions for) EMC technologies that we don't currently use in my organization, to broaden my scope of knowledge.  There may be better solutions available from EMC than what the company I work for is currently using, and I want to learn about all the options available.

My first session today was about EMC's Isilon product and I was excited to learn more about it.  My only experience so far with EMC's file-based solutions is with legacy Celerra arrays and VNX File.  So, what's the difference, and why would anyone choose to purchase an Isilon solution over simply adding a few data movers to their VNX?  Why is Isilon better?  Good question.  I attended an introductory-level session, but it was very informative.

I'm not going to pretend to be an expert after listening to a one-hour session this morning, but here were my takeaways from it.  Isilon, in a nutshell, is a much higher performance solution than VNX File.  There are several different iterations of the platform available (S, X, and NL series), each focused on specific customer needs: one for highly transactional, IOPS-intensive workloads, another geared for capacity, and another geared for a balance of the two (and a smaller budget).  It uses the OneFS single filesystem (impressive by itself), which eliminates the standard abstraction layers of filesystem, volume manager, and RAID.  All of the disks use a single file system regardless of the total size.  The data is arranged in a fully symmetric cluster with data striped across all of the nodes.  The single OneFS filesystem works regardless of the size of your filesystem, from an 18TB minimum all the way up to 15 PB.

Adding a new node to Isilon is seamless; it's instantly added to the cluster (hence the term "Scale-out NAS" EMC has been touting throughout the conference).  You can add up to 144 nodes to a single Isilon cluster.  It also features auto balancing, in that it will automatically rebalance data onto the node that was just added.  It can also move data off a node if you decide to decommission it and replace it with a newer model.  Need to replace that four-year-old Isilon node with the old, low capacity disks?  No problem.  Another interesting item to note is how data is stored across the nodes.  Isilon does not use a standard RAID model at all; it distributes data across the disks based on how much protection you decide you need.  As an administrator you choose how the data is protected, keeping as many copies of data as you want (at the expense of total available storage).  The more duplicate copies of data you keep, the less total storage you have available for production use.  One great benefit of Isilon vs. VNX File is that rebuilds are much faster: traditional RAID groups are dependent on the total IO available to the single drive being rebuilt, while Isilon rebuilds are spread across the entire system.  It could mean the difference between a 12 hour single disk RAID 5 rebuild and less than one hour on Isilon.  Pretty cool stuff.
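
To illustrate the protection-versus-capacity trade-off in the simplest possible terms (this ignores OneFS's FEC protection levels and metadata overhead entirely, and the cluster size is made up):

#!/usr/bin/bash
# Hypothetical cluster: 10 nodes with 36 TB of raw capacity each.
nodes=10
raw_per_node=36
copies=2    # number of copies of each block you choose to keep

raw=$(( nodes * raw_per_node ))
usable=$(( raw / copies ))

echo "Raw: ${raw} TB, usable at ${copies}x mirroring: ${usable} TB"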

It was also mentioned in the session that Isilon replication can go down to the specific folder level within a file system.  Very cool.  My only replication experience is with Celerra Replicator, and I can only replicate at the entire file system level on VNX File and Celerra right now.  I don't have any hands-on experience with that Isilon functionality yet, but it sounds very interesting.

There is an upcoming version of OneFS (called "Mavericks") that will introduce even more new features; I'm not going to go into those, as they may be covered by the non-disclosure agreement.  Everything I've mentioned thus far is available currently.  Overall, I was very impressed with the Isilon architecture as compared to VNX File.  EMC claimed that Isilon delivers the highest file system NAS throughput of any vendor, at 106 GB/sec.  Again, very impressive.

I'll make another update this week after attending a few more breakout sessions.  I'm also looking forward to learning more about Greenplum; the promise of improved performance through parallelism (using a scale-out architecture on standard hardware) is also very interesting to me.  If anyone else is at EMC World this week, please comment!

Cheers!

Performance Data Collection/Discovery issues in ProSphere 1.5.0

I was an early adopter of ProSphere 1.0; it was deployed at all of our data centers globally within a few weeks of its release.  I gave up on 1.0.3, as the syncing between instances didn't work and EMC told me that it wouldn't be fixed until the next major release.  Well, that next major release was 1.5, so I jumped back in when it was released in early March 2012.

My biggest initial frustration was that performance data didn't seem to be collected for any of the arrays.  I was able to discover all of the arrays, but there wasn't any detailed information available for any of them.  No LUN detail, no performance data.  Why?  Well, it turns out ProSphere's data collection is extremely dependent on a full path discovery, from host to switch to array.  Simply discovering the arrays by themselves isn't sufficient; unless at least one host sees the complete path, performance collection on the array is not triggered.

With that said, my next step was to get everything properly discovered.  Below is an overview of what I did to get everything discovered and performance data collection working.

1 – Switch Discovery.

Because an SMI-S agent is required to discover arrays and switches, you’ll need a separate server to run the SMI-S agents. I’m using a Windows 2008 server.  If you want to keep data collection separated between geographical locations, you’ll need to install separate instances of ProSphere at each site and have separate SMI-S agent servers at each site.  The instances can then be synchronized together in a federated configuration (in Admin | System | Synchronize ProSphere Applications).

We use Brocade switches, so I initially downloaded and installed the Brocade SMI-S agent.  It can be downloaded directly from Brocade here:  http://www.brocade.com/services-support/drivers-downloads/smi-agent/index.page.  I installed 120.9.0 and had some issues with discoveries; EMC later told me that I needed to use 120.11.0 or later, which didn't seem to be available on Brocade's website.  After speaking to an EMC rep regarding the Brocade SMI-S agent version issue, it was recommended that I use EMC's software instead (either should work, however).  You can use the SMI-S agent that's included with Connectrix Manager Converged Network Edition (CMCNE).  The product itself requires a license, but you do not need a license to use only the SMI-S agent.  After installation, launch "C:\CMCNE 11.1.4\bin\smc.bat" and click on the Configure SMI Agent button to add the IPs of your switches.  The one issue I ran into with this was user security: only one userid and password can be used across all switches, so you may need to create a new ID and password on all of your switches.  I had to do that and spent about half a day finishing it up.  Once you've added the switches, use the IP of the host that the agent is installed on as your target for switch discovery in ProSphere.  The default userid and password is administrator / password.

Make sure that port 5988 is open on the server you're running this agent on.  If it's Windows 2K8, disable the Windows firewall or add exceptions for ports 5988 and 5989 as well as for the SMI-S processes ECOM and SLPD.exe.
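
A quick way to verify that the ports are actually reachable is a plain bash TCP test from another machine.  This assumes bash's /dev/tcp support and the coreutils timeout command are available (both are in recent cygwin releases); the hostname is a placeholder.

#!/usr/bin/bash
# Test TCP connectivity to the SMI-S agent ports (5988 for HTTP, 5989 for HTTPS).
host=smis-agent-server    # placeholder -- your agent server's name or IP
for port in 5988 5989; do
  if timeout 3 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
    echo "Port $port on $host is open"
  else
    echo "Port $port on $host is NOT reachable"
  fi
done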

2 – EMC Array Discovery

I had initially downloaded and installed the Solutions Enabler vApp thinking that it would work for my Clariion & VNX discoveries.  I was told later (after opening an SR) that it does not provide SMI-S services.  EMC has its own SMI-S agent that will need to be installed on a separate server, as it will use the same ports (5988/5989) as the Brocade agent (or CMCNE).  It can be downloaded here:  http://powerlink.emc.com/km/appmanager/km/secureDesktop?_nfpb=true&_pageLabel=servicesDownloadsTemplatePg&internalId=0b014066800251b8&_irrt=true, or by navigating in Powerlink to Home > Support > Software Downloads and Licensing > Downloads S > SMI-S Provider.

Once EMC’s SMI-S agent is installed you’ll need to add your arrays to it.  Open a command prompt and navigate to C:\Program Files\EMC\ECIM\ECOM\bin, and launch testsmiprovider.   When it prompts, choose “localhost”, “port 5988”, and use admin / #1Password as the login credentials.  Once logged in, you can use the “addsys” command to add the IP’s of your arrays.

Just like before, make sure that port 5988 is open on the server you’re running this agent on and disable the windows firewall or add an exception for ports 5988 and 5989.  You’ll again use the IP of the host that the agent is installed on as your target for array discovery.

3 – Host Discoveries

Host discoveries can be done directly without an agent.  You can use the root password for UNIX or ESX hosts and any AD account in Windows that has local administrator rights on each server.  Of course, you can also set up specialized service accounts with the appropriate rights based on your company's security regulations.

4 – Enable Path Data Collection

In order to see specific information about LUNs on the array, you will need to enable Path Performance Collection for each host.  If the host isn’t discovered and performance collection isn’t enabled, you won’t see any LUN information when looking at arrays.  To enable it, go to Discovery | Objects list | Hosts from the ProSphere console and click on the “On | Off” slider button to turn it on for each host.

5 – Verify Full path connectivity

Once all of the discoveries are complete, you can verify full path connectivity for an array by going to Discovery | Objects list | Arrays, click on any array, and look at the map.  If there is a cloud representing a switch with a red line to the array, you’re seeing the path.  You can use the same method for a host, if you go to Discovery | Objects list | Hosts and click on a host, you should see the host, the switch fabric, and the array on the map.  If you don’t see that full path you won’t get any data collected.

Comments and Thoughts

You can go directly to EMC’s community forum for general support and information here:   https://community.emc.com/community/support/prosphere.

After using ProSphere 1.5.0  for a little while now, I must say it’s definitely NOT Control Center.  It isn’t quite as advanced or full featured, but I don’t think it’s supposed to be.  It’s supposed to be an easy to deploy tool to get basic, useful information quickly.

I use the pmcli.exe command line tool in ECC extensively for custom data exports and reporting, and ProSphere does not provide a similar tool.  EMC does have an API built into ProSphere that can be used to pull information over HTTPS (for example, to get a host list, type https://<prosphere_app_server>/srm/hosts).  I haven't done too much research into that feature yet.  Version 1.5 added API support for array capacity, performance, and topology data.  You can read more about it in the white paper titled "ProSphere Integration: An overview of the REST API's and information model" (h8893), which should be available on Powerlink.
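
As an illustration, a quick pull of the host list from a shell might look like the sketch below, assuming curl is available.  The -k flag skips validation of the appliance's self-signed certificate, and the credentials and output handling are my assumptions, so check the h8893 white paper for the actual API details.

#!/usr/bin/bash
# Pull the ProSphere host list over the REST API and save it for ad-hoc reporting.
server=prosphere_app_server          # placeholder -- your ProSphere appliance name or IP
curl -k -u admin:changeme "https://$server/srm/hosts" -o hosts.xml
echo "Host list saved to hosts.xml"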

My initial upgrade from 1.0.3 to 1.5.0 did not go smoothly; I had about a 50% success rate across all of my installations.  My issues included upgrades that completed but left services unable to start afterwards, and in one case the update web page simply stayed blank and would not let me run the upgrade at all.  Beware of upgrades if you want to preserve existing historical data; I ended up deleting the vApp and starting over for most of my deployments.

I've only recently been able to get all of my discoveries completed.  I feel that the ProSphere documentation is somewhat lacking; I found myself wanting more detail in many areas.  Most of my time has been spent doing trial and error testing, with a little help from EMC support after I opened an SR.  I'll write a more detailed post about actually using ProSphere in the real world once I've had more time with it.

Other items to note:

-ProSphere does not provide the same level of detail for historical data that you get in StorageScope, nor does it give the same amount of detail as Performance Manager.  It’s meant more for a quick “at a glance” view.

-ProSphere does not include the root password in the documentation; customers aren't necessarily supposed to log in to the console.  I'm sure you could obtain it with a call to support.  Having the ability to at least start and stop services would be useful, as I had an issue with one of my upgrades where services wouldn't start.  You can view the status of the services on any ProSphere server by navigating to https://prosphere_app_server/cgi-bin/mon.cgi?command=query_opstatus_full.

-ProSphere doesn’t gather the same level of detail about hosts and storage as ECC, but that’s the price you pay for agentless discovery.  Agents are needed for more detailed information.

How to troubleshoot EMC Control Center WLA Archive issues

We're running EMC Control Center 6.1 UB12, and we use it primarily for its robust performance data collection and reporting capabilities.  Performance Manager is a great tool and I use it frequently.

Over the years I’ve had occasional issues with the WLA Archives not collecting performance data and I’ve had to open service requests to get it fixed.  Now that I’ve been doing this for a while, I’ve collected enough info to troubleshoot this issue and correct it without EMC’s assistance in most cases.

Check your ..\WLAArchives\Archives directory and look under the Clariion (or Celerra) folder, then the folder with your array's serial number, then the interval folder.  This is where the "*.ttp" (text) and "*.btp" (binary) performance data files are stored for Performance Manager.  Sort by date; if there isn't a new file that's been written in the last few hours, data is not being collected.
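
If you have cygwin on the Control Center server, a quick staleness check might look like this.  The archive path is a placeholder, so adjust it for your install location, array type, and interval folder.

#!/usr/bin/bash
# List interval files written in the last four hours; no output means collection has stalled.
archive=/cygdrive/c/ECC/WLAArchives/Archives    # placeholder path -- adjust for your install
find "$archive" \( -name '*.ttp' -o -name '*.btp' \) -mmin -240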

Here are the basic items I generally review when data isn’t being collected for an array:

  1. Log in to every array in Unisphere, go to system properties, and on the ‘General’ tab make sure statistics logging is enabled.  I’ve found that if you don’t have an analyzer license on your array and start the 7 day data collection for a “naz” file, after the 7 days is up the stats logging option will be disabled.  You’ll have to go back in and re-enable it after the 7 day collection is complete.  If stats logging isn’t enabled on the array the WLA data collection will fail.
  2. If you recently changed the password on your Clariion domain account, make sure that naviseccli is updated properly for security access to all of your arrays (use the "addusersecurity" CLI option; a sample command is shown after this list) and perform a rediscovery of all of your arrays from within the ECC console as well.  There is no way to update the password on an array from within the ECC console; you must go through the discovery process again for all of them.
  3.  Verify the agents are running.  In the ECC console, click on the gears icon in the lower right hand corner.  It will create a window that shows the status of all the agents, including the WLA Archiver.  If WLA isn’t started, you can start it by right clicking on any array, choosing Agents, then start.  Check the WLAArchives  directories again (after waiting about an hour) and see if it’s collecting data again.
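
For item 2 in the list above, the security file update is a one-liner per management host.  The user, password, and scope values below are placeholders, so check the naviseccli documentation for the options that fit your environment.

# Refresh the stored Navisphere CLI credentials used by the agents (placeholder values shown).
naviseccli -addusersecurity -user clariion_admin -password 'NewPassword1' -scope 0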

If those basic steps don’t work, checking the logs may point you in the right direction:

  1.  Review the Clariion agent logs for errors.  You're not looking for anything specific here; just search for "error", "unreachable", or the specific IPs of your arrays and see if anything obvious stands out.
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Bx.log.gz
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL.ini
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Err.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Bx_Err.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Discovery.log.gz
 

Here’s an example of an error I found in one case:

            MGL 14:10:18 C P I 2536   (29.94 MB) [MGLAgent::ProcessAlert] => Processing SP
            Unreachable alert. MO = APM00100600999, Context = Clariion, Category = SP
            Element = Unreachable
 

  2.  Review the WLA Agent logs.  Again, just search for errors and see if there is anything obvious that's wrong.

            %ECC_INSTALL_ROOT%\exec\ENW610\ENW.log
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Bx.log.gz
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW.ini
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Err.log
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Bx_Err.log
 

If the logs don’t show anything obvious, here are the steps I take to restart everything.  This has worked on several occasions for me.

  1. From the Control Center console, stop all agents on the ECC Agent server.  Do this by right clicking on the agent server (in the left pane), choose agents and stop.  Follow the prompts from there.
  2. Log in to the ECC Agent server console and stop the master agent.  You can do this in Computer Management | Services, stop the service titled “EMC ControlCenter Master Agent”.
  3. From the Control Center console, stop all agents on the Infrastructure server.  Do this by right clicking on the agent server (in the left pane), choose agents and stop.  Follow the prompts from there.
  4. Verify that all services have stopped properly.
  5. From the ECC Agent server console, go to C:\Windows\ECC\ and delete all .comfile and .lck files.
  6. Restart all agents on the Infrastructure server.
  7. Restart the Master Agent on the Agent server.
  8. Restart all other services on the Agent server.
  9. Verify that all services have restarted properly.
  10. Wait at least an hour and check to see if the WLA Archive files are being written.

If none of these steps resolve your problem and you don't see any errors in the logs, it's time to open an SR with EMC.  I've found the EMC staff that supports ECC to be very knowledgeable and helpful.

Critical Celerra FCO for Mac OS X 10.7

Here’s the description of this issue from EMC:

EMC has become aware of a compatibility issue with the new version of MacOS 10.7 (Lion). If you install this new MacOS 10.7 (Lion) release, it will cause the data movers in your Celerra to reboot which may result in data unavailability. Please note that your Celerra may encounter this issue when internal or external users operating the MacOS 10.7 (Lion) release access your Celerra system.  In order to protect against this occurrence, EMC has issued the earlier ETAs and we are proactively upgrading the NAS code operating on your affected Celerra system. EMC strongly recommends that you allow for this important upgrade as soon as possible.

Wow.  A single Mac user running 10.7 can cause a kernel panic and a reboot of your data mover.  According to our rep, Apple changed the implementation of SMB in 10.7 and added a few commands that the Celerra doesn't recognize.  This is a big deal, folks; call your account rep for a DART upgrade if you haven't already.

Auto generating daily performance graphs with EMC Control Center / Performance Manager

This document describes the process I used to pull performance data using the ECC pmcli command line tool, parse the data to make it more usable with a graphing tool, and then use perl scripts to automatically generate graphs.

You must install Perl.  I use ActiveState Perl (Free Community Edition) (http://www.activestate.com/activeperl/downloads).

You must install Cygwin.  Link: http://www.cygwin.com/install.html. I generally choose all packages.

I use the following CPAN Perl modules: Text::ParseWords, GD::Graph (the GDGraph package, which provides GD::Graph::lines), and Data::Dumper.  These are the modules used by the graphing script in Step 3.

Step 1:

Once you have the software set up, the first step is to use the ECC command line utility to extract the interval performance data that you’re interested in graphing.  Below is a sample PMCLI command line script that could be used for this purpose.

:: Get the current date

For /f "tokens=2-4 delims=/" %%a in ('date /t') do (set date=%%c%%a%%b)

:: Export the interval file for today's date.

D:\ECC\Client.610\PerformanceManager\pmcli.exe -export -out D:\archive\interval.csv -type interval -class clariion -date %date% -id APM00324532111

:: Copy the exported data to my cygwin home directory for processing later.

copy /y D:\archive\interval.csv C:\cygwin\home\<userid>

You can schedule the command script above to run using windows task scheduler.  I run it at 11:46PM every night, as data is collected on our SAN in 15 minute intervals, and that gives me a file that reports all the way up to the end of one calendar day.

Note that there are 95 data collection points from 00:15 to 23:45 every day if you collect data at 15 minute intervals.  The storage processor data resides in the last two lines of the output file.

Here is what the output file looks like:

EMC ControlCenter Performance manager generated file from: <path>

Data Collected for DiskStats

Data Collected for DiskStats – 0_0_0

                                          3/28/11 00:15   3/28/11 00:30   3/28/11 00:45   3/28/11 01:00
Number of Arrivals with Non Zero Queue    12              20              23              23
% Utilization                             30.2            33.3            40.4            60.3
Response Time                             1.8             3.3             5.4             7.8
Read Throughput IO per sec                80.6            13.33           90.4            10.3

Great information in there, but the format makes it very hard to do anything meaningful with the data in an Excel chart.  If I want to chart only % Utilization, that data is difficult to isolate because it's surrounded by so many other counters that also have data collected for them.  My next goal was to write a script to reformat the data into a much more usable layout and automatically create a graph for one specific counter that I'm interested in (like daily utilization numbers), which could then be emailed daily or auto-uploaded to an internal website.

Step 2:

Once the PMCLI data is exported, the next step is to use cygwin bash scripts to parse the csv file and pull out only the performance data that is needed.  Each SAN will need a separate script for each type of performance data.  I have four scripts configured to run based on the data that I want to monitor.  The scripts are located in my cygwin home directory.

The scripts I use:

  • Iostats.sh (for total IO throughput)
  • Queuestats.sh (for disk queue length)
  • Resptime.sh (for disk response time in ms)
  • Utilstats.sh (for % utilization)

Here is a sample shell script for parsing the CSV export file (iostats.sh):

#!/usr/bin/bash

#This will pull only the timestamp line from the top of the CSV output file. I’ll paste it back in later.

grep -m 1 "/" interval.csv > timestamp.csv

#This will pull out only lines that begin with "Total Throughput IO per sec".

grep -i "^Total Throughput IO per sec" interval.csv >> stats.csv

#This will pull out the disk/LUN title info for the first column.  I'll add this back in later.

grep -i "Data Collected for DiskStats -" interval.csv > diskstats.csv

grep -i "Data Collected for LUNStats -" interval.csv > lunstats.csv

#This will create a column with the disk/LUN number.  I'll paste it into the first column later.

cat diskstats.csv lunstats.csv > data.csv

#This adds the first column (disk/LUN) and combines it with the actual performance data columns.

paste data.csv stats.csv > combined.csv

#This combines the timestamp header at the top with the combined file from the previous step to create the final file we’ll use for the graph.  There is also a step to append the current date and copy the csv file to an archive directory.

cat timestamp.csv combined.csv > iostats.csv

cp iostats.csv /cygdrive/e/SAN/csv_archive/iostats_archive_$(date +%y%m%d).csv

#  This removes all the temporary files created earlier in the script.  They’re no longer needed.

rm timestamp.csv

rm stats.csv

rm diskstats.csv

rm lunstats.csv

rm data.csv

rm combined.csv

#This strips the last two lines of the CSV (Storage Processor data).  The resulting file is used for the "all disks" spreadsheet.  We don't want the SP data to skew the graph.  This CSV file is also copied to the archive directory.

sed '$d' < iostats.csv > iostats2.csv

sed '$d' < iostats2.csv > iostats_disk.csv

rm iostats2.csv

cp iostats_disk.csv /cygdrive/e/SAN/csv_archive/iostats_disk_archive_$(date +%y%m%d).csv

Note: The shell script above can be run from the Windows task scheduler as long as you have cygwin installed.  Here's the syntax:

c:\cygwin\bin\bash.exe -l -c "/home/<username>/iostats.sh"

After running the shell script above, the resulting CSV file contains only Total Throughput (IO per sec) data for each disk and lun.  It will contain data from 00:15 to 23:45 in 15 minute increments.  After the cygwin scripts have run we will have csv datasets that are ready to be exported to a graph.

The disk and LUN stats are combined into the same CSV file.  It is entirely possible to rewrite the script to include only one or the other.  I put them both in there to make it easier to manually create a graph in Excel for either disk or LUN stats at a later time (if necessary).  The "all disks" graph does not look any different with both disk and LUN stats in there; I tried it both ways, and they overlap in a way that makes the extra data indistinguishable in the image.

The resulting data output after running the iostats.sh script is shown below.  I now have a nice, neat spreadsheet that lists the total throughput for each disk in the array for the entire day in 15 minute increments.  Having the data formatted this way makes it very easy to create charts, but I don't want to do that manually every day; I want the charts to be created automatically.

                                      3/28/11 00:15   3/28/11 00:30   3/28/11 00:45   3/28/11 01:00
Total Throughput IO per sec - 0_0_0   12              20              23              23
Total Throughput IO per sec - 0_0_1   30.12           33.23           40.4            60.23
Total Throughput IO per sec - 0_0_2   1.82            3.3             5.4             7.8
Total Throughput IO per sec - 0_0_3   80.62           13.33           90.4            10.3

Step 3:

Now I want to automatically create the graphs every day using a Perl script.  After the CSV files are exported to a more usable format in the previous step, I use the GD::Graph library from CPAN (http://search.cpan.org/~mverb/GDGraph-1.43/Graph.pm) to auto-generate the graphs.

Below is a sample Perl script that will auto-generate a great looking graph based on the CSV output file from the previous step.

#!/usr/bin/perl

#Declare the libraries that will be used.

use strict;

use Text::ParseWords;

use GD::Graph::lines;

use Data::Dumper;

#Specify the csv file that will be used to create the graph

my $file = 'C:\cygwin\home\<username>\iostats_disk.csv';

#my $file  = $ARGV[0];

my ($output_file) = ($file =~/(.*)\./);

#Create the arrays for the data and the legends

my @data;

my @legends;

#parse csv, generate an error if it fails

open(my $fh, '<', $file) or die "Can't read csv file '$file' [$!]\n";

my $countlines = 0;

while (my $line = <$fh>) {

chomp $line;

my @fields = Text::ParseWords::parse_line(',', 0, $line);

#There are 95 fields generated to correspond to the 95 data collection points in each of the output files.

my @field = @fields[1..95];

push @data, \@field;

if($countlines >= 1){

push @legends, $fields[0];

}

$countlines++;

}

#The data and legend arrays will read 820 lines of the CSV file.  This number will change based on the number of disks in the SAN, and will be different depending on the SAN being reported on.  The legend info will read the first column of the spreadsheet and create a color box that corresponds to the graph line.  For the purpose of this graph, I won’t be using it because 820+ legend entries look like a mess on the screen.

splice @data, 1, -820;

splice @legends, 0, -820;

#Set Graphing Options

my $mygraph = GD::Graph::lines->new(1024, 768);

# There are many graph options that can be changed using the GD::Graph library.  Check the website (and google) for lots of examples.

$mygraph->set(

title => 'SP IO Utilization (00:15 - 23:45)',

y_label => 'IOs Per Second',

y_tick_number => 4,

values_vertical => 6,

show_values => 0,

x_label_skip => 3,

) or warn $mygraph->error;

#As I said earlier, because of the large number of legend entries for this type of graph, I change the legend to simply read “All disks”.  If you want the legend to actually put the correct entries and colors, use this line instead:  $mygraph->set_legend(@legends);

$mygraph->set_legend('All Disks');

#Plot the data

my $myimage = $mygraph->plot(\@data) or die $mygraph->error;

# Export the graph as a gif image.  The images are currently moved to the IIS folder (c:\inetpub\wwwroot) with one of the scripts.  They could also be emailed using a sendmail utility.

my $format = $mygraph->export_format;

open(IMG,">$output_file.$format") or die $!;

binmode IMG;

print IMG $myimage->gif;

close IMG;

After this script runs, the resulting image file will be saved in the cygwin home directory (the same directory the CSV file is located in).  One of my nightly scripts copies the image to our internal IIS server's image directory, and sendmail emails the graph to the SAN admin team.
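
The copy-and-email step itself is simple.  Here's a hedged sketch of what a nightly publish script might look like; mutt is just one mailer that handles attachments cleanly (plain sendmail needs MIME encoding help), and all of the paths and addresses are placeholders.

#!/usr/bin/bash
# Publish the nightly graph to the IIS image directory and mail it to the SAN team.
graph=/home/sanadmin/iostats_disk.gif    # placeholder paths and addresses throughout
cp "$graph" /cygdrive/c/inetpub/wwwroot/images/
echo "Daily SAN throughput graph attached." | mutt -s "Daily SAN graphs" -a "$graph" -- san-team@example.com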

That’s it!  You now have lots of pretty graphs with which you can impress your management team. 🙂

Here is a sample graph that was generated with the Perl script:

Filesystem Alignment

You're likely to have seen the filesystem alignment check fail on most, if not all, of the EMC HEAT reports that you run on your Windows 2003 servers.  The starting offset for partition 1 should optimally be a multiple of 128 sectors.  So, how do you fix this problem, and what does it mean?

If you align the partition to 128 blocks (or 64 KB, as each block is 512 bytes), then you don't cross a track boundary and thereby issue the minimum number of IOs.  Issuing the minimum number of IOs sounds good, right? 🙂

Because NTFS reserves 31.5 KB of signature space, if a LUN has an element size of 64 KB with the default alignment offset of 0 (both are default Navisphere settings), a 64 KB write to that LUN would result in a disk crossing even though it would seem to fit perfectly on the disk.  A disk crossing can also be referred to as a split IO because the read or write must be split into two or more segments.  In this case, 32.5 KB would be written to the first disk and 31.5 KB would be written to the following disk, because the beginning of the stripe is offset by 31.5 KB of signature space.  This problem can be avoided by providing the correct alignment offset.  Each alignment offset value represents one block.  Therefore, EMC recommends setting the alignment offset value to 63, because 63 times 512 bytes is 31.5 KB.

Checking your offset:

1. Launch System Information in windows (msinfo32.exe)

2. Select Components -> Storage -> Disks.

3. Scroll to the bottom and you will see the partition starting offset information.  This number needs to be evenly divisible by 4096; if it isn't, your partition is not properly aligned.  For the 64 KB boundary recommended above, it should also be divisible by 65536.
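
Given the starting offset reported by msinfo32, the check itself is simple arithmetic.  The offset below is the classic misaligned default of 63 sectors; substitute your own value.

#!/usr/bin/bash
# Check a partition starting offset (in bytes) against the 4 KB and 64 KB boundaries.
offset=32256    # example: 63 sectors x 512 bytes, the classic misaligned default
for boundary in 4096 65536; do
  if (( offset % boundary == 0 )); then
    echo "Offset $offset is aligned to a $boundary byte boundary"
  else
    echo "Offset $offset is NOT aligned to a $boundary byte boundary"
  fi
done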

Correcting your starting offset:

Launch diskpart:

C:\>diskpart

DISKPART> list disk

Two disks should be listed

DISKPART> select disk 1

This selects the second disk drive

DISKPART> list partition

This step should give a message “There are no partitions on this disk to show”.  This confirms a blank disk.

DISKPART> create partition primary align=64

That’s it.  You now have a perfectly aligned disk.