Tag Archives: emc

ProSphere 1.6 Updates

ProSphere 1.6 was released this week, and it looks like EMC was listening!  Several of the updates are features that I specifically requested when I gave my feedback to EMC at EMC World.  I’m sure it’s just a coincidence, but it’s good to finally see some valuable improvements that bring this product that much closer to being useful in my company’s environment.  The most important items I wanted to see were the ability to export performance data to a CSV file and improved documentation for the REST API.  Both were included with this release.  I haven’t checked yet whether the performance exports can be run from a command line (a requirement for them to be useful to me for scripting).  The REST API documentation was created in the form of a help file.  It can also be downloaded and run from an internal web server, which is what I did.

Here are the new features in v1.6:

Alerting

ProSphere can now receive Brocade alerts for monitoring and analysis. These alerts can be forwarded through SNMP traps.

Consolidation of alerts from external sources is now extended to include:

• Brocade alerts (BNA and CMCNE element managers)

• The following additional Symmetrix Management Console (SMC) alerts:
– Device Status
– Device Pool Status
– Thin Device Allocation
– Director Status
– Port Status
– Disk Status
– SMC Environmental Alert

Capacity

– Support for Federated Tiered Storage (FTS) has been added, allowing ProSphere to identify LUNs that have been presented from external storage and logically positioned behind the VMAX 10K, 20K, and 40K.

– Service Levels are now based on the Fully Automated Storage Tier (FAST) policies defined in Symmetrix arrays. ProSphere reports on how much capacity is available for each Service Level, and how much is being consumed by each host in the environment.

Serviceability

– Users can now export ProSphere reports for performance and capacity statistics in CSV format.

Unisphere for VMAX 1.0 compatibility

– ProSphere now supports the new Unisphere for VMAX as well as Single Sign On and Launch-in-Context to the management console of the Unisphere for VMAX element manager. ProSphere, in conjunction with Unisphere for VMAX, will have the same capabilities as Symmetrix Management Console and Symmetrix Performance Analyzer.

Unisphere support

– In this release, you can launch Unisphere (CLARiiON, VNX, and Celerra) from ProSphere, but without the benefits of Single Sign On and Launch-in-Context.

Performance Data Collection/Discovery issues in ProSphere 1.5.0

I was an early adopter of ProSphere 1.0; it was deployed at all of our data centers globally within a few weeks of its release.  I gave up on 1.0.3, as the syncing between instances didn’t work and EMC told me that it wouldn’t be fixed until the next major release.  Well, that next major release was 1.5, so I jumped back in when it was released in early March 2012.

My biggest frustration initially was that performance data didn’t seem to be collected for any of the arrays.  I was able to discover all of the arrays, but there wasn’t any detailed information available for any of them: no LUN detail, no performance data.  Why?  It turns out ProSphere’s data collection is extremely dependent on a full path discovery, from host to switch to array.  Simply discovering the arrays by themselves isn’t sufficient; unless at least one host sees the complete path, performance collection on the array is never triggered.

With that said, my next step was to get everything properly discovered.  Below is an overview of what I did to get everything discovered and performance data collection working.

1 – Switch Discovery.

Because an SMI-S agent is required to discover arrays and switches, you’ll need a separate server to run the SMI-S agents. I’m using a Windows 2008 server.  If you want to keep data collection separated between geographical locations, you’ll need to install separate instances of ProSphere at each site and have separate SMI-S agent servers at each site.  The instances can then be synchronized together in a federated configuration (in Admin | System | Synchronize ProSphere Applications).

We use Brocade switches, so I initially downloaded and installed the Brocade SMI-S agent.  It can be downloaded directly from Brocade here:  http://www.brocade.com/services-support/drivers-downloads/smi-agent/index.page.  I installed 120.9.0 and had some issues with discoveries.  EMC later told me that I needed to use 120.11.0 or later, which didn’t seem to be available on Brocade’s website.  After speaking to an EMC rep regarding the Brocade SMI-S agent version issue, it was recommended that I use EMC’s software instead; either should work, however.  You can use the SMI-S agent that’s included with Connectrix Manager Converged Network Edition (CMCNE).  The product itself requires a license, but you don’t need one to use only the SMI-S agent.  After installation, launch “C:\CMCNE 11.1.4\bin\smc.bat” and click the Configure SMI Agent button to add the IPs of your switches.  The one issue I ran into with this was user security: only one user ID and password can be used across all switches, so you may need to create a new ID and password on every switch.  I had to do that and spent about half a day finishing it up.  Once you’ve added the switches, use the IP of the host that the agent is installed on as your target for switch discovery in ProSphere.  The default user ID and password are administrator / password.

Make sure that port 5988 is open on the server you’re running this agent on.  If it’s Windows 2008, disable the Windows Firewall or add exceptions for ports 5988 and 5989 as well as for the SMI-S processes ECOM and SLPD.exe.
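
If you’re building the agent server with a script, something like the following should open the required ports on Windows 2008 (a rough sketch; the rule names are arbitrary, and you may prefer program-based exceptions for ECOM and SLPD.exe instead):

netsh advfirewall firewall add rule name="SMI-S CIM-XML" dir=in action=allow protocol=TCP localport=5988
netsh advfirewall firewall add rule name="SMI-S CIM-XML SSL" dir=in action=allow protocol=TCP localport=5989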

2 – EMC Array Discovery

I had initially downloaded and installed the Solutions Enabler vApp, thinking that it would work for my Clariion and VNX discoveries.  I was told later (after opening an SR) that it does not provide SMI-S services.  EMC has its own SMI-S agent that will need to be installed on a separate server, as it uses the same ports (5988/5989) as the Brocade agent (or CMCNE).  It can be downloaded here:  http://powerlink.emc.com/km/appmanager/km/secureDesktop?_nfpb=true&_pageLabel=servicesDownloadsTemplatePg&internalId=0b014066800251b8&_irrt=true, or by navigating in Powerlink to Home > Support > Software Downloads and Licensing > Downloads S > SMI-S Provider.

Once EMC’s SMI-S agent is installed, you’ll need to add your arrays to it.  Open a command prompt, navigate to C:\Program Files\EMC\ECIM\ECOM\bin, and launch testsmiprovider.  When prompted, choose “localhost” and port 5988, and use admin / #1Password as the login credentials.  Once logged in, you can use the “addsys” command to add the IPs of your arrays.
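
For reference, a session with testsmiprovider looks roughly like this (a sketch from memory, so treat the prompts as approximate; the SP IPs are placeholders):

cd "C:\Program Files\EMC\ECIM\ECOM\bin"
testsmiprovider
(accept the defaults when prompted: no_ssl connection, localhost, port 5988, admin / #1Password)
addsys
(choose the array type and enter the SP IP addresses, e.g. <SPA_IP> and <SPB_IP>)
dv
(should display the provider version and list the arrays it now manages, confirming the addsys worked)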

Just like before, make sure that port 5988 is open on the server you’re running this agent on, and disable the Windows Firewall or add exceptions for ports 5988 and 5989.  You’ll again use the IP of the host that the agent is installed on as your target for array discovery.

3 – Host Discoveries

Host discoveries can be done directly without an agent.  You can use the root password for UNIX or ESX, and any AD account in Windows that has local administrator rights on each server.  Of course, you can also set up specialized service accounts with the appropriate rights based on your company’s security requirements.

4 – Enable Path Data Collection

In order to see specific information about LUNs on the array, you will need to enable Path Performance Collection for each host.  If the host isn’t discovered and performance collection isn’t enabled, you won’t see any LUN information when looking at arrays.  To enable it, go to Discovery | Objects list | Hosts from the ProSphere console and click on the “On | Off” slider button to turn it on for each host.

5 – Verify Full path connectivity

Once all of the discoveries are complete, you can verify full path connectivity for an array by going to Discovery | Objects list | Arrays, clicking on any array, and looking at the map.  If there is a cloud representing a switch with a red line to the array, you’re seeing the path.  You can use the same method for a host: go to Discovery | Objects list | Hosts and click on a host, and you should see the host, the switch fabric, and the array on the map.  If you don’t see that full path, no data will be collected.

Comments and Thoughts

You can go directly to EMC’s community forum for general support and information here:   https://community.emc.com/community/support/prosphere.

After using ProSphere 1.5.0 for a little while now, I must say it’s definitely NOT Control Center.  It isn’t as advanced or full-featured, but I don’t think it’s supposed to be.  It’s meant to be an easy-to-deploy tool for getting basic, useful information quickly.

I use the pmcli.exe command line tool in ECC extensively for custom data exports and reporting, and ProSphere does not provide a similar tool.  EMC does have an API built in to ProSphere that can be used to pull information over HTTP (for example, to get a host list, browse to https://<prosphere_app_server>/srm/hosts).  I haven’t done much research into that feature yet.  Version 1.5 added API support for array capacity, performance, and topology data.  You can read more about it in the white paper titled “ProSphere Integration: An overview of the REST API’s and information model” (h8893), which should be available on Powerlink.
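
As an example, pulling that host list into a script is just an HTTP GET (a sketch; I’m assuming basic authentication with a ProSphere account here, and -k simply skips validation of the appliance’s self-signed certificate):

curl -k -u <prosphere_user>:<password> https://<prosphere_app_server>/srm/hosts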

My initial upgrade from 1.0.3 to 1.5.0 did not go smoothly; I had about a 50% success rate across all of my installations.  In some cases the upgrade would complete but services wouldn’t start afterwards, and in one case the update web page simply stayed blank and never let me run the upgrade at all.  Beware of upgrades if you want to preserve existing historical data; I ended up deleting the vApp and starting over for most of my deployments.

I’ve only recently been able to get all of my discoveries completed.  I feel that the ProSphere documentation is somewhat lacking; I found myself wanting more detail in many areas.  Most of my time has been spent on trial-and-error testing, with a little help from EMC support after I opened an SR.  I’ll write a more detailed post in the future about actually using ProSphere in the real world once I’ve had more time with it.

Other items to note:

-ProSphere does not provide the same level of detail for historical data that you get in StorageScope, nor does it give the same amount of detail as Performance Manager.  It’s meant more for a quick “at a glance” view.

-ProSphere does not include the root password in the documentation; customers aren’t necessarily supposed to log in to the console.  I’m sure a call to support could get it for you.  Having the ability to at least start and stop services would be useful, as I had an issue with one of my upgrades where services wouldn’t start.  You can view the status of the services on any ProSphere server by navigating to https://<prosphere_app_server>/cgi-bin/mon.cgi?command=query_opstatus_full (a scripted example is at the end of this list).

-ProSphere doesn’t gather the same level of detail about hosts and storage as ECC, but that’s the price you pay for agentless discovery.  Agents are needed for more detailed information.
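
As mentioned above, the service status page can also be polled from a script, which is a reasonable substitute for not having console access (a sketch using the same URL):

curl -k "https://<prosphere_app_server>/cgi-bin/mon.cgi?command=query_opstatus_full"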

How to troubleshoot EMC Control Center WLA Archive issues

We’re running EMC Control Center 6.1 UB12, and we use it primarily for its robust performance data collection and reporting capabilities.  Performance Manager is a great tool and I use it frequently.

Over the years I’ve had occasional issues with the WLA Archives not collecting performance data and I’ve had to open service requests to get it fixed.  Now that I’ve been doing this for a while, I’ve collected enough info to troubleshoot this issue and correct it without EMC’s assistance in most cases.

Check your ..\WLAArchives\Archives directory and look under the Clariion (or Celerra) folder, then the folder with your array’s serial number, then the interval folder.  This is where the “*.ttp” (text) and “*.btp” (binary) performance data files for Performance Manager are stored.  Sort by date; if a new file hasn’t been written in the last few hours, data is not being collected.
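
A quick way to do that check from a command prompt on the Agent server (a sketch; substitute your own ECC install root, array serial number, and interval folder):

cd /d "<ECC_install_root>\WLAArchives\Archives\Clariion\<array_serial>\interval"
dir /O-D

The /O-D switch sorts newest first, so if the file at the top of the listing is more than a few hours old, collection has stopped.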

Here are the basic items I generally review when data isn’t being collected for an array:

  1. Log in to every array in Unisphere, go to System Properties, and on the ‘General’ tab make sure statistics logging is enabled.  I’ve found that if you don’t have an Analyzer license on your array and start the 7-day data collection for a “naz” file, the stats logging option will be disabled once the 7 days are up.  You’ll have to go back in and re-enable it after the 7-day collection is complete.  If stats logging isn’t enabled on the array, the WLA data collection will fail.
  2. If you recently changed the password on your Clariion domain account, make sure that naviseccli is updated with the new credentials for all of your arrays (use the “addusersecurity” CLI option) and perform a rediscovery of all of your arrays from within the ECC console.  There is no way from within the ECC console to update the password on an array; you must go through the discovery process again for all of them.  (See the naviseccli sketch after this list.)
  3.  Verify the agents are running.  In the ECC console, click on the gears icon in the lower right-hand corner.  A window will open that shows the status of all the agents, including the WLA Archiver.  If WLA isn’t started, you can start it by right-clicking on any array and choosing Agents, then Start.  Check the WLAArchives directories again (after waiting about an hour) and see if it’s collecting data.
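
For steps 1 and 2, there are also naviseccli equivalents if you’d rather script the checks (a sketch; verify the exact syntax against your CLI version before relying on it):

naviseccli -h <SP_IP> setstats -on
naviseccli -AddUserSecurity -user <domain_account> -password <password> -scope 0

The first command enables statistics logging on an SP; the second refreshes the credentials stored on the agent host that naviseccli uses to reach the arrays.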

If those basic steps don’t work, checking the logs may point you in the right direction:

  1.  Review the Clariion agent logs for errors.  You’re not looking for anything specific here; just search for “error”, “unreachable”, or the specific IPs of your arrays and see if anything obvious is wrong.
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Bx.log.gz
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL.ini
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Err.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Bx_Err.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Discovery.log.gz
 

Here’s an example of an error I found in one case:

            MGL 14:10:18 C P I 2536   (29.94 MB) [MGLAgent::ProcessAlert] => Processing SP
            Unreachable alert. MO = APM00100600999, Context = Clariion, Category = SP
            Element = Unreachable
 

  2.  Review the WLA Agent logs.  Again, just search for errors and see if there is anything obviously wrong.

            %ECC_INSTALL_ROOT%\exec\ENW610\ENW.log
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Bx.log.gz
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW.ini
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Err.log
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Bx_Err.log
 

If the logs don’t show anything obvious, here are the steps I take to restart everything.  This has worked on several occasions for me.

  1. From the Control Center console, stop all agents on the ECC Agent server.  Do this by right-clicking on the Agent server (in the left pane) and choosing Agents, then Stop.  Follow the prompts from there.
  2. Log in to the ECC Agent server console and stop the Master Agent.  You can do this in Computer Management | Services by stopping the service titled “EMC ControlCenter Master Agent”.
  3. From the Control Center console, stop all agents on the Infrastructure server.  Do this by right-clicking on the Infrastructure server (in the left pane) and choosing Agents, then Stop.  Follow the prompts from there.
  4. Verify that all services have stopped properly.
  5. From the ECC Agent server console, go to C:\Windows\ECC\ and delete all .comfile and .lck files (steps 2 and 5 can also be run from a command prompt; see the sketch after this list).
  6. Restart all agents on the Infrastructure server.
  7. Restart the Master Agent on the Agent server.
  8. Restart all other services on the Agent server.
  9. Verify that all services have restarted properly.
  10. Wait at least an hour and check to see if the WLA Archive files are being written.
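
For reference, steps 2 and 5 can be done from a command prompt on the Agent server (a sketch; the service name below is the display name shown in the Services console):

net stop "EMC ControlCenter Master Agent"
del /q C:\Windows\ECC\*.comfile
del /q C:\Windows\ECC\*.lck

(then, once the Infrastructure agents are back up per step 6)
net start "EMC ControlCenter Master Agent"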

If none of these steps resolve your problem and you don’t see any errors in the logs, it’s time to open an SR with EMC.  I’ve found the EMC staff that supports ECC to be very knowledgeable and helpful.

 

 

Errors when creating new replication jobs

I was attempting to create a new replication job on one of our VNX5500s and was receiving several errors when selecting our DR NS-960 as the ‘destination Celerra network server’.

It was displaying the following errors at the top of the window:

– “Query VDMs All.  Cannot access any Data Mover on the remote system, <celerra_name>”.  The error details directed me to check that all the Data Movers are accessible, that the time difference between the source and destination doesn’t exceed 10 minutes, and that the passphrase matches.  I confirmed that all of those were fine.

– “Query Storage Pools All.  Remote command failed:\nremote celerra – <celerra_name>\nremote exit status =0\nremote error = 0\nremote message = HTTP Error 500: Internal Server Error”.  The error details on this message just say to search Powerlink, which isn’t a very useful description.

– “There are no destination pools available”.  The details on this error say to check available space on the destination storage pool.  There is 3.5TB available in the pool I want to use on the destination side, so that wasn’t the issue either.

All existing replication jobs were still running fine so I knew there was not a network connectivity problem.  I reviewed the following items as well:

– I was able to validate all of the interconnects successfully, that wasn’t the issue.

– I ran nas_cel -update on the interconnects on both sides and received no errors, but it made no difference.

– I checked the server logs and didn’t see any errors relating to replication.

Not knowing where to look next, I opened an SR with EMC.  As it turns out, it was a security issue.

About a month ago an EMC CE accidentally deleted our global security accounts during a service call.  I had recreated all of the deleted accounts and didn’t think there would be any further issues.  As it turns out, logging in with the re-created nasadmin account after the accidental deletion was the root cause of the problem.  Here’s why:

The Clariion global user account is tied to a local user account on the control station in /etc/passwd.  When nasadmin was recreated in the domain, it attempted to create the nasadmin account on the control station as well.  Because that account already existed locally on the control station, it created a local account named ‘nasadmin1’ instead, which is what caused the problem.  The two nasadmin accounts were no longer synchronized between the Celerra and the Clariion domain, so logging in with the global nasadmin account no longer tied you to the local nasadmin account on the control station.  Deleting all nasadmin accounts from the global domain and from the local /etc/passwd on the Celerra, and then recreating nasadmin in the domain, solves the problem.  Because the issue was limited to the nasadmin account in this case, I could also have solved it by simply creating a new global account (with administrator privileges) and using that to create the replication job.  I tested that as well and it worked fine.
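
If you suspect you’ve hit the same issue, a quick look at the local accounts on the control station will confirm it (a sketch):

grep nasadmin /etc/passwd

Seeing both a nasadmin and a nasadmin1 entry is the telltale sign that the global and local accounts are out of sync.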

Problem with soft media errors on SSD drives and FastCache

4/25/2012 Update:  EMC has released a fix for this issue.  Call your account service representative and tell them you need to upgrade your NS-960 DART code to 6.0.55.300 and FLARE to 4.30.000.5.524, plus a drive firmware upgrade to TC3Q on all of your SSD drives.

Do you have FAST Cache enabled on your array?  Keep a close eye on your SP event logs for soft media errors on your SSD drives.  I just noticed over 2,000 soft media errors on one of my FAST Cache-enabled arrays and found a technical advisory from EMC (emc282741) that describes this as a potentially critical problem.  I just opened a case with EMC to have my array reviewed for a possible disk replacement.  If a second disk drive in the same FAST Cache RAID group encounters soft media errors before the system automatically retires the first drive, a dual-faulted RAID group could occur.  This can result in storage pools going offline and becoming completely inaccessible to the attached hosts.  That’s basically a total SAN outage, not good.

Look for errors like the following in your SP event logs:

“Date Stamp”  “Time Stamp” Bus1 Enc1 Dsk0  820 Soft Media Error [Bad block]

EMC states in emc282741 that enhancements targeted for Q1 2012 will address SSD media errors and dual hardware faults, but in the meantime, make sure you review the SP logs if you have CLARiiON or VNX arrays that are configured with SSD drives or are using FAST Cache.  If any instance of the “Soft Media Error” listed above is associated with one of the solid state drives in your arrays, the array should be upgraded to at least FLARE release 04.30.000.5.522 (for CX4 series arrays) or 05.31.000.5.509 (for VNX series arrays), and you should then start a Proactive Copy (PACO) to a hot spare and replace the drive as soon as possible.
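
To quickly confirm which FLARE revision an array is running before deciding whether it needs the upgrade, naviseccli can pull it for you (a sketch; getagent reports the revision along with other SP details):

naviseccli -h clariion1a getagent | grep -i revision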

In order to quickly review this on each of my arrays, I wrote the following script to update my intranet site with a report every morning:

# Dump the full SP event log from each SP
naviseccli -h clariion1a getlog >clariion1a.txt
naviseccli -h clariion1b getlog >clariion1b.txt
# Pull out only the soft media errors from both SPs
cat clariion1a.txt | grep -i 'soft media' >clariion1_softmedia_errors.csv
cat clariion1b.txt | grep -i 'soft media' >>clariion1_softmedia_errors.csv
# Convert the CSV output to HTML and copy it to the intranet web server
./csv2htm.pl -e -T -i /home/scripts/clariion1_softmedia_errors.csv -o /<intranet_web_server>/clariion1_softmedia_errors.html
 

The script dumps the entire SP log from each SP into a text file, greps for only soft media errors in each file, then converts the output to HTML and writes it to my intranet web server.
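
Since the report is meant to refresh every morning, the whole thing can simply be scheduled from cron on the server that runs naviseccli (a sketch; the wrapper script name and the 6:00 AM run time are my own placeholders):

# run the soft media error report daily at 6:00 AM
0 6 * * * /home/scripts/soft_media_report.sh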

 

Critical Celerra FCO for Mac OS X 10.7

Here’s the description of this issue from EMC:

EMC has become aware of a compatibility issue with the new version of MacOS 10.7 (Lion). If you install this new MacOS 10.7 (Lion) release, it will cause the data movers in your Celerra to reboot which may result in data unavailability. Please note that your Celerra may encounter this issue when internal or external users operating the MacOS 10.7 (Lion) release access your Celerra system.  In order to protect against this occurrence, EMC has issued the earlier ETAs and we are proactively upgrading the NAS code operating on your affected Celerra system. EMC strongly recommends that you allow for this important upgrade as soon as possible.

Wow.  A single Mac user running 10.7 can cause a kernel panic and a reboot of your data mover.  According to our rep, Apple changed the SMB implementation in 10.7 and added a few commands that the Celerra doesn’t recognize.  This is a big deal, folks; call your account rep for a DART upgrade if you haven’t already.

Filesystem Alignment

You’ve likely seen the filesystem alignment check fail on most, if not all, of the EMC HEAT reports that you run against your Windows 2003 servers.  The starting offset for partition 1 should optimally be a multiple of 128 sectors.  So how do you fix this problem, and what does it mean?

If you align the partition to 128 blocks (64 KB, since each block is 512 bytes), you don’t cross a track boundary and thereby issue the minimum number of IOs.  Issuing the minimum number of IOs sounds good, right? 🙂

Because NTFS reserves 31.5 KB of signature space, if a LUN has an element size of 64 KB with the default alignment offset of 0 (both are default Navisphere settings), a 64 KB write to that LUN would result in a disk crossing even though it would seem to fit perfectly on the disk.  A disk crossing can also be referred to as a split IO, because the read or write must be split into two or more segments.  In this case, 32.5 KB would be written to the first disk and 31.5 KB would be written to the following disk, because the beginning of the stripe is offset by 31.5 KB of signature space.  This problem can be avoided by providing the correct alignment offset.  Each alignment offset value represents one block.  Therefore, EMC recommends setting the alignment offset value to 63, because 63 times 512 bytes is 31.5 KB.

Checking your offset:

1. Launch System Information in windows (msinfo32.exe)

2. Select Components -> Storage -> Disks.

3. Scroll to the bottom and you will see the partition starting offset information.  This number needs to be evenly divisible by 4096; if it’s not, your partition is not properly aligned.  (A command-line alternative is shown below.)
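
If you’d rather check this remotely or across a batch of servers, the same offset is available from the command line (a sketch; wmic reports StartingOffset in bytes, and the default Windows 2003 offset of 32,256 bytes, or 63 sectors, is the misaligned case the HEAT report flags):

wmic partition get Name,StartingOffset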

Correcting your starting offset:

Launch diskpart:

C:\>diskpart

DISKPART> list disk

Two disks should be listed

DISKPART> select disk 1

This selects the second disk drive

DISKPART> list partition

This step should give a message “There are no partitions on this disk to show”.  This confirms a blank disk.

DISKPART> create partition primary align=64

That’s it.  You now have a perfectly aligned disk.