
PowerPath Install / Upgrade Issues

I recently had several issues when attempting to upgrade a Windows 2008 server from PowerPath v5.3 to v5.5 SP1. I uninstalled 5.3 using the Windows uninstall utility, rebooted, then installed v5.5 SP1. After the reboot, the server did not come back up; to get it to boot, I had to choose the “last known good configuration” option. I opened an SR with EMC, and they determined that the uninstall process was not completing correctly.

To resolve the problem, run the PowerPath install executable from the command line with a few extra parameters to force a complete removal. The name of the install file will vary depending on the version you are removing, but the command looks like this:

EMCPower.Net32.signed.5.3.b310.exe /v"/L*v C:\logs\PPremove.log NO_REBOOT=1 PPREMOVE=ALL"

In this example, the C:\logs directory must exist before you run the command. After using that command to uninstall PowerPath and then installing the new version, I no longer had the problem of the server not booting correctly.
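
If you want to confirm the removal actually completed before reinstalling, the verbose log can be checked for Windows Installer failure markers (“Return value 3” is the standard failure code in a verbose MSI log). A quick check from the command prompt:

findstr /i /c:"error" /c:"return value 3" C:\logs\PPremove.log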

After properly installing it, I continued to have a problem with PowerPath Administrator not recognizing the devices. All of the devices showed up as “DEV ??”, and I also saw “harddisk ??” when running powermt display dev=all. To resolve the problem, I ran through the following steps:

1. In Device Manager, under Disk drives, right-click the device that has a yellow ‘!’.
2. Choose “Update Driver Software”.
3. Click on “Browse my computer for driver software”.
4. Click on “Let me pick from a list of device drivers on my computer”.
5. In the next screen make sure the “Show compatible hardware” box is checked.
6. Under the Model list you should see the ‘PowerPath Devices’ driver. Highlight it and click Next to install the PowerPath driver. When the installation completes, it will require a reboot. Once the server has come back online, run ‘powermt display dev=all’ again to verify that harddisk?? has changed to harddisk## as expected.


Close requests fail on data mover when using Riverbed Steelhead appliance

We recently had a problem with one of our corporate applications having file close requests fail, resulting in 200,000+ open files on our production data mover.  This was causing numerous issues within the application.  We determined that the problem was a result of our Riverbed Steelhead appliance requiring a certain level of DART code in order to properly close the files.  The Steelhead appliance would fail when attempting to optimize SMBv2 connections.

Because a DART code upgrade is required to resolve the problem, the only temporary fix is to reboot the data mover.  I wrote a quick script on the Celerra to grab the number of open files, write it to a text file, and publish to our internal web server.  The command to check how many open files are on the data mover is below.

This command provides all the detailed information:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Timestamp   Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
11:15:36     3379      905     9584        11        9      272        30         1856     4915   

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Summary     Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
Minimum      3379      905     9584        11        9      272        30         1856     4915   
Average      3379      905     9584        11        9      272        30         1856     4915   
Maximum      3379      905     9584        11        9      272        30         1856     4915   

By adding a grep for Maximum and using awk to grab only the last column, this command outputs just the number of open files rather than the large table above:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | grep Maximum | awk '{print $10}'

The output of that command would simply be ‘4915’ based on the sample full output I used above.
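
Here’s a minimal sketch of the kind of script I scheduled on the Celerra control station to capture the count and publish it (the output path and web server destination are placeholders; adjust for your environment):

#!/bin/bash
# Sketch: append a timestamped open-file count to a text file.
OUTFILE=/home/nasadmin/openfiles.txt
COUNT=$(/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | grep Maximum | awk '{print $10}')
echo "$(date '+%Y-%m-%d %H:%M:%S') $COUNT" >> $OUTFILE
# Copy the file to the internal web server (hypothetical host and path)
scp $OUTFILE nasadmin@webserver:/var/www/html/openfiles.txt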

The solution number from Riverbed’s knowledgebase is S16257.  Your DART code needs to be at least 6.0.60.2 or 7.0.52.  You will also see a message in your Steelhead logs similar to the one below, indicating that the close request failed for a particular file:

Sep 1 18:19:52 steelhead port[9444]: [smb2cfe.WARN] 58556726 {10.0.0.72:59207 10.0.0.72:445} Close failed for fid: 888819cd-z496-7fa2-2735-0000ffffffff with ntstatus: NT_STATUS_INVALID_PARAMETER

Performance Data Collection/Discovery issues in ProSphere 1.5.0

I was an early adopter of ProSphere 1.0; it was deployed at all of our data centers globally within a few weeks of its release.  I gave up on 1.0.3, as the syncing between instances didn’t work and EMC told me that it wouldn’t be fixed until the next major release.  Well, that next major release was 1.5, so I jumped back in when it was released in early March 2012.

My biggest frustration initially was that performance data didn’t seem to be collected for any of the arrays.  I was able to discover all of the arrays, but there wasn’t any detailed information available for any of them.  No LUN detail, no performance data.  Why?  Well, it seems ProSphere data collection is extremely dependent on a full path discovery, from host to switch to array.  Simply discovering the arrays by themselves isn’t sufficient.  Unless at least one host sees the complete path, performance collection on the array is not triggered.

With that said, my next step was to get everything properly discovered.  Below is an overview of what I did to get everything discovered and performance data collection working.

1 – Switch Discovery

Because an SMI-S agent is required to discover arrays and switches, you’ll need a separate server to run the SMI-S agents. I’m using a Windows 2008 server.  If you want to keep data collection separated between geographical locations, you’ll need to install separate instances of ProSphere at each site and have separate SMI-S agent servers at each site.  The instances can then be synchronized together in a federated configuration (in Admin | System | Synchronize ProSphere Applications).

We use Brocade switches, so I initially downloaded and installed the Brocade SMI-S agent.  It can be downloaded directly from Brocade here:  http://www.brocade.com/services-support/drivers-downloads/smi-agent/index.page.  I installed 120.9.0 and had some issues with discoveries.  EMC later told me that I needed to use 120.11.0 or later, which didn’t seem to be available on Brocade’s website.  When I raised the version issue with an EMC rep, they recommended that I use EMC’s software instead; either should work, however.  You can use the SMI-S agent that’s included with Connectrix Manager Converged Network Edition (CMCNE).  The product itself requires a license, but you do not need a license to use only the SMI-S agent.  After installation, launch “C:\CMCNE 11.1.4\bin\smc.bat” and click on the Configure SMI Agent button to add the IPs of your switches.  The one issue I ran into with this was user security.  Only one userid and password can be used across all switches, so you may need to create a common ID and password on every switch; I had to do that, and it took me about half a day to finish.  Once you add the switches, use the IP of the host that the agent is installed on as your target for switch discovery in ProSphere.  The default userid and password is administrator / password.

Make sure that port 5988 is open on the server you’re running this agent on.  If it is Windows 2008, disable the Windows Firewall or add an exception for ports 5988 and 5989, as well as for the SMI-S processes ECOM and SLPD.exe.
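
On Windows 2008, one way to add the port exceptions looks like this (the rule name is arbitrary; you may prefer per-program exceptions for ECOM and SLPD.exe instead):

netsh advfirewall firewall add rule name="SMI-S ECOM" dir=in action=allow protocol=TCP localport=5988-5989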

2 – EMC Array Discovery

I had initially downloaded and installed the Solutions Enabler vApp thinking that it would work for my Clariion & VNX discoveries.  I was told later (after opening an SR) that it does not provide SMI-S services.  EMC has their own SMI-S agent that will need to be installed on a separate server, as it uses the same ports (5988/5989) as the Brocade agent (or CMCNE).  It can be downloaded here:  http://powerlink.emc.com/km/appmanager/km/secureDesktop?_nfpb=true&_pageLabel=servicesDownloadsTemplatePg&internalId=0b014066800251b8&_irrt=true, or by navigating in Powerlink to Home > Support > Software Downloads and Licensing > Downloads S > SMI-S Provider.

Once EMC’s SMI-S agent is installed, you’ll need to add your arrays to it.  Open a command prompt, navigate to C:\Program Files\EMC\ECIM\ECOM\bin, and launch testsmiprovider.  When it prompts, accept “localhost” and port 5988, and use admin / #1Password as the login credentials.  Once logged in, you can use the “addsys” command to add the IPs of your arrays.

Just like before, make sure that port 5988 is open on the server you’re running this agent on, and disable the Windows Firewall or add an exception for ports 5988 and 5989.  You’ll again use the IP of the host that the agent is installed on as your target for array discovery.

3 – Host Discoveries

Host discoveries can be done directly without an agent.  You can use the root password for UNIX or ESX, and any AD account in Windows that has local administrator rights on each server.  Of course, you can also set up specialized service accounts with the appropriate rights based on your company’s security regulations.

4 – Enable Path Data Collection

In order to see specific information about LUNs on the array, you will need to enable Path Performance Collection for each host.  If the host isn’t discovered and performance collection isn’t enabled, you won’t see any LUN information when looking at arrays.  To enable it, go to Discovery | Objects list | Hosts from the ProSphere console and click on the “On | Off” slider button to turn it on for each host.

5 – Verify Full Path Connectivity

Once all of the discoveries are complete, you can verify full path connectivity for an array by going to Discovery | Objects list | Arrays, clicking on any array, and looking at the map.  If there is a cloud representing a switch with a red line to the array, you’re seeing the path.  You can use the same method for a host: if you go to Discovery | Objects list | Hosts and click on a host, you should see the host, the switch fabric, and the array on the map.  If you don’t see that full path, you won’t get any data collected.

Comments and Thoughts

You can go directly to EMC’s community forum for general support and information here:   https://community.emc.com/community/support/prosphere.

After using ProSphere 1.5.0 for a little while now, I must say it’s definitely NOT Control Center.  It isn’t quite as advanced or full featured, but I don’t think it’s supposed to be.  It’s supposed to be an easy-to-deploy tool to get basic, useful information quickly.

I use the pmcli.exe command line tool in ECC extensively for custom data exports and reporting, and ProSphere does not provide a similar tool.  EMC does have an API built into ProSphere that can be used to pull information over HTTPS (for example, to get a host list, type https://<prosphere_app_server>/srm/hosts).  I haven’t done too much research into that feature yet.  Version 1.5 added API support for array capacity, performance, and topology data.  You can read more about it in the white paper titled “ProSphere Integration: An overview of the REST API’s and information model” (h8893), which should be available on Powerlink.
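
As a quick sketch, the API can be hit with curl (-k skips validation of the vApp’s self-signed certificate; whether basic authentication is accepted is an assumption on my part, so check the white paper for the exact auth mechanism):

curl -k -u <user>:<password> "https://<prosphere_app_server>/srm/hosts"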

My initial upgrade from 1.0.3 to 1.5.0 did not go smoothly; I had about a 50% success rate across all of my installations.  In most cases the upgrade itself would complete but services wouldn’t start afterwards, and in one case the update web page simply stayed blank and would not let me run the upgrade to begin with.  Beware of upgrades if you want to preserve existing historical data: I ended up deleting the vApp and starting over for most of my deployments.

I’ve only recently been able to get all of my discoveries completed.  I feel that the ProSphere documentation is somewhat lacking; I found myself wanting more detail in many areas.  Most of my time has been spent doing trial-and-error testing with a little help from EMC support after I opened an SR.  I’ll write a more detailed post in the future about actually using ProSphere in the real world once I’ve had more time with it.

Other items to note:

-ProSphere does not provide the same level of detail for historical data that you get in StorageScope, nor does it give the same amount of detail as Performance Manager.  It’s meant more for a quick “at a glance” view.

-ProSphere does not include the root password in the documentation; customers aren’t necessarily supposed to log in to the console.  I’m sure with a call to support you could obtain it.  Having the ability to at least start and stop services would be useful, as I had an issue with one of my upgrades where services wouldn’t start.  You can view the status of the services on any ProSphere server by navigating to https://prosphere_app_server/cgi-bin/mon.cgi?command=query_opstatus_full (see the curl sketch after these notes).

-ProSphere doesn’t gather the same level of detail about hosts and storage as ECC, but that’s the price you pay for agentless discovery.  Agents are needed for more detailed information.
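
As noted above, the service status page can also be checked from a script; a curl sketch (-k again because of the self-signed certificate):

curl -k "https://<prosphere_app_server>/cgi-bin/mon.cgi?command=query_opstatus_full"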

How to troubleshoot EMC Control Center WLA Archive issues

We’re running EMC Control Center 6.1 UB12, and we use it primarily for its robust performance data collection and reporting capabilities.  Performance Manager is a great tool and I use it frequently.

Over the years I’ve had occasional issues with the WLA Archives not collecting performance data and I’ve had to open service requests to get it fixed.  Now that I’ve been doing this for a while, I’ve collected enough info to troubleshoot this issue and correct it without EMC’s assistance in most cases.

Check your ..\WLAArchives\Archives directory and look under the Clariion (or Celerra) folder, then the folder with your array’s serial number, then the interval folder.  This is where the “*.ttp” (text) and “*.btp” (binary) performance data files are stored for Performance Manager.  Sort by date.  If a new file hasn’t been written in the last few hours, data is not being collected.
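
From a command prompt, the same check can be done with a date-sorted listing (newest first; the path is an example, so substitute your install root, serial number, and interval folder):

dir /o-d "<ECC_install_root>\WLAArchives\Archives\Clariion\<serial_number>\<interval>"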

Here are the basic items I generally review when data isn’t being collected for an array:

  1. Log in to every array in Unisphere, go to system properties, and on the ‘General’ tab make sure statistics logging is enabled.  I’ve found that if you don’t have an analyzer license on your array and start the 7-day data collection for a “naz” file, the stats logging option will be disabled after the 7 days are up.  You’ll have to go back in and re-enable it after the collection is complete.  If stats logging isn’t enabled on the array, the WLA data collection will fail.
  2. If you recently changed the password on your Clariion domain account, make sure that naviseccli is updated properly for security access to all of your arrays (use the “addusersecurity” CLI option; see the example after this list) and perform a rediscovery of all your arrays from within the ECC console.  There is no way from within the ECC console to update the password on an array; you must go through the discovery process again for all of them.
  3. Verify the agents are running.  In the ECC console, click on the gears icon in the lower right hand corner.  This opens a window that shows the status of all the agents, including the WLA Archiver.  If WLA isn’t started, you can start it by right-clicking on any array, choosing Agents, then Start.  Check the WLAArchives directories again (after waiting about an hour) and see if it’s collecting data again.
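
For reference on step 2, the addusersecurity syntax looks like this (a sketch; run it as the same Windows account the agents run under, since naviseccli stores the credentials in a per-user security file, and -scope 0 indicates a global account):

naviseccli -AddUserSecurity -user <domain_account> -password <new_password> -scope 0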

If those basic steps don’t work, checking the logs may point you in the right direction:

  1. Review the Clariion agent logs for errors.  You’re not looking for anything specific here; just search for “error”, “unreachable”, or the specific IPs of your arrays and see if there is anything obviously wrong.
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Bx.log.gz
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL.ini
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Err.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Bx_Err.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Discovery.log.gz
 

Here’s an example of an error I found in one case:

            MGL 14:10:18 C P I 2536   (29.94 MB) [MGLAgent::ProcessAlert] => Processing SP
            Unreachable alert. MO = APM00100600999, Context = Clariion, Category = SP
            Element = Unreachable
 

  2. Review the WLA Agent logs.  Again, just search for errors and see if there is anything obviously wrong; a findstr one-liner that covers both sets of logs follows below.

            %ECC_INSTALL_ROOT%\exec\ENW610\ENW.log
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Bx.log.gz
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW.ini
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Err.log
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Bx_Err.log
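
A quick way to search both sets of logs at once from a command prompt (the .log.gz archives need to be extracted before findstr can read them):

findstr /i /c:"error" /c:"unreachable" "%ECC_INSTALL_ROOT%\exec\MGL610\*.log" "%ECC_INSTALL_ROOT%\exec\ENW610\*.log"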
 

If the logs don’t show anything obvious, here are the steps I take to restart everything.  This has worked on several occasions for me.

  1. From the Control Center console, stop all agents on the ECC Agent server.  Do this by right-clicking on the agent server (in the left pane), choosing Agents, then Stop.  Follow the prompts from there.
  2. Log in to the ECC Agent server console and stop the master agent.  You can do this in Computer Management | Services by stopping the service titled “EMC ControlCenter Master Agent”, or from the command line (see the sketch after this list).
  3. From the Control Center console, stop all agents on the Infrastructure server.  Do this by right-clicking on the infrastructure server (in the left pane), choosing Agents, then Stop.  Follow the prompts from there.
  4. Verify that all services have stopped properly.
  5. From the ECC Agent server console, go to C:\Windows\ECC\ and delete all .comfile and .lck files.
  6. Restart all agents on the Infrastructure server.
  7. Restart the Master Agent on the Agent server.
  8. Restart all other services on the Agent server.
  9. Verify that all services have restarted properly.
  10. Wait at least an hour and check to see if the WLA Archive files are being written.
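
For steps 2, 5, and 7, here is what the console work looks like from a command prompt on the agent server (a sketch; service and path names as described in the steps above):

:: Step 2: stop the Master Agent service
net stop "EMC ControlCenter Master Agent"

:: Step 5: remove the .comfile and .lck files
del /q C:\Windows\ECC\*.comfile C:\Windows\ECC\*.lck

:: Step 7: restart the Master Agent
net start "EMC ControlCenter Master Agent"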

If none of these steps resolve your problem and you don’t see any errors in the logs, it’s time to open an SR with EMC.  I’ve found the EMC staff that supports ECC to be very knowledgeable and helpful.