Category Archives: Troubleshooting

Best Practices for FAST Cache

I recently received a comment asking for more information on EMC's FAST Cache, specifically about why increased CPU utilization was observed after a FAST Cache expansion. It's most likely due to the rebuilding of the cache after the expansion, and possibly having FAST Cache enabled on LUNs that shouldn't have it, such as those with heavy sequential I/O. It's hard to pinpoint the exact cause of an issue like that without a thorough analysis of the array itself, however.  I thought I'd do a quick write-up of EMC's best practices for implementing FAST Cache and the caveats to consider when implementing it.

What is FAST Cache?

First, a quick overview of what it is.  EMC's FAST Cache uses a RAID set of EFD drives that sits between DRAM cache and the disks themselves, holding a large percentage of the most frequently accessed data on high-performance EFDs.  It hits a price/performance sweet spot between DRAM and traditional spinning disks for cache, and it can greatly increase array performance.

The theory behind FAST Cache is simple:  we divide the array's storage into 64KB blocks, we count the number of hits on those blocks, and we create a cache page on the FAST Cache EFDs once a block has taken three read (or write) hits.  If FAST Cache fills up, the array looks for pages on the EFDs that can be combined into a full-stripe write to the spinning disks in the array and force-flushes them out to make room.

FAST Cache uses a “three strikes” algorithm.  If you are moving large amounts of data, the FAST Cache algorithm does not activate, which is by design, as cache does not help at all in large copy transactions.  Random hits on active blocks, however, will ultimately promote those blocks into FAST Cache.  This is where the 64KB granularity makes a difference.  Typical workload I/Os are 64KB or less, and there is a significant chance that even if a workload is performing 4KB reads and writes to different blocks, they will still hit the same 64KB FAST Cache block, resulting in the promotion of that data into FAST Cache.  Cool, right?  It works very well in practice.  With all that said, there are still plenty of implementation considerations for an ideal FAST Cache configuration.  Below is an overview of EMC's best practices.
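If you want to see how FAST Cache is currently configured on an array before making any changes, naviseccli will show its state, size, and member drives. This is just a quick reference sketch, assuming a CX4 or VNX running FLARE 30 or later with a naviseccli security file already set up; substitute your own SP IP address:

naviseccli -h <sp_ip> cache -fast -info       (shows the FAST Cache state, RAID type, size, and member disks)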

Best Practices for LUNs and Pools

  • Only use it where you need it. The FAST Cache driver has to track every I/O to calculate whether a block needs promotion to FAST Cache, which adds to SP CPU utilization.  As a best practice, you should disable FAST Cache for LUNs that won't benefit from it; it cuts that overhead and can improve overall performance.  Having a separate storage pool for LUNs that don't need FAST Cache would be ideal (see the example command after this list for disabling it on a pool).

Disable FAST Cache for the following LUN types:

– Secondary Mirror and Clone destination LUNs
– LUNs with small-block, highly sequential I/O, such as Oracle database logs and SnapSure dvols
– LUNs in the reserved LUN pool
– Recoverpoint Journal LUNs
– SnapView Clones and MirrorView Secondary Mirrors

  • Analyze where you need it most.  Based on a workload analysis, I'd consider restricting the use of FAST Cache to the LUNs or Pools that need it the most.  For every new block promoted into FAST Cache, the least recently accessed blocks are evicted.  If your FAST Cache capacity is limited, even frequently accessed blocks may be evicted before they're accessed again.
  • Upgrade to the latest OS release. On the VNX platform, upgrading to the latest FLARE or MCx release can greatly improve the performance of FAST Cache.  It's been a few years now, but as an example, FLARE r32 recovers performance much faster after a FAST Cache drive failure than r31, and it automatically avoids promoting small sequential block I/O into FAST Cache.  It's always a good idea to run a current version of the code.
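As referenced above, here is what disabling FAST Cache on a specific storage pool looks like from the CLI. This is a hedged sketch assuming a pool-based VNX configuration; the pool ID and SP IP are placeholders, and classic RAID Group LUNs have FAST Cache enabled or disabled per-LUN from the LUN properties in Unisphere instead:

naviseccli -h <sp_ip> storagepool -modify -id 0 -fastcache off
naviseccli -h <sp_ip> storagepool -list -id 0 -fastcache       (verify the change; the -fastcache qualifier may vary by release)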

Best Practices For VNX arrays with MCx:

  • Spread it out. Spread the drives as evenly as possible across the available backend buses.  Be careful, though, as you shouldn't add more than 8 FAST Cache flash drives per bus, including any unused flash drives reserved as hot spares.
  • Always use DAE 0. Try to use DAE 0 on each bus for flash drives, as it provides the lowest latency.

Best Practices for VNX and CX4 arrays with FLARE 30-32: 

  • CX4? No more than 4 per bus. If you’re still using an older CX4 series array, don’t use more than 4 FAST Cache drives per bus, and don’t put all of them on bus 0. If they are all on the same bus, they could completely saturate this bus with I/O.
  • Spread it out. Spread the FAST Cache drives over as many buses as possible. Bus saturation would be a particular concern if all the drives were on bus 0, because that bus is also used to access the vault drives.  Note that the VNX has six times the back-end bandwidth per bus of a CX4, so it's less of a concern there.
  • Match the drive sizes. All the drives in FAST Cache must be of the same capacity; otherwise the workload on each drive would rise proportionately with its capacity.  In other words, a 200GB drive would carry double the workload of a 100GB drive.
  • VNX? Use enclosure 0. Put the EFD drives in the first DAE on any bus (i.e. Enclosure 0).  The I/O has to pass through the LCC of each DAE between the drive and the SP, and each extra LCC it passes through will add a small amount of latency. The latency would normally be negligible, but is significant for flash drives.  Note that on the CX4, all I/O has to pass through every LCC anyway.
  • Mind the order the disks are added.  The order the drives are added dictates which drives are primary & secondary. The first drive added is the primary for the first mirror, the next drive added is its secondary for the first mirror, the third drive is the primary for the second mirror, etc.
  • Location, Location, Location. It's a more advanced configuration and requires the use of the CLI, but for the highest availability, place the primary and secondary of each FAST Cache RAID 1 pair on different buses.

 

 

 

 


VNX NAS Files incorrectly report as Locked for Editing

When opening a shared Microsoft Office file, you may see the error message “File in Use, file_name is locked for editing by user_name“, when in fact no other user is currently using the file.

We had users who would view files with the preview pane, which created a lock on the file, and when the Explorer window was closed the lock would remain.  The next time the file was accessed, it would report as locked even though no user had it open.  Below are some steps you can take to troubleshoot and resolve the issue.  Note that changing some of these parameters can have a performance impact, so make these changes at your own risk.  For background, oplocks (opportunistic locks) let clients lock files and locally cache information while preventing another user from changing the file, which increases performance for many file operations.

1. Disable Oplocks on the VNX

Disabling oplocks can affect client performance. It will increase the number of metadata requests that are sent to the server because when you use SMB with oplocks, the client caches the data that is locked to speed up access to frequently accessed files. When oplocks are disabled, the client does not cache data and all reads are made directly to the NAS server.

Syntax for disabling oplocks and verifying the change:

[nasadmin@VNX1 ~]$ server_mount vdm_file_system -o nooplock test_file_system01_fs /test_file_system01
vdm_file_system : done

[nasadmin@VNX1 ~]$ server_mount vdm_file_system | grep test_file_system01_fs
test_file_system01_fs on /test_file_system01 uxfs,perm,rw,noprefetch,nonotify,accesspolicy=NATIVE,nooplock

2. Disable caching on the Windows client

The Windows client setting controls the cache lifetime. As stated earlier, if caching is disabled on the Windows client, all reads go directly to the NAS server. To disable caching on the Windows client rather than disabling oplocks on the VNX Data Mover, the following three registry values need to be changed:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters\

– Directory cache,  set DirectoryCacheLifetime to Zero.
– File Not Found cache, set FileNotFoundCacheLifetime to Zero.
– File information cache, set FileInfoCacheLifetime to Zero.
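If you'd rather script the client-side change than edit the registry by hand, the standard reg add syntax below sets all three values to 0. This is just a sketch to adapt for your environment; test it on a single client first, and a logoff/logon or reboot may be needed for the new cache lifetimes to take effect:

reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters" /v DirectoryCacheLifetime /t REG_DWORD /d 0 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters" /v FileNotFoundCacheLifetime /t REG_DWORD /d 0 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters" /v FileInfoCacheLifetime /t REG_DWORD /d 0 /f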

3. Apply a Microsoft hotfix

The Microsoft KB article http://support.microsoft.com/kb/942146 describes the problem and the fix in detail.  It directly addresses the issue with files locking from the preview pane.  It applies to all versions of Windows Vista and 7 as well as Windows Server 2008.

Web interface disabled on Brocade switch

I ran into an issue where one of our Brocade switches was inaccessible via the web browser. The error below was displayed when connecting to the IP:

Interface disabled
This Interface (10.2.2.23) has been blocked by the administrator.

In order to resolve this, you’ll need to allow port 80 traffic on the switch.  It was disabled on mine.

First, log in to the switch and review the existing IP filters (look for port 80 set to deny):

switcho1:admin> ipfilter --show

Name: default_ipv4, Type: ipv4, State: active
Rule   Source IP   Protocol   Dest Port    Action
1      any         tcp        22           permit
2      any         tcp        23           deny
3      any         tcp        897          permit
4      any         tcp        898          permit
5      any         tcp        111          permit
6      any         tcp        80           deny
7      any         tcp        443          permit
8      any         udp        161          permit
9      any         udp        111          permit
10     any         udp        123          permit
11     any         tcp        600 - 1023   permit
12     any         udp        600 - 1023   permit

Next, clone the default policy, since the default policy itself cannot be modified.  Note that you can name the new policy anything you like; I chose to name it “Allow80”.

ipfilter --clone Allow80 -from default_ipv4

Delete the rule that denies port 80 (rule 6 in the above example):

ipfilter --delrule Allow80 -rule 6

Add a rule back in to permit it:

ipfilter --addrule Allow80 -rule 12 -sip any -dp 80 -proto tcp -act permit

Save it:

ipfilter --save Allow80

Activate it (this will change the default policy to a “defined” state):

ipfilter --activate Allow80

 

That’s it… you should now be able to access your switch via the web browser.

VPLEX Unisphere Login hung at “Retrieving Meta-Volume Information”

I recently had an issue where I was unable to log in to the Unisphere GUI on the VPLEX; it would hang with the message “Retrieving Meta-Volume Information” after the progress bar reached about 30%.

This was caused by a hung Java process.  In order to resolve it, you must restart the management server. This will not cause any disruption to hosts connected to the VPLEX.

To do this, run the following command:

ManagementServer:/> sudo /etc/init.d/VPlexManagementConsole restart

If this hangs or does not complete, you will need to run the top command to identify the PID for the java service:

admin@service:~>top
Mem:   3920396k total,  2168748k used,  1751648k free,    29412k buffers
Swap:  8388604k total,    54972k used,  8333632k free,   527732k cached

  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
26993 service   20   0 2824m 1.4g  23m S     14 36.3  18:58.31 java
 4948 rabbitmq  20   0  122m  42m 1460 S      1  1.1  13118:32 beam.smp
    1 root      20   0 10540   48   36 S      0  0.0  12:34.13 init

Once you've identified the PID for the java service, you can kill the process with the kill command (using the PID from your own top output; in the example above it's 26993) and then run the command to restart the management console again.

ManagementServer:/> sudo kill -9 26993
ManagementServer:/> sudo /etc/init.d/VPlexManagementConsole start

Once the management server restarts, you should be able to log in to the Unisphere for VPLEX GUI again.

Rescan Storage System command on Celerra results in conflict:storageID-devID error

I was attempting to extend our main production NAS file pool on our NS-960 and ran into an issue.  I had recently freed up 8 SATA disks from a block pool and was attempting to re-use them to extend a Celerra file pool.  I created a new RAID Group and a LUN that used the maximum capacity of the RAID Group.  I then added the LUN to the Celerra storage group, making sure to set the HLU to a number greater than 15.  I then changed the setting on our main production file pool to auto-extend and clicked on the “Rescan Storage Systems” option.  Unfortunately, the rescan produced an error every time it was run.  I have done this exact same procedure in the past and it's worked fine.  Here is the error:

conflict:storageID-devID: disk=17 old:symm=APM00100600999,dev=001F new:symm=APM00100600999,dev=001F addr=c16t1l11

I checked the disks on the Celerra using the nas_disk -l command, and the new disk shows up as “in use” even though the rescan command didn't properly complete.

[nasadmin@Celerra tools]$ nas_disk -l
id   inuse  sizeMB    storageID-devID      type   name  servers
17    y     7513381   APM00100600999-001F  CLATA  d17   <BLANK>

Once the dvol is presented to the Celerra (assuming the rescan goes fine) it should not show as in use until it is assigned to a storage pool and a file system uses it.  In this case that didn't happen.  If you run /nas/tools/whereisfs (depending on your DART version, it may be “.whereisfs” with the dot) it shows a listing of every file system and which disk and LUN each resides on.  I verified with that command that the disk was not in use.

To be on the safe side, I opened an SR with EMC rather than simply deleting the disk.  They suggested that the NAS database has some corruption. I'm going to have EMC's Recovery Team check the usage of the diskvol and then delete it and re-add it.  In order to engage the recovery team you need to sign a “Data Deletion Form” absolving EMC of any liability for data loss, which is standard practice when they delete volumes on a customer array.  If there are any further caveats or important things to note after EMC has taken care of this, I'll update this post.

VPLEX initiator paths dropped

We recently ran into an SP bug check on one of our VNX arrays and after it came back up several of the initiator paths to the VPLEX did not come back up.  We were also seeing IO timeouts.  This is a known bug that happens when there is an SP reboot and is fixed with Patch 1 for GeoSynchrony 5.3.  EMC has released a script that provides a workaround until the patch can be applied: https://download.emc.com/downloads/DL56253_VPLEX_VNX_SCRIPT.zip.zip

The following pre-conditions need to happen during a VNX NDU to see this issue on VPLEX:
1] During a VNX NDU, SPA goes down.
2] At this point, IO time-outs start happening on the IT nexuses pertaining to SPA.
3] The IO time-outs cause the VPLEX SCSI layer to send LU Reset TMFs. These LU Reset TMFs get timed out as well.

You can review ETA 000193541 on EMC’s support site for more information.  It’s a critical bug and I’d suggest patching as soon as possible.

 

Dynamic allocation pool limit has been reached

We were having issues with our backup jobs failing on CIFS share backups using Symantec NetBackup.  The jobs died with a “status 24”, which means the job was losing communication with the source.  Our backup administrator provided me with the exact times and dates of the failures, and I noticed that immediately preceding the failures this error appeared in the server log on the control station:

2012-08-05 07:09:37: KERNEL: 4: 10: Dynamic allocation pool limit has been reached. Limit=0x30000 Current=0x50920 Max=0x0
 

A quick Google search came up with this description of the error:  “The maximum amount of memory (number of 8K pages) allowed for dynamic memory allocation has almost been reached. This indicates that a possible memory leak is in progress and the Data Mover may soon panic. If Max=0(zero) then the system forced panic option is disabled. If Max is not zero then the system will force a panic if dynamic memory allocation reaches this level.”

Based on the fact that the error shows up right before a backup failure, I saw the correlation.  To fix it, you'll need to modify the heap limit from the default of 0x00030000 to a larger size.  Here are the commands to change the value and to verify it:

.server_config server_2 -v "param kernel mallocHeapLimit=0x40000" (changes the value)
.server_config server_2 -v "param kernel" (lists the kernel parameters)
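To confirm the new value took effect, you can filter the full parameter listing from the second command above; mallocHeapLimit should now show the new current value next to the default:

.server_config server_2 -v "param kernel" | grep mallocHeapLimit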
 

Below is a list of all the kernel parameters:

Name                                                 Location        Current       Default
----                                                 ----------      ----------    ----------
kernel.AutoconfigDriverFirst                         0x0003b52d30    0x00000000    0x00000000
kernel.BufferCacheHitRatio                           0x0002093108    0x00000050    0x00000050
kernel.MSIXdebug                                     0x0002094714    0x00000001    0x00000001
kernel.MSIXenable                                    0x000209471c    0x00000001    0x00000001
kernel.MSI_NoStop                                    0x0002094710    0x00000001    0x00000001
kernel.MSIenable                                     0x0002094718    0x00000001    0x00000001
kernel.MsiRouting                                    0x0002094724    0x00000001    0x00000001
kernel.WatchDog                                      0x0003aeb4e0    0x00000001    0x00000001
kernel.autoreboot                                    0x0003a0aefc    0x00000258    0x00000258
kernel.bcmTimeoutFix                                 0x0002179920    0x00000002    0x00000002
kernel.buffersWatermarkPercentage                    0x0003ae964c    0x00000021    0x00000021
kernel.bufreclaim                                    0x0003ae9640    0x00000001    0x00000001
kernel.canRunRT                                      0x000208f7a0    0xffffffff    0xffffffff
kernel.dumpcompress                                  0x000208f794    0x00000001    0x00000001
kernel.enableFCFastInit                              0x00022c29d4    0x00000001    0x00000001
kernel.enableWarmReboot                              0x000217ee68    0x00000001    0x00000001
kernel.forceWholeTLBflush                            0x00039d0900    0x00000000    0x00000000
kernel.heapHighWater                                 0x00020930c8    0x00004000    0x00004000
kernel.heapLowWater                                  0x00020930c4    0x00000080    0x00000080
kernel.heapReserve                                   0x00020930c0    0x00022e98    0x00022e98
kernel.highwatermakpercentdirty                      0x00020930e0    0x00000064    0x00000064
kernel.lockstats                                     0x0002093128    0x00000001    0x00000001
kernel.longLivedChunkSize                            0x0003a23ed0    0x00002710    0x00002710
kernel.lowwatermakpercentdirty                       0x0003ae9654    0x00000000    0x00000000
kernel.mallocHeapLimit                               0x0003b5558c    0x00040000    0x00030000  (This is the parameter I changed)
kernel.mallocHeapMaxSize                             0x0003b55588    0x00000000    0x00000000
kernel.maskFcProc                                    0x0002094728    0x00000004    0x00000004
kernel.maxSizeToTryEMM                               0x0003a23f50    0x00000008    0x00000008
kernel.maxStrToBeProc                                0x0003b00f14    0x00000080    0x00000080
kernel.memSearchUsecs                                0x000208fa28    0x000186a0    0x000186a0
kernel.memThrottleMonitor                            0x0002091340    0x00000001    0x00000001
kernel.outerLoop                                     0x0003a0b508    0x00000001    0x00000001
kernel.panicOnClockStall                             0x0003a0cf30    0x00000000    0x00000000
kernel.pciePollingDefault                            0x00020948a0    0x00000001    0x00000001
kernel.percentOfFreeBufsToFreePerIter                0x00020930cc    0x0000000a    0x0000000a
kernel.periodicSyncInterval                          0x00020930e4    0x00000005    0x00000005
kernel.phTimeQuantum                                 0x0003b86e18    0x000003e8    0x000003e8
kernel.priBufCache.ReclaimPolicy                     0x00020930f4    0x00000001    0x00000001
kernel.priBufCache.UsageThreshold                    0x00020930f0    0x00000032    0x00000032
kernel.protect_zero                                  0x0003aeb4e8    0x00000001    0x00000001
kernel.remapChunkSize                                0x0003a23fd0    0x00000080    0x00000080
kernel.remapConfig                                   0x000208fe40    0x00000002    0x00000002
kernel.retryTLBflushIPI                              0x00020885b0    0x00000001    0x00000001
kernel.roundRobbin                                   0x0003a0b504    0x00000001    0x00000001
kernel.setMSRs                                       0x0002088610    0x00000001    0x00000001
kernel.shutdownWdInterval                            0x0002093238    0x0000000f    0x0000000f
kernel.startAP                                       0x0003aeb4e4    0x00000001    0x00000001
kernel.startIdleTime                                 0x0003aeb570    0x00000001    0x00000001
kernel.stream.assert                                 0x0003b00060    0x00000000    0x00000000
kernel.switchStackOnPanic                            0x000208f8e0    0x00000001    0x00000001
kernel.threads.alertOptions                          0x0003a22bf4    0x00000000    0x00000000
kernel.threads.maxBlockedTime                        0x000208f948    0x00000168    0x00000168
kernel.threads.minimumAlertBlockedTime               0x000208f94c    0x000000b4    0x000000b4
kernel.threads.panicIfHung                           0x0003a22bf0    0x00000000    0x00000000
kernel.timerCallbackHistory                          0x000208f780    0x00000001    0x00000001
kernel.timerCallbackTimeLimitMSec                    0x000208f784    0x00000003    0x00000003
kernel.trackIntrStats                                0x000209021c    0x00000001    0x00000001
kernel.usePhyDevName                                 0x0002094720    0x00000001    0x00000001

Celerra Disk Provisioning Wizard incorrectly believes there are not enough drives available for provisioning

I recently had an issue attempting to extend our production NAS file pool on our NS-960.  We had just added six new 2TB SATA disks to the array, and when I launched the Disk Provisioning Wizard it gave me this error:

“The number of drives available for provisioning additional storage are insufficient”.

That of course wasn't true, as a 4+2 RAID6 config is indeed supported on this platform and I had just added six drives. I did come up with a workaround to do it manually, thanks to some helpful advice from our local EMC technical rep.  I manually created a RAID6 RAID Group in a 4+2 config, and then created a single LUN using all of the available space in the RAID Group (about 7337GB).  Once the LUN is created, you can add it to the Celerra storage group, which in my case was named “Celerra_hostname”.

When adding the LUN to the storage group, there is a critical step that you must not skip: the HLU number must be modified.  After you click on a LUN and click Add, look for it in the list and notice that the far right column shows the HLU (Host LUN ID).  The LUN you just added will have a blank entry.  It doesn't look like an editable field, but it is: simply click on the blank area where the number should be and you'll get a drop-down box.  The number you choose must be greater than 15.  Once you've modified the HLU for the new LUN, click OK to complete the process.
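For reference, the same manual workaround can also be done from the CLI instead of Unisphere. This is only a rough sketch with placeholder values (the RAID group ID, disk locations, LUN number, capacity, and storage group name are all examples, and the bind syntax can vary by FLARE release), so verify everything against your own array first:

naviseccli -h <sp_ip> createrg 10 2_0_0 2_0_1 2_0_2 2_0_3 2_0_4 2_0_5       (create the 4+2 RAID6 group)
naviseccli -h <sp_ip> bind r6 100 -rg 10 -cap 7337 -sq gb                   (bind a LUN using the full capacity)
naviseccli -h <sp_ip> storagegroup -addhlu -gname Celerra_hostname -hlu 16 -alu 100    (add it to the Celerra storage group with an HLU greater than 15)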

Next, you’ll want to switch back over to the Celerra Management interface, click on the ‘Storage’ tab, then click on the ‘Rescan Storage Systems’ link.  You will get a warning message that states:

“Rescan detects newly available storage and storage systems. Do not rescan unless all primary Data Movers are operating normally. The operation might take a few minutes to complete.”  

Heed the warning and make sure your data movers are up and functional.  You can monitor the progress in the background tasks area.   On my first attempt the Rescan failed.  I got this error message:

“Storage API code=3593: SYMAPI_C_CLARIION_LOAD_ERROR.  An error occurred while data was being loaded from a Clariion.” | “No additional information is available” | “No recommended action is available”.

By the time I got that error it was the end of my work day, so I decided to get back to it the next day and had planned on opening an SR.  When I re-ran the same scan the next day it worked fine and my production pool auto-extended.  Problem solved.

Celerra / VNX File replication job creation errors when selecting destination

I recently had an issue with setting up a new Celerra file system replication job. As soon as I selected the destination system I received the three errors below.

1) Query VDMs All. Cannot access any Data Mover on the remote system, hostname

Severity: Error
Brief Description: Cannot access any Data Mover on the remote system, hostname
Full Description: No Data Movers are available on the specified remote system to perform this operation
Recommended Action: 1) Check if the Data Movers on the specified remote system are accessible. 2) Ensure that the difference in system times between the local and remote Celerra systems or VNX systems does not exceed 10 minutes. Use NTP on the Control Stations to synchronize the system clocks. 3) Ensure that the passphrase on the local Control Station matches with the passphrase on the remote Control Station. 4) Ensure that the same local users that manage VNX for file systems exist on both the source and the destination Control Station. 5) Ensure that the global account is mapped to the same local account on both local and remote VNX Control Stations. Primus emc263860 provides more details.
Message ID: 13690601568

2) Query storage pools All. Execution failed: Segmentation fault: Operating system signal. [STRING.make_from_string]

Severity: Error
Brief Description: Execution failed: Segmentation fault: Operating system signal. [STRING.make_from_string]
Full Description: Operation failed for the reason described in the accompanying message.
Recommended Action: Correct the cause of the problem and repeat the operation.
Message ID: 13421840573

3) There are no destination pools available.

Severity: Info
Brief Description: There are no destination pools available.
Full Description: Destination side storage pools are not available.
Recommended Action: Check whether the storage pools have enough space.
Message ID: 26845970450

I was unable to determine the cause of the problem so I opened an SR with EMC.

It turns out there was a user discrepancy between the /etc/passwd file and the /nas/site/user_db file. This was causing the following error when checking the interconnect:

[nasadmin@celerra02 log]$ nas_cel -interconnect -l
Error 2237: Execution failed: Segmentation fault: Operating system signal. [STRING.make_from_string]

The output should look something like this:

[nasadmin@celerra02 log]$ nas_cel -interconnect -l
id    name          source_server destination_system destination_server
20001 loopback      server_2      DRSITE1            server_2
20003 SITE1VNX5500  server_2      DRSITE1            server_2
20004 SITE2NS960    server_2      DRSITE2            server_2
20007 SITE11NS960   server_2      DRSITE1            server_2
20005 SITE3VNX5500  server_2      DRSITE1            server_2
20006 SITE4VNX5500  server_2      DRSITE1            server_2
20008 SITE5NS120    server_2      DRSITE1            server_2
40001 loopback      server_4      DRSITE1            server_4
40003 SITE2NS40     server_4      DRSITE2            server_2

The problem was resolved by removing the entries from /nas/site/user_db that were not in the /etc/passwd file. This was caused by a manual modification of the passwd file by a sysadmin; some old entries had been removed and the matching changes were never made in the user_db file.

 

Troubleshooting NAS Discovery issues on EMC Ionix Control Center (ECC)

We recently converted two of our existing VNX arrays to unified systems, and I was attempting to add the new NAS systems to Control Center.  I went through the normal assisted discovery using the ‘NAS Container’ option.  Unfortunately, I got an error in the discovery results window.  Here's the error I saw:

SESSION_ACTION: Discover  [4]  MO Type = NasContainer
Container_IP=10.10.10.4 | Container_Port=443 | Container_Username=root | Container_Password=****** | Container_Type=Celerra
  command status = finished, errors
  objects found = 6  agents responding = 2
  completed in 231 seconds
  action begins at: Wed Nov 06 09:48:05 CST 2013
  action ends at: Wed Nov 06 09:51:56 CST 2013
 
Reported objects:
[1]  NasContainer=10.10.10.4
 
Reported agent errors:
[1]  ADAResult: (9) Celerra@10.10.10.4 : SSH communication failed – Please verify emcplink settings
    Responding agent: NAS Agent @ eccagtserver.rgare.net
[2]  ADAResult: (9) Celerra@10.10.10.4: nas_version returned invalid response
    Responding agent: NAS Agent @ eccinfserver.rgare.net
Responding agents:
[1]  NAS Agent @ eccagtserver.rgare.net
[2]  NAS Agent @ eccinfserver.rgare.net

 

It looks like there’s an ssh setting that’s incorrect.  Being unfamiliar with the emcplink utility, I did a bit of research on how to configure it properly, and I will go through what needs to be done.

Before diving in to using and configuring emcplink, here are some simple troubleshooting steps you should run through first:

   – Verify that the NAS agent is installed and active.  You can view all of the running agents by clicking on the gear icon on the lower right hand side (in the status bar).  Scroll through the agents and make sure the agent is active.

   – Verify that the Java Process is running on the Control Station.

         * Log in to the Control Station and type the following command:

                      ps -aex |grep java

          * If it’s running, you will see lines similar to the following:

                      21927 ? S 0:15 /usr/java/bin/java -server …..

                      22200 ? S 0:00 /usr/java/bin/java -server …..

 – Make sure that the ssh daemon is running (I'm assuming you're using ssh for remote connectivity):

           * Log in to the Control Station and type the following command:

                       ps -aex |grep sshd

           * If it’s running, you will see a line similar to the following:

                       882 ?      Ss     0:00 /usr/sbin/sshd

 – Verify that the Celerra (or VNX) data mover is connected to an array

                       nas_storage -list

 – Verify that the user name and password you're using during the assisted discovery work. Try logging on with that ID/password directly.

Here are the troubleshooting steps I took, and some more info about emcplink:

What is emcplink? It's a utility that allows you to specify security policies for secure shell (SSH) client authentication, which is required for the Storage Agent for NAS to discover NAS containers.

The highest ssh security level (full security) requires that users manually run emcplink in order to provide a username and password for ssh authentication and to manually accept an ssh key returned from emcplink to discover the NAS container.  After it's accepted, the key is stored on the NAS Agent host. If the key changes, to rediscover the NAS container you must manually run emcplink again and accept the changed key. If your environment does not require full ssh security, use emcplink to set lower security levels that will automatically accept new or changed keys without requiring the manual entry of ssh usernames, passwords, and keys.

The emcplink command is a command line utility. To run emcplink, first open a command prompt window.  Change to the <install_root>/exec/CNN610 directory on the host where Storage Agent for NAS resides, where <install_root> is the ControlCenter infrastructure install directory.

If your installation uses SSH version 2, update your agent configuration so emcplink uses SSH version 2 when handling SSH keys; the default is version 1. Note that SSH version 2 is not backward compatible with version 1. If you switch to SSH version 2, you must run emcplink again to rediscover all NAS containers that were previously discovered with SSH version 1.

If you want to update your install to version 2, follow these steps (I did this during my troubleshooting):

1. Stop Storage Agent for NAS using the ControlCenter Console.

2. Edit the following file:

      <install_root>/exec/CNN610/cnn.ini

3. In cnn.ini [ssh] change version = 1 to version = 2

4. Save and exit cnn.ini.

5. Restart Storage Agent for NAS.

The next step is to enable the policy that you need for your environment. The default policy is EMC_SSH_KEY_SECURITY_FULL.  You can add more than one policy.  If added policies contradict one another, the most recently added policy takes effect.

Enter the following command to add or remove an SSH security policy:

       emcplink -setpolicy [+|-]policy_name

       Example:  emcplink -setpolicy +EMC_SSH_KEY_SECURITY_FULL

In my case, I first disabled the default policy, then enabled the policy that I wanted. Here are the commands I ran:

       emcplink -setpolicy -EMC_SSH_KEY_SECURITY_FULL

       emcplink -setpolicy +EMC_SSH_KEY_SECURITY_ALLOW_NEW

After running it, I used the ‘getpolicy’ option to verify what the current active policy was:

       emcplink -getpolicy

The output looks like this:

          Policy Is:
              EMC_SSH_KEY_SECURITY_ALLOW_NEW
 

Here are the policy options you can choose from:

EMC_SSH_KEY_SECURITY_FULL (default)

Do not automatically accept any new or changed NAS container keys. They must be accepted manually, using emcplink (refer to emcplink – interactive). Provides the same functionality that plink (no longer valid) provided in previous ControlCenter versions, when manual user name/password/key entry was required.

EMC_SSH_KEY_SECURITY_ALLOW_NEW

Accept new keys, but not changed keys. SSH authentication occurs automatically for initial discovery, and also for subsequent discoveries as long as the NAS Container key does not change. If a key is changed, discovery is attempted via telnet.

EMC_SSH_KEY_SECURITY_ALLOW_CHANGE

Accept changed keys, but not new keys. When a Celerra is initially discovered, SSH authentication occurs manually. If the key is changed for subsequent discoveries of that NAS Container, SSH authentication occurs automatically.

EMC_SSH_KEY_SECURITY_NONE

Accept both new and changed keys. SSH authentication occurs automatically at initial discovery and all subsequent discoveries, regardless of whether NAS container keys are changed.

After I verified the policy I wanted was running, I then manually entered the SSH security information for each array I wanted to add, in order to accept the NAS container keys. Run the following command for each array to add the key to the server cache (the -2 optionally tells emcplink to use SSH version 2):

         emcplink -ssh -interactive -2 -pw password username@Array_IP_address

As noted above, SSH version 2 is not backward compatible with version 1, so switching to version 2 means running emcplink again to rediscover any NAS containers that were previously discovered with version 1.

That’s it! Once I accepted all of the ssh keys and re-ran the discoveries, the new arrays were discovered just fine.

 

Powerpath Install / Upgrade Issues

I recently had several issues when attempting to upgrade a Windows 2008 server from PowerPath v5.3 to v5.5 SP1. I uninstalled 5.3 using the Windows uninstall utility, rebooted, then reinstalled v5.5 SP1. After the reboot, the server did not come back up. In order to get it to boot, the “last known good configuration” option had to be chosen. I opened an SR with EMC, and they determined that the uninstall process was not completing correctly.

To resolve the problem, you need to run the executable from the command line and add a few parameters. The name of the Powerpath install file will vary depending on the version you are installing, but the command looks like this:

EMCPower.Net32.signed.5.3.b310.exe /v”/L*v C:\logs\PPremove.log NO_REBOOT=1 PPREMOVE=ALL”

In this example, the c:\logs directory must exist before you run it. After running that command to uninstall powerpath and then reinstalling the new version, I no longer had the problem of the server not booting correctly.

After properly installing it, I continued to have a problem with Powerpath administrator not properly recognizing the devices. All of the devices showed up as “DEV ??”. I also saw “harddisk ??” when running powermt display dev=all. To resolve the problem I ran through the following steps:

1. Open Device Manager under Disks, and right-click the device drive that had a yellow ‘!’.
2. Choose “Update Driver Software”.
3. Click on “Browse my computer for driver software”.
4. Click on “Let me pick from a list of device drivers on my computer”.
5. In the next screen make sure the “Show compatible hardware” box is checked.
6. Under the Model list you should see the ‘PowerPath Devices’ driver. Highlight it and click Next. This will install the PowerPath driver. When it is done it will require a reboot. Once the server has come back online, run ‘powermt display dev=all’ again to verify that harddisk?? has changed to harddisk## as expected.

Close requests fail on data mover when using Riverbed Steelhead appliance

We recently had a problem with one of our corporate applications having file close requests fail, resulting in 200,000+ open files on our production data mover.  This was causing numerous issues within the application.  We determined that the problem was a result of our Riverbed Steelhead appliance requiring a certain level of DART code in order to properly close the files.  The Steelhead appliance would fail when attempting to optimize SMBv2 connections.

Because a DART code upgrade is required to resolve the problem, the only temporary fix is to reboot the data mover.  I wrote a quick script on the Celerra to grab the number of open files, write it to a text file, and publish to our internal web server.  The command to check how many open files are on the data mover is below.

This command provides all the detailed information:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Timestamp   Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
11:15:36     3379      905     9584        11        9      272        30         1856     4915   

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Summary     Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
Minimum      3379      905     9584        11        9      272        30         1856     4915   
Average      3379      905     9584        11        9      272        30         1856     4915   
Maximum      3379      905     9584        11        9      272        30         1856     4915   

Adding a grep for Maximum and using awk to grab only the last column, this command will output only the number of open files, rather than the large output above:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | grep Maximum | awk '{print $10}'

The output of that command would simply be ‘4915’ based on the sample full output I used above.
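For anyone curious, the quick script mentioned above was essentially just that one-liner wrapped up with a date stamp and a copy to a web server. Here's a rough sketch; the output path and web server destination are hypothetical placeholders:

#!/bin/bash
# Grab the current open file count from server_stats
OPENFILES=`/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | grep Maximum | awk '{print $10}'`
# Append it to a text file with a timestamp
echo "`date` open files on server_2: $OPENFILES" >> /home/nasadmin/openfiles.txt
# Publish the text file to the internal web server (hypothetical destination)
scp /home/nasadmin/openfiles.txt webserver:/var/www/html/openfiles.txt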

The solution number from Riverbed’s knowledgebase is S16257.  Your DART code needs to be at least 6.0.60.2 or 7.0.52.  You will also see in your steelhead logs a message similar to the one below indicating that the close request has failed for a particular file:

Sep 1 18:19:52 steelhead port[9444]: [smb2cfe.WARN] 58556726 {10.0.0.72:59207 10.0.0.72:445} Close failed for fid: 888819cd-z496-7fa2-2735-0000ffffffff with ntstatus: NT_STATUS_INVALID_PARAMETER

Frequent 0x622 and 0x606 errors in the SP Event Logs

During some routine checking of the SP event logs on our NS-40 I noticed a large number of alerts. Every few seconds these three alerts would pop in:

0x60a Internal Information Only. A logical unit has been enabled
0x622 Background Verify Aborted
0x606 Unit Shutdown for trespass
 

After a bit of investigation, I narrowed down the cause to several large LUNs that had just been added to a new ESX host.  It turns out that the LUNs were still running the background zeroing process, and that’s what was causing all the alerts in the SP Log. When you create a new LUN and the disks have been previously used for other LUNs, the new LUN needs to be “zeroed” (filled with all zeros to clear data). This takes place in the background and it is part of the LUN initialization.  Once this background zeroing process completed on my new LUNs the alert messages stopped.  I was unaware of that process, so I did a bit of research on it.

LUNs are immediately available for use after a bind (using “Fastbind”); however, all the operations associated with a bind can take a long time to finish.  The duration of a LUN bind depends on the following:

  • LUN’s bind time background verify priority (rate)
  • Size of the LUN being bound
  • Type of drives in the LUN’s RAID Group
  • Potential disabling of initial verify on bind
  • State of the Storage System (Idle or Load)
  • Position of the LUN on the hard disks of the RAID Group

From that list, priority, LUN size, drive type, and verification selection have the greatest effect on duration.  You can calculate the approximate duration of the bind process with this formula:

Time = Bound LUN Capacity / Bind Rate

Here are the Average Bind Rates for FC and SATA disks:

Disk Type   ASAP Bind Rate   High Bind Rate   Medium (default) Bind Rate   Low Bind Rate
FC          83 MB/s          7.54 MB/s        5.02 MB/s                    4.02 MB/s
SATA        61.7 MB/s        7.47 MB/s        5.09 MB/s                    3.78 MB/s

If we were to calculate how many hours it would take to bind a 2000GB LUN on a five disk RAID5 group composed of SATA drives set to a medium (default) bind rate, here’s what the formula would look like:

Time = 2000 GB * 1024 MB/GB * (1 / 5.09 MB/s) * (1 hr / 3600 sec) = 111.76 hours
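If you'd rather let the shell do the math, the same calculation can be run with a quick bc one-liner (2000 GB at the SATA medium rate of 5.09 MB/s):

echo "scale=2; 2000 * 1024 / 5.09 / 3600" | bc
111.76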

There is a detailed white paper that covers this topic from EMC called “The Effect of Priorities on LUN Management Operations” that you can view here:  http://www.emc.com/collateral/hardware/white-papers/h4153-influence-priorities-emc-clariion-lun-wp.pdf.  That’s where I gathered the information above.

Can’t join CIFS Server to domain – sasl protocol violation

I was running a live disaster recovery test of our Celerra CIFS Server environment last week and I was not able to get the CIFS servers to join the replica of the domain controller on the DR network.  I would get the error ‘Sasl protocol violation’ on every attempt to join the domain.

We have two interfaces configured on the data mover, one that connects to production and one that connects to the DR private network.  The default route on the Celerra points to the DR network, and we have static routes configured for each of our remote sites in production to allow replication traffic to pass through.  Everything on the network side checked out: I could ping DCs and DNS servers, and NTP was configured to a DR network time server and was working.

I was able to ping the DNS Server and the domain controller:

[nasadmin@datamover1 ~]$ server_ping server_2 10.12.0.5
server_2 : 10.12.0.5 is alive, time= 0 ms
 
[nasadmin@datamover1 ~]$ server_ping server_2 10.12.18.5
server_2 : 10.12.18.5 is alive, time= 3 ms
 

When I tried to join the CIFS Server to the domain I would get this error:

[nasadmin@datamover1 ~]$ server_cifs prod_vdm_01 -Join compname=fileserver01,domain=company.net,admin=myadminaccount -option reuse prod_vdm_01 : Enter Password:********* Error 13157007706: prod_vdm_01 : DomainJoin::connect:: Unable to connect to the LDAP service on Domain Controller ‘domaincontroller.company.net’ (@10.12.0.5) for compname ‘fileserver01’. Result code is ‘Sasl protocol violation’. Error message is Sasl protocol violation.
 

I also saw this error message during earlier tests:

Error 13157007708: prod_vdm_01 : DomainJoin::setAccountPassword:: Unable to set account password on Domain Controller ‘domaincontroller.company.net’ for compname ‘fileserver01’. Kerberos gssError is ‘Miscellaneous failure. Cannot contact any KDC for requested realm. ‘. Error message is d0000,-1765328228.
 

I noticed these error messages in the server log:

2012-06-21 07:03:00: KERBEROS: 3: acquire_accept_cred: Failed to get keytab entry for principal host/fileserver01.company.net@COMPANY.NET – error No principal in keytab matches desired name (39756033) 2012-06-21 07:03:00: SMB: 3: SSXAK=LOGON_FAILURE Client=x.x.x.x origin=510 stat=0x0,39756033 2012-06-21 07:03:42: KERBEROS: 5: Warning: send_as_request: Realm COMPANY.NET – KDC X.X.X.X returned error: Clients credentials have been revoked (18)
 

The final resolution to the problem was to reboot the data mover. EMC determined that the issue was because the kerberos keytab entry for the CIFS server was no longer valid. It could be caused by corruption or because the machine account password expired. A reboot of the data mover causes the kerberos keytab and SPN credentials to be resubmitted, thus resolving the problem.
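For reference, rebooting a single data mover is done from the control station with server_cpu. This is the standard command for server_2; schedule it for a maintenance window, since CIFS and NFS access through that data mover will be interrupted while it reboots:

server_cpu server_2 -reboot -monitor now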

How to troubleshoot EMC Control Center WLA Archive issues

We're running EMC Control Center 6.1 UB12, and we use it primarily for its robust performance data collection and reporting capabilities.  Performance Manager is a great tool and I use it frequently.

Over the years I’ve had occasional issues with the WLA Archives not collecting performance data and I’ve had to open service requests to get it fixed.  Now that I’ve been doing this for a while, I’ve collected enough info to troubleshoot this issue and correct it without EMC’s assistance in most cases.

Check your ..\WLAArchives\Archives directory and look under the Clariion (or Celerra) folder, then the folder with your array's serial number, then the interval folder.  This is where the “*.ttp” (text) and “*.btp” (binary) performance data files are stored for Performance Manager.  Sort by date.  If there isn't a new file that's been written in the last few hours, data is not being collected.

Here are the basic items I generally review when data isn’t being collected for an array:

  1. Log in to every array in Unisphere, go to system properties, and on the ‘General’ tab make sure statistics logging is enabled.  I’ve found that if you don’t have an analyzer license on your array and start the 7 day data collection for a “naz” file, after the 7 days is up the stats logging option will be disabled.  You’ll have to go back in and re-enable it after the 7 day collection is complete.  If stats logging isn’t enabled on the array the WLA data collection will fail.
  2. If you recently changed the password on your Clariion domain account, make sure that naviseccli is updated properly for security access to all of your arrays (use the “addusersecurity” CLI option; see the example after this list) and perform a rediscovery of all your arrays from within the ECC console.  There is no way from within the ECC console to update the password on an array; you must go through the discovery process again for all of them.
  3.  Verify the agents are running.  In the ECC console, click on the gears icon in the lower right hand corner.  It will create a window that shows the status of all the agents, including the WLA Archiver.  If WLA isn’t started, you can start it by right clicking on any array, choosing Agents, then start.  Check the WLAArchives  directories again (after waiting about an hour) and see if it’s collecting data again.
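As a reference for step 2, the security file that naviseccli uses is refreshed like this. This is a hedged example only; run it under the account that the agent uses on the agent host, with the account name and password as placeholders and scope 0 being the global scope:

naviseccli -addusersecurity -scope 0 -user <global_account> -password <new_password>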

If those basic steps don’t work, checking the logs may point you in the right direction:

  1.  Review the Clariion agent logs for errors.  You're not looking for anything specific here; just do a search for “error”, “unreachable”, or the specific IPs of your arrays and see if there is anything obviously wrong. 
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Bx.log.gz
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL.ini
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Err.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Bx_Err.log
            %ECC_INSTALL_ROOT%\exec\MGL610\MGL_Discovery.log.gz
 

Here’s an example of an error I found in one case:

            MGL 14:10:18 C P I 2536   (29.94 MB) [MGLAgent::ProcessAlert] => Processing SP
            Unreachable alert. MO = APM00100600999, Context = Clariion, Category = SP
            Element = Unreachable
 

      2.   Review the WLA Agent logs.  Again, just search for errors and see if there is anything obvious that’s wrong. 

            %ECC_INSTALL_ROOT%\exec\ENW610\ENW.log
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Bx.log.gz
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW.ini
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Err.log
            %ECC_INSTALL_ROOT%\exec\ENW610\ENW_Bx_Err.log
 

If the logs don’t show anything obvious, here are the steps I take to restart everything.  This has worked on several occasions for me.

  1. From the Control Center console, stop all agents on the ECC Agent server.  Do this by right clicking on the agent server (in the left pane), choose agents and stop.  Follow the prompts from there.
  2. Log in to the ECC Agent server console and stop the master agent.  You can do this in Computer Management | Services, stop the service titled “EMC ControlCenter Master Agent”.
  3. From the Control Center console, stop all agents on the Infrastructure server.  Do this by right clicking on the agent server (in the left pane), choose agents and stop.  Follow the prompts from there.
  4. Verify that all services have stopped properly.
  5. From the ECC Agent server console, go to C:\Windows\ECC\ and delete all .comfile and .lck files.
  6. Restart all agents on the Infrastructure server.
  7. Restart the Master Agent on the Agent server.
  8. Restart all other services on the Agent server.
  9. Verify that all services have restarted properly.
  10. Wait at least an hour and check to see if the WLA Archive files are being written.

If none of these steps resolve your problem and you don’t see any errors in the logs, it’s time to open an SR with EMC.  I’ve found the EMC staff  that supports ECC to be very knowledgeable and helpful.

 

 

Errors when creating new replication jobs

I was attempting to create a new replication job on one of our VNX5500s and was receiving several errors when selecting our DR NS-960 as the ‘destination Celerra network server’.

It was displaying the following errors at the top of the window:

– “Query VDMs All.  Cannot access any Data Mover on the remote system, <celerra_name>”. The error details directed me to check that all the Data Movers are accessible, that the time difference between the source and destination doesn't exceed 10 minutes, and that the passphrase matches.  I confirmed that all of those were fine.

– “Query Storage Pools All.  Remote command failed:\nremote celerra – <celerra_name>\nremote exit status =0\nremote error = 0\nremote message = HTTP Error 500: Internal Server Error”.  The error details on this message say to search powerlink, not a very useful description.

– “There are no destination pools available”.  The details on this error say to check available space on the destination storage pool.  There is 3.5TB available in the pool I want to use on the destination side, so that wasn’t the issue either.

All existing replication jobs were still running fine so I knew there was not a network connectivity problem.  I reviewed the following items as well:

– I was able to validate all of the interconnects successfully, that wasn’t the issue.

– I ran nas_cel -update on the interconnects on both sides and received no errors, but it made no difference.

– I checked the server logs and didn’t see any errors relating to replication.

Not knowing where to look next, I opened an SR with EMC.  As it turns out, it was a security issue.

About a month ago an EMC CE accidentally deleted our global security accounts during a service call.  I had recreated all of the deleted accounts and didn't think there would be any further issues.  Logging in with the re-created nasadmin account after the accidental deletion was the root cause of the problem.  Here's why:

The Clariion global user account is tied to a local user account on the control station in /etc/passwd. When nasadmin was recreated on the domain, it attempted to create the nasadmin account on the control station as well.  Because the account already existed as a local account on the control station, it created a local account named ‘nasadmin1’ instead, which is what caused the problem.  The two nasadmin accounts were no longer synchronized between the Celerra and the Clariion domain, so when logging in with the global nasadmin account you were no longer tied to the local nasadmin account on the control station.  Deleting all nasadmin accounts from the global domain and from the local /etc/passwd on the Celerra, and then recreating nasadmin in the domain, solves the problem.  Because the issue was related only to the nasadmin account in this case, I could have also solved the problem by simply creating a new global account (with administrator privileges) and using that to create the replication job.  I tested that as well and it worked fine.

Long Running FAST VP relocation job

I've noticed that our auto-tier data relocation job that runs every evening consistently shows 2+ days for the estimated time of completion. We have it set to run only 8 hours per day, so with our current configuration it's likely the job will never reach a completed state. Based on that observation, I started investigating what options I had to reduce the amount of time the relocation job runs.

Running this command will tell you the current amount of time estimated to complete the relocation job data migrations and how much data is queued up to move:

naviseccli -h <clariion_ip> autotiering -info -opStatus

Auto-Tiering State: Enabled
Relocation Rate: Medium
Schedule Name: Default Schedule
Schedule State: Enabled
Default Schedule: Yes
Schedule Days: Sun Mon Tue Wed Thu Fri Sat
Schedule Start Time: 22:00
Schedule Stop Time: 6:00
Schedule Duration: 8 hours
Storage Pools: Clariion1_SPB, Clariion2_SPA
Storage Pool Name: Clariion2_SPA
Storage Pool ID: 0
Relocation Start Time: 12/05/11 22:00
Relocation Stop Time: 12/06/11 6:00
Relocation Status: Inactive
Relocation Type: Scheduled
Relocation Rate: Medium
Data to Move Up (GBs): 2854.11
Data to Move Down (GBs): 1909.06
Data Movement Completed (GBs): 2316.00
Estimated Time to Complete: 2 days, 9 hours, 12 minutes
Schedule Duration Remaining: None
 

I'll review some possibilities based on research I've done in the past few days.  I'm still in the evaluation process and have not made any changes yet; I'll update this blog post once I've implemented a change myself.  If you are having issues with your data relocation job not finishing, I would recommend opening an SR with EMC support for a detailed analysis before implementing any of these options.

1. Reduce the number of LUNs that use auto-tiering by disabling it on a LUN-by-LUN basis.

I would recommend monitoring which LUNs have the highest rate of change when the relocation job runs and then evaluating whether any can be removed from auto-tiering altogether.  The goal of this is to reduce the amount of data that needs to be moved.  The one caveat with this process is that when a LUN has auto-tiering disabled, the tier distribution of the LUN will remain exactly the same from the moment it is disabled.  If you disable it on a LUN that is using a large amount of EFD, that distribution will not change unless you force the LUN to a different tier or re-enable auto-tiering later.

This would be an effective way to reduce the amount of data being relocated, but the process of determining which LUNs should have auto-tiering disabled is subjective and would require careful analysis.
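If you do go this route, the tiering policy on a pool LUN can be changed from the CLI as well as from Unisphere. This is a hedged sketch with a placeholder LUN ID, and the exact policy keywords can differ between FLARE releases, so check the CLI reference for your array first:

naviseccli -h <clariion_ip> lun -modify -l 25 -tieringPolicy noMovement       (stop data movement for LUN 25)
naviseccli -h <clariion_ip> lun -list -l 25 -tieringPolicy                    (verify the LUN's current policy)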

2. Reset all the counters on the relocation job.

Any incorrectly labeled “hot” data will be removed from the counters and all LUNs would be re-evaluated for data movement.  One of the potential problems with auto-tiering is with servers that have IO intensive batch jobs that run infrequently.  That data would be incorrectly labeled as “hot” and scheduled to move up even though the server is not normally busy.  This information is detailed in emc268245.

To reset the counters, use the command to stop and start autotiering:

naviseccli -h <clariion_ip> autotiering -relocation -<stop | start>

If you need to temporarily stop relocation and do not want to reset the counters, use the pause/resume command instead:

naviseccli -h <clariion_ip> autotiering -relocation -<pause | resume>

I also wanted to point out that changing a specific LUN from “Auto-Tier” to “No Movement” does not reset the counters; the LUN will maintain its tiering schedule.  It is effectively the same as pausing auto-tiering for just that LUN.

3. Increase free space available on the storage pools.

If your storage pools are nearly 100% utilized, there may not be enough free space to effectively migrate data between the tiers.  Add additional disks to the pool, or migrate LUNs to other RAID groups or storage pools to free up capacity.
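
To see how much headroom each pool actually has before adding spindles, the capacities can be checked from the CLI.  A quick sketch (a plain storagepool -list dumps the pool properties, including user, consumed, and available capacity; as I recall, EMC’s FAST VP guidance is to keep roughly 10% free per tier so relocations have room to work):

naviseccli -h <clariion_ip> storagepool -list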

4. Increase the relocation rate.

Increasing the rate can of course have a dramatic effect on I/O performance, so it should only be changed during periods of measured low I/O activity.

Run this command to change the data relocation rate:

naviseccli -h <clariion_ip> autotiering -setRate -rate <high | medium | low>

5. Use a batch or shell script to pause and restart the job with the goal of running it more frequently during periods of low IO activity.

There is no way to set the relocation schedule to run at different times on different days of the week, so a script is necessary to accomplish that.  I currently run the job only in the middle of the night during off-peak (non-business) hours, but I would be able to run it all weekend as well.  I have done that manually in the past.

You would need to use an external Windows or Unix server to schedule the scripts.  Set the relocation schedule on the array to run 24×7, then use the pause/resume commands to keep the job from running during the times you don’t want it to.  To have it run overnight and on weekends, set up two separate scripts (one for pause and one for resume), then schedule each with Task Scheduler or cron to run throughout the week.

The cron schedule below would allow it to run from 10PM to 6AM on weeknights, and continuously from 10PM Friday until 6AM Monday over the weekend.

pause.sh:       naviseccli -h <clariion_ip> autotiering -relocation -pause

resume.sh:      naviseccli -h <clariion_ip> autotiering -relocation -resume

# crontab entries on the scheduling server (day-of-week 1-5 = Monday through Friday)
0 6 * * 1-5    /scripts/pause.sh     # 6AM Monday-Friday – pause
0 22 * * 1-5   /scripts/resume.sh    # 10PM Monday-Friday – resume
# No pause entry exists for Saturday or Sunday, so the Friday 10PM resume lets the job run all weekend until the Monday 6AM pause.
 

Problem with soft media errors on SSD drives and FastCache

4/25/2012 Update:  EMC has released a fix for this issue.  Call your account service representative and say you need to upgrade your NS-960 DART to 6.0.55.300 and FLARE to 4.30.000.5.524, plus a drive firmware upgrade to TC3Q on all SSD drives.

Do you have FastCache enabled on your array?  Keep a close eye on your SP event logs for soft media errors on your SSD drives.  I just noticed over 2000 soft media errors on one of my FastCache enabled arrays, and found a technical advisory from EMC (emc282741) that describes this as a potentially critical problem.  I just opened a case with EMC for my array to be reviewed for a possible disk replacement.  If a second disk drive in the same FastCache RAID group encounters soft media errors before the system automatically retires the first drive, a dual-faulted RAID group could occur.  This can result in storage pools going offline and becoming completely inaccessible to the attached hosts.  That’s basically a total SAN outage, not good.

Look for errors like the following in your SP event logs:

“Date Stamp”  “Time Stamp” Bus1 Enc1 Dsk0  820 Soft Media Error [Bad block]

EMC states in emc282741 that enhancements are targeted for Q1 2012 to address SSD media errors and dual hardware faults.  In the meantime, make sure you review the SP logs if you have CLARiiON or VNX arrays that are configured with SSD disk drives or are using FAST Cache.  If any instance of the “Soft Media Error” listed above is associated with one of the solid state disk drives in your arrays, the array should be upgraded to at least FLARE Release 04.30.000.5.522 (for CX4 Series arrays) or Release 05.31.000.5.509 (for VNX Series arrays), and you should then start a Proactive Copy (PACO) to a hot spare and replace the drive as soon as possible.

In order to quickly review this on each of my arrays, I wrote the following script to update my intranet site with a report every morning:

naviseccli -h clariion1a getlog >clariion1a.txt
naviseccli -h clariion1b getlog >clariion1b.txt  
cat clariion1a.txt | grep -i 'soft media' >clariion1_softmedia_errors.csv
cat clariion1b.txt | grep -i 'soft media' >>clariion1_softmedia_errors.csv
./csv2htm.pl -e -T -i /home/scripts/clariion1_softmedia_errors.csv -o /<intranet_web_server>/clariion1_softmedia_errors.html
 

The script dumps the entire SP log from each SP into a text file, greps for only soft media errors in each file, then converts the output to HTML and writes it to my intranet web server.
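
If you want a quick per-drive tally rather than scanning the whole list, a one-liner like the one below can be tacked onto the end of the script.  It’s just a sketch: it assumes the Bus/Enclosure/Disk location lands in the third through fifth whitespace-separated fields of the getlog output, as in the sample line above, so adjust the field numbers to match your log format.

awk '{print $3, $4, $5}' clariion1_softmedia_errors.csv | sort | uniq -c | sort -rn

Any drive whose count keeps climbing from day to day is the one to open the proactive replacement case on.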

 

Powerpath commands in AIX causing unexpected errors / initialization errors.

We recently had a problem with one of our AIX VIO servers not being able to run any PowerPath commands.  Any attempt to run a command would result in an unexpected error or an initialization error.  After speaking with EMC about it, the root cause is usually either running out of space on the root filesystem or having the data and stack ulimit parameters set too low after adding a large number of new LUNs.  We are running AIX 6.1 on an IBM pSeries 550 with PowerPath 5.3 HF1.

Here are the errors that were popping up:

root@vioserver1:/script # powermt config
Unexpected error occured.

root@vioserver1:/script # powermt display dev=all
Initialization error.

root@vioserver1:/script # naviseccli -h <san_dns_name> lun -list -all
evp_enc.c(282): OpenSSL internal error, assertion failed: inl > 0
ksh: 503926 IOT/Abort trap(coredump)

Having too many LUNs for the configured limits caused the issue; we had recently added an additional 35 for a total of 70.  Increasing the data and stack parameters to ‘unlimited’ resolved the problem.

root@vioserver1:/script # ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        unlimited
memory(kbytes)       unlimited
coredump(blocks)     2097151
nofiles(descriptors) 2000
threads(per process) unlimited
processes(per user)  unlimited
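
For reference, this is roughly how the limits get raised on AIX.  It’s a sketch, assuming the root user is the one running the PowerPath commands: a value of -1 in /etc/security/limits means unlimited, and chuser changes only apply to sessions started after the change.

chuser data=-1 stack=-1 root

ulimit -d unlimited
ulimit -s unlimited

The chuser command makes the change persistent, while the two ulimit commands raise the soft data and stack limits for the current shell so you don’t have to log out and back in.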

VMWare/ESX can’t write to a Celerra Read/Write NFS mounted datastore

I had just created several new Celerra NFS mounted datastores for our ESX administrator.  When he tried to create new VM hosts using the new datastores, he would get this error:   Call “FileManager.MakeDirectory” for object “FileManager” on vCenter Server “servername.company.com” failed.

Searching for that error message on powerlink, the VMWare forums, and general google searches didn’t bring back any easy answers or solutions.  It looked like ESX was unable to write to the NFS mount for some reason, even though it was mounted as Read/Write.  I also had the ESX hosts added to the R/W access permissions for the NFS export.

After much digging and experimentation, I did resolve the problem.  Here’s what you have to check (CLI equivalents for a few of these steps are sketched after the list):

1. The VMKernel IP must be in the root hosts permissions on the NFS export.   I put in the IP of the ESX server along with the VMKernel IP.

2. The NFS export must be mounted with the no_root_squash option.  By default, the root user (UID 0) is not given access to an NFS volume; mounting the export with no_root_squash allows the root user access.  The VMkernel must be able to access the NFS volume with UID 0.

I first set up the exports and permissions settings in the GUI, then went to the CLI to add the mount options.
command:  server_mount server_2 -option rw,uncached,sync,no_root_squash <sharename> /<sharename>

3. From within the ESX Console/Virtual Center, the Firewall settings should be updated to add the NFS Client.   Go to ‘Configuration’ | ‘Security Profile’ | ‘Properties’ | Click the NFS Client checkbox.

4. One other important item to note when adding NFS mounted datastores is the default limit of 8 NFS volumes in ESX.  You can increase the limit by going to ‘Configuration’ | ‘Advanced Settings’ | ‘NFS’ in the left column, scrolling to ‘NFS.MaxVolumes’, and increasing the number (up to 64).  If you try to add a new datastore above the NFS.MaxVolumes limit, you will get the same error quoted at the top of this post.
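
For reference, steps 1, 3, and 4 can also be done from the command line.  This is a rough sketch: the host lists in the server_export options are colon-separated, the IPs are placeholders, and the two esxcfg commands are for the classic ESX service console, so confirm the exact syntax for your ESX version before using them.

server_export server_2 -Protocol nfs -option rw=<esx_ip>:<vmkernel_ip>,root=<esx_ip>:<vmkernel_ip> /<sharename>

esxcfg-firewall -e nfsClient

esxcfg-advcfg -s 64 /NFS/MaxVolumes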

That’s it.  Adding the VMKernel IP to the root permissions, mounting with no_root_squash, and adding the NFS Client to ESX resolved the problem.

Unable to provision Celerra storage?

This one really made no sense to me at first.  I was attempting to create a new file storage pool on our NS-960 Celerra.  When I launched the disk provisioning wizard, it would pause for a minute and then give the following error:

“ERROR: Unable to continue provisioning.  click for details”

It was strange because I have plenty of disks that could be used for provisioning.  Why wasn’t it working?

Here is the detailed error message:

Message Details:

Message: Unable to continue provisioning

Full Description:  Not able to fetch disk information. n Command Failed, error code: 1, output: errormessage:string=”Timeout (60 seconds) waiting for state SS_DISKS_LOADED”

Recommended Action: No recommended action is available. Go to http://powerlink.EMC.com for more information.

Event Code: 15301214354

As a workaround and to test the issue, I used 8 spare SATA drives to create a new RAID group with one large LUN using all of the space.  I added it to the Celerra storage group and rescanned the SAN for storage.

The following error popped up:

Brief Description:  Invalid credentials for the storage array APM01034413494. 
Full Description:  The FLARE version on this storage array requires secure communication. Saved credentials are found, but authentication failed. The Control Station does not have valid credentials. 
Recommended Action:  Set the credentials by running the “nas_storage -modify <backend_name> -security” command. 
Message ID:  13422231564 

Well, how about that.  An error message that actually gives you the command to resolve the problem. 🙂  As it turns out, one of our other SAN administrators had changed the password for the system account.  Running nas_storage -modify id=<xx> -security resolved the problem.

(Note: You can get the ID number by running nas_storage -list)

DM Interconnect failure with Celerra Replicator

We installed a new VNX 5500 a few weeks ago in the UK, and I initially set up a VDM replication job between it and its replication partner, an NS-960 in Canada.  The setup went fine with no errors, and replication of the VDM completed successfully every day until yesterday, when I noticed that the status on the main replications screen said “network communication has been lost”.   I am able to use the server_ping command to ping the data mover/replication interface from the UK to Canada, so network connectivity appears to be OK.

I was attempting to set up new replication jobs for the filesystems on this VDM, and the background tasks to create them were stuck at “Establishing communication with secondary side for Create task” with a status of “Incomplete”.

I went to the DM interconnect next to validate that it was working, and the validation test failed with the following message: “Validate Data Mover Interconnect server_2:<SAN_name>. The following interfaces cannot connect: source interface=10.x.x.x destination interface=10.x.x.x, Message_ID=13160415446: Authentication failed for DIC communication.”

So, why is the DM Interconnect failing?   It was working fine for several weeks!

My next trip was to the server log (>server_log server_2) where I spotted another issue.  Hundreds of entries that looked just like these:

2011-07-07 16:32:07: CMD: 6: CmdReplicatev2ReversePri::startSecondary dicSt 16 cmdSt 214
2011-07-07 16:32:10: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:10: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16
2011-07-07 16:32:12: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:12: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16

Bad Authentication? Hmmm.  There is something amiss with the trusted relationship between the VNX and the NS960.  I did a quick read of EMC’s VNX replication manual (yep, rtfm!) and found the command to update the interconnect, nas_cel.

First, run nas_cel -list to view all of your interconnects, noting the ID number of the one you’re having difficulty with.

[nasadmin@<name> ~]$ nas_cel -list
id    name          owner mount_dev  channel    net_path                                      CMU
0     <name_1>  0                               10.x.x.x                                   APM007039002350000
2     <name_2>      0                           10.x.x.x                                   APM001052420000000
4     <name_3>      0                           10.x.x.x                                   APM009015016510000
5     <name_4>       0                           10.x.x.x                                  APM000827205690000

In this case, I was having trouble with <name_3>, which is ID 4.

Run this command next:  nas_cel -update id=4.   After that command completed, my interconnect immediately started working and I was able to create new replication jobs.
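
To confirm the fix without going back into the GUI, the data mover interconnects can also be checked from the CLI.  A sketch from my notes (the interconnect IDs are separate from the nas_cel IDs shown above, so list them first):

nas_cel -interconnect -list

nas_cel -interconnect -validate id=<interconnect_id>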

A guide for troubleshooting CIFS issues on the Celerra

In my experience, every CIFS issue you may have will fall into 8 basic areas, the first five being the most common.   Check all of these things and I can almost guarantee you will resolve your problem. 🙂  A few of the CLI commands I use for these checks are sketched after the list.

1. CIFS Service.  Check and make sure the CIFS Service is running; server_cifs server_2 shows the current CIFS configuration and status, and if the service is stopped it can be started with:  server_setup server_2 -Protocol cifs -option start

2. DNS.  Check and make sure that your DNS server entries on the Celerra are correct, that you’re configured to point to at least two, and that they are up and running with the DNS Service running.

3. NTP.  Make sure your NTP server entry is correct on the Celerra, and that the IP is reachable on the network and is actively providing NTP services.

4. User Mapping.  Verify that user mapping (Usermapper or whichever mapping method you use) is configured and working, so that Windows SIDs are resolved to the correct UIDs and GIDs.

5. Default Gateway.  Double check your default gateway in the Celerra’s routing table.  Get the network team involved if you’re not sure.

6. Interfaces.  Make sure the interfaces are physically connected and properly configured.

7. Speed/Duplex.  Make sure the speed and duplex settings on the Celerra match those of the switch port that the interfaces are plugged in to.

8. VLAN.  Double check your VLAN settings on the interfaces, make sure it matches what is configured on the connected switch.
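
To make the first several checks quicker, here are a few of the Celerra CLI commands I lean on, as a rough sketch (run them against the data mover in question; exact options and output vary a bit by DART release, so double check the man pages):

server_cifs server_2              (CIFS service status and configuration, including joined domains)
server_dns server_2               (DNS domains and name server entries configured on the data mover)
server_date server_2              (current date and time on the data mover, a quick sanity check on NTP)
server_ifconfig server_2 -all     (interface configuration, including VLAN settings)
server_route server_2 -list       (routing table, including the default gateway)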