Category Archives: troubleshooting-emc

VNX NAS Files incorrectly report as Locked for Editing

When opening a shared Microsoft Office file, you may see the error message “File in Use, file_name is locked for editing by user_name”, when in fact no other user is currently using the file.

In our case, users would view files in the Explorer preview pane, which created a lock on the file, and the lock would remain even after the Explorer window was closed.  The next time the file was accessed it would report as locked even though the user didn’t have it open.  Below are some steps you can take to troubleshoot and resolve the issue.  Note that changing some of these parameters can have a performance impact, so make these changes at your own risk.  For background, oplocks (opportunistic locks) let clients lock files and locally cache data while preventing another user from changing the file, which improves performance for many file operations.

1. Disable Oplocks on the VNX

Disabling oplocks can affect client performance. It will increase the number of metadata requests that are sent to the server because when you use SMB with oplocks, the client caches the data that is locked to speed up access to frequently accessed files. When oplocks are disabled, the client does not cache data and all reads are made directly to the NAS server.

Syntax for disabling oplocks and verifying the change:

[nasadmin@VNX1 ~]$ server_mount vdm_file_system -o nooplock test_file_system01_fs /test_file_system01
vdm_file_system : done

[nasadmin@VNX1 ~]$ server_mount vdm_file_system | grep test_file_system01_fs
test_file_system01_fs on /test_file_system01 uxfs,perm,rw,noprefetch,nonotify,accesspolicy=NATIVE,nooplock

2. Disable caching on the Windows client

The Windows client setting controls the cache lifetime. As stated earlier, if caching is disabled on the Windows client then all reads go directly to the NAS server. To disable caching on the Windows client rather than disabling oplocks on the VNX Data Mover, the following three registry values under this key need to be changed (example commands follow the list):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters\

– Directory cache: set DirectoryCacheLifetime to 0.
– File Not Found cache: set FileNotFoundCacheLifetime to 0.
– File information cache: set FileInfoCacheLifetime to 0.
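
As a rough sketch, the same values could be set from an elevated command prompt with reg.exe; the value names mirror the list above, but treat this as an example to verify in your own environment (a reboot, or at least a new SMB session, may be needed for the change to take effect):

reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters" /v DirectoryCacheLifetime /t REG_DWORD /d 0 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters" /v FileNotFoundCacheLifetime /t REG_DWORD /d 0 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters" /v FileInfoCacheLifetime /t REG_DWORD /d 0 /f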

3. Apply a Microsoft hotfix

The Microsoft KB article http://support.microsoft.com/kb/942146 describes the problem and the fix in detail.  It directly addresses the issue with files locking from the preview pane.  It applies to all versions of Windows Vista and 7 as well as Windows Server 2008.


VPLEX Unisphere Login hung at “Retrieving Meta-Volume Information”

I recently had an issue where I was unable to log in to the Unisphere GUI on the VPLEX, it would hang with the message “Retrieving Meta-Volume Information” after progressing about 30% on the progress bar.

This was caused by a hung Java process.  In order to resolve it, you must restart the management server. This will not cause any disruption to hosts connected to the VPLEX.

To do this, run the following command:

ManagementServer:/> sudo /etc/init.d/VPlexManagementConsole restart

If this hangs or does not complete, you will need to run the top command to identify the PID for the java service:

admin@service:~>top
Mem:   3920396k total,  2168748k used,  1751648k free,    29412k buffers
Swap:  8388604k total,    54972k used,  8333632k free,   527732k cached

  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
26993 service   20   0 2824m 1.4g  23m S     14 36.3  18:58.31 java
 4948 rabbitmq  20   0  122m  42m 1460 S      1  1.1  13118:32 beam.smp
    1 root      20   0 10540   48   36 S      0  0.0  12:34.13 init

Once you’ve identified the PID for the java service (26993 in the sample output above), you can kill the process with the kill command, and then run the command to restart the management console again.

ManagementServer:/> sudo kill -9 26993
ManagementServer:/> sudo /etc/init.d/VPlexManagementConsole start

Once the management server restarts, you should be able to log in to the Unisphere for VPLEX GUI again.

Rescan Storage System command on Celerra results in conflict:storageID-devID error

I was attempting to extend our main production NAS file pool on our NS-960 and ran into an issue.  I had recently freed up 8 SATA disks from a block pool and was attempting to re-use them to extend a Celerra file pool.  I created a new RAID group and a LUN that used the maximum capacity of the RAID group.  I then added the LUN to the Celerra storage group, making sure to set the HLU to a number greater than 15.  I then changed the setting on our main production file pool to auto-extend and clicked on the “Rescan Storage Systems” option.  Unfortunately, rescanning produced an error every time it was run, even though I have done this exact same procedure in the past and it’s worked fine.  Here is the error:

conflict:storageID-devID: disk=17 old:symm=APM00100600999,dev=001F new:symm=APM00100600999,dev=001F addr=c16t1l11

I checked the disks on the Celerra using the nas_disk -l command, and the new disk shows up as “in use” even though the rescan command didn’t properly complete.

[nasadmin@Celerra tools]$ nas_disk -l
id   inuse  sizeMB    storageID-devID      type   name  servers
17    y     7513381   APM00100600999-001F  CLATA  d17   <BLANK>

Once the dvol is presented to the Celerra (assuming the rescan goes fine), it should not show as in use until it is assigned to a storage pool and a file system uses it.  In this case that didn’t happen.  If you run /nas/tools/whereisfs (depending on your DART version, it may be “.whereisfs” with the dot), it shows a listing of every file system and the disks and LUNs they reside on.  I verified with that command that the disk was not actually in use.

To be on the safe side, I opened an SR with EMC rather than simply deleting the disk.  They suspect the NAS database has become corrupted. I’m going to have EMC’s Recovery Team check the usage of the diskvol and then delete it and re-add it.  In order to engage the recovery team you need to sign a “Data Deletion Form” absolving EMC of any liability for data loss, which is standard practice when they delete volumes on a customer array.  If there are any further caveats or important things to note after EMC has taken care of this, I’ll update this post.

VPLEX initiator paths dropped

We recently ran into an SP bug check on one of our VNX arrays, and after the SP came back online several of the initiator paths to the VPLEX did not recover.  We were also seeing IO timeouts.  This is a known bug that is triggered by an SP reboot and is fixed with Patch 1 for GeoSynchrony 5.3.  EMC has released a script that provides a workaround until the patch can be applied: https://download.emc.com/downloads/DL56253_VPLEX_VNX_SCRIPT.zip.zip

The following pre-conditions need to happen during a VNX NDU to see this issue on VPLEX:
1] During a VNX NDU, SPA goes down.
2] At this point IO time-outs start happening on IT nexuses pertaining to SPA.
3] The IO time-outs cause the VPLEX SCSI layer to send LU Reset TMFs. These LU Reset TMFs get timed out as well.

You can review ETA 000193541 on EMC’s support site for more information.  It’s a critical bug and I’d suggest patching as soon as possible.

 

VPLEX Health Check

This is a brief post to share the CLI commands and sample output for a quick VPLEX health check.  Our VPLEX had a dial home event and below are the commands that EMC ran to verify that it was healthy.  Here is the dial home event that was generated:

SymptomCode: 0x8a266032
SymptomCode: 0x8a34601a
Category: Status
Severity: Error
Status: Failed
Component: CLUSTER
ComponentID: director-1-1-A
SubComponent: stdf
CallHome: Yes
FirstTime: 2014-11-14T11:20:11.008Z
LastTime: 2014-11-14T11:20:11.008Z
CDATA: Compare and Write cache transaction submit failed, status 1 [Versions:MS{D30.60.0.3.0, D30.0.0.112, D30.60.0.3}, Director{6.1.202.1.0}, ClusterWitnessServer{unknown}] RCA: The attempt to start a cache transaction for a Scsi Compare and Write command failed. Remedy: Contact EMC Customer Support.

Description: The processing of a Scsi Compare and Write command could not complete.
ClusterID: cluster-1

Based on that error the commands below were run to make sure the cluster was healthy.

This is the general health check command:

VPlexcli:/> health-check
 Product Version: 5.3.0.00.00.10
 Product Type: Local
 Hardware Type: VS2
 Cluster Size: 2 engines
 Cluster TLA:
 cluster-1: FNM00141800023
 
 Clusters:
 ---------
 Cluster    Cluster  Oper   Health  Connected  Expelled  Local-com
 Name       ID       State  State
 ---------  -------  -----  ------  ---------  --------  ---------
 cluster-1  1        ok     ok      True       False     ok
 
 Meta Data:
 ----------
 Cluster    Volume                           Volume       Oper   Health  Active
 Name       Name                             Type         State  State
 ---------  -------------------------------  -----------  -----  ------  ------
 cluster-1  c1_meta_backup_2014Nov21_100107  meta-volume  ok     ok      False
 cluster-1  c1_meta_backup_2014Nov20_100107  meta-volume  ok     ok      False
 cluster-1  c1_meta                          meta-volume  ok     ok      True
 
 Director Firmware Uptime:
 -------------------------
 Director Firmware Uptime
 -------------- ------------------------------------------
 director-1-1-A 147 days, 16 hours, 15 minutes, 29 seconds
 director-1-1-B 147 days, 15 hours, 58 minutes, 3 seconds
 director-1-2-A 147 days, 15 hours, 52 minutes, 15 seconds
 director-1-2-B 147 days, 15 hours, 53 minutes, 37 seconds
 
 Director OS Uptime:
 -------------------
 Director OS Uptime
 -------------- ---------------------------
 director-1-1-A 12:49pm up 147 days 16:09
 director-1-1-B 12:49pm up 147 days 16:09
 director-1-2-A 12:49pm up 147 days 16:09
 director-1-2-B 12:49pm up 147 days 16:09
 
 Inter-director Management Connectivity:
 ---------------------------------------
 Director        Checking  Connectivity
                 Enabled
 --------------  --------  ------------
 director-1-1-A  Yes       Healthy
 director-1-1-B  Yes       Healthy
 director-1-2-A  Yes       Healthy
 director-1-2-B  Yes       Healthy
 
 Front End:
 ----------
 Cluster    Total    Unhealthy  Total       Total  Total     Total
 Name       Storage  Storage    Registered  Ports  Exported  ITLs
            Views    Views      Initiators         Volumes
 ---------  -------  ---------  ----------  -----  --------  -----
 cluster-1  56       0          299         16     353       9802
 
 Storage:
 --------
 Cluster    Total    Unhealthy  Total    Unhealthy  Total  Unhealthy  No     Not visible  With
 Name       Storage  Storage    Virtual  Virtual    Dist   Dist       Dual   from         Unsupported
            Volumes  Volumes    Volumes  Volumes    Devs   Devs       Paths  All Dirs     # of Paths
 ---------  -------  ---------  -------  ---------  -----  ---------  -----  -----------  -----------
 cluster-1  203      0          199      0          0      0          0      0            0
 
 Consistency Groups:
 -------------------
 Cluster    Total        Unhealthy    Total         Unhealthy
 Name       Synchronous  Synchronous  Asynchronous  Asynchronous
            Groups       Groups       Groups        Groups
 ---------  -----------  -----------  ------------  ------------
 cluster-1  0            0            0             0
 
 Cluster Witness:
 ----------------
 Cluster Witness is not configured

This command checks the status of the cluster:

VPlexcli:/> cluster status
Cluster cluster-1
operational-status: ok
transitioning-indications:
transitioning-progress:
health-state: ok
health-indications:
local-com: ok

This command checks the state of the storage volumes:

VPlexcli:/> storage-volume summary
Storage-Volume Summary (no tier)
----------------------  --------------------

Health    out-of-date      0
          storage-volumes  203
          unhealthy        0

Vendor    DGC              203

Use       meta-data        4
          used             199

Capacity  total            310T

Celerra Disk Provisioning Wizard incorrectly believes there are not enough drives available for provisioning

I recently had an issue attempting to extend our production NAS file pool on our NS-960.  We had just added six new 2TB SATA disks to the array, and when I launched the Disk Provisioning Wizard it gave me this error:

“The number of drives available for provisioning additional storage are insufficient”.

That of course wasn’t true, as a 4+2 RAID6 config is supported on this platform and I had just added six drives. I did come up with a workaround to do it manually, thanks to some helpful advice from our local EMC technical rep.  I manually created a RAID6 RAID group in a 4+2 config, and then created a single LUN using all of the available space in the RAID group (about 7337GB).   Once the LUN is created, you can add it to the Celerra storage group, which in my case was named “Celerra_hostname”.

When adding the LUN to the storage group, there is a critical step that you must not skip: the HLU number must be modified!  After you click on a LUN and click Add, look for it in the list and notice that the far right column shows the HLU (Host LUN ID).  The LUN you just added will have a blank entry.  It doesn’t look like an editable field, but it is – simply click on the blank area where the number should be and you’ll get a drop-down box.  The number you choose must be greater than 15.  Once you’ve modified the HLU for the new LUN, click OK to complete the process.
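
For reference, the same manual steps can also be done with naviseccli.  This is only a sketch; the RAID group ID, LUN number, disk IDs (bus_enclosure_disk), and storage group name below are placeholders, and you should confirm the bind options against your FLARE release:

naviseccli -h <sp_ip> createrg 10 1_0_0 1_0_1 1_0_2 1_0_3 1_0_4 1_0_5
naviseccli -h <sp_ip> bind r6 200 -rg 10
naviseccli -h <sp_ip> storagegroup -addhlu -gname Celerra_hostname -hlu 16 -alu 200

The -hlu 16 in the last command is what satisfies the “greater than 15” requirement described above.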

Next, you’ll want to switch back over to the Celerra Management interface, click on the ‘Storage’ tab, then click on the ‘Rescan Storage Systems’ link.  You will get a warning message that states:

“Rescan detects newly available storage and storage systems. Do not rescan unless all primary Data Movers are operating normally. The operation might take a few minutes to complete.”  

Heed the warning and make sure your data movers are up and functional.  You can monitor the progress in the background tasks area.   On my first attempt the Rescan failed.  I got this error message:

“Storage API code=3593: SYMAPI_C_CLARIION_LOAD_ERROR.  An error occurred while data was being loaded from a Clariion.” | “No additional information is available” | “No recommended action is available”.

By the time I got that error it was the end of my work day, so I decided to get back to it the next morning; I had planned on opening an SR.  When I re-ran the same rescan the next day it worked fine and my production pool auto-extended.  Problem solved.

Celerra / VNX File replication job creation errors when selecting destination

I recently had an issue with setting up a new Celerra file system replication job. As soon as I selected the destination system I received the three errors below.

1) Query VDMs All. Cannot access any Data Mover on the remote system, hostname

Severity: Error
Brief Description: Cannot access any Data Mover on the remote system, hostname
Full Description: No Data Movers are available on the specified remote system to perform this operation
Recommended Action: 1) Check if the Data Movers on the specified remote system are accessible. 2) Ensure that the difference in system times between the local and remote Celerra systems or VNX systems does not exceed 10 minutes. Use NTP on the Control Stations to synchronize the system clocks. 3) Ensure that the passphrase on the local Control Station matches with the passphrase on the remote Control Station. 4) Ensure that the same local users that manage VNX for file systems exist on both the source and the destination Control Station. 5) Ensure that the global account is mapped to the same local account on both local and remote VNX Control Stations. Primus emc263860 provides more details.
Message ID: 13690601568

2) Query storage pools All. Execution failed: Segmentation fault: Operating system signal. [STRING.make_from_string]

Severity: Error
Brief Description: Execution failed: Segmentation fault: Operating system signal. [STRING.make_from_string]
Full Description: Operation failed for the reason described in the accompanying message.
Recommended Action: Correct the cause of the problem and repeat the operation.
Message ID: 13421840573

3) There are no destination pools available.

Severity: Info
Brief Description: There are no destination pools available.
Full Description: Destination side storage pools are not available.
Recommended Action: Check whether the storage pools have enough space.
Message ID: 26845970450

I was unable to determine the cause of the problem so I opened an SR with EMC.

It turns out there was a user discrepancy between the /etc/passwd file and the /nas/site/user_db file. This was causing the following error when checking the interconnect:

[nasadmin@celerra02 log]$ nas_cel -interconnect -l
Error 2237: Execution failed: Segmentation fault: Operating system signal. [STRING.make_from_string]

The output should look something like this:

[nasadmin@celerra02 log]$ nas_cel -interconnect -l
id    name          source_server destination_system destination_server
20001 loopback      server_2      DRSITE1            server_2
20003 SITE1VNX5500  server_2      DRSITE1            server_2
20004 SITE2NS960    server_2      DRSITE2            server_2
20007 SITE11NS960   server_2      DRSITE1            server_2
20005 SITE3VNX5500  server_2      DRSITE1            server_2
20006 SITE4VNX5500  server_2      DRSITE1            server_2
20008 SITE5NS120    server_2      DRSITE1            server_2
40001 loopback      server_4      DRSITE1            server_4
40003 SITE2NS40     server_4      DRSITE2            server_2

The problem was resolved by removing the entries from /nas/site/user_db that were not in the /etc/passwd file. The root cause was a manual modification of the passwd file by a sysadmin: some old entries had been removed and the matching changes were never made in the user_db file.
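
A quick way to spot mismatched entries is to compare the usernames in the two files. This is only a sketch and assumes /nas/site/user_db uses a colon-delimited, passwd-style first field, so verify the file format on your system before relying on it:

cut -d: -f1 /etc/passwd | sort > /tmp/passwd_users.txt
cut -d: -f1 /nas/site/user_db | sort | comm -23 - /tmp/passwd_users.txt

Any names printed by the second command exist in user_db but not in /etc/passwd.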

 

Powerpath Install / Upgrade Issues

I recently had several issues when attempting to upgrade a Windows 2008 server from PowerPath v5.3 to v5.5 SP1. I uninstalled 5.3 using the Windows utility, rebooted, then installed v5.5 SP1. After the reboot, the server did not come back up. To get it to boot, the “last known good configuration” option had to be chosen. I opened an SR with EMC, and they determined that the uninstall process had not completed correctly.

To resolve the problem, you need to run the executable from the command line with a few extra parameters. The name of the PowerPath install file will vary depending on the version you are installing, but the command looks like this:

EMCPower.Net32.signed.5.3.b310.exe /v"/L*v C:\logs\PPremove.log NO_REBOOT=1 PPREMOVE=ALL"

In this example, the C:\logs directory must exist before you run the command. After using that command to uninstall PowerPath and then reinstalling the new version, I no longer had the problem of the server not booting correctly.

After properly installing it, I continued to have a problem with PowerPath Administrator not recognizing the devices. All of the devices showed up as “DEV ??”, and I also saw “harddisk ??” when running powermt display dev=all. To resolve the problem I ran through the following steps:

1. Open Device Manager under Disks, and right-click the device drive that had a yellow ‘!’.
2. Choose “Update Driver Software”.
3. Click on “Browse my computer for driver software”.
4. Click on “Let me pick from a list of device drivers on my computer”.
5. In the next screen make sure the “Show compatible hardware” box is checked.
6. Under the Model list you should see the ‘PowerPath Devices’ driver. Highlight it and click Next; this installs the PowerPath driver and requires a reboot when it finishes. Once the server has come back online, run ‘powermt display dev=all’ again to confirm that harddisk?? has changed to harddisk## as expected.

Close requests fail on data mover when using Riverbed Steelhead appliance

We recently had a problem with one of our corporate applications having file close requests fail, resulting in 200,000+ open files on our production data mover.  This was causing numerous issues within the application.  We determined that the problem was a result of our Riverbed Steelhead appliance requiring a certain level of DART code in order to properly close the files.  The Steelhead appliance would fail when attempting to optimize SMBv2 connections.

Because a DART code upgrade is required to resolve the problem, the only temporary fix is to reboot the data mover.  I wrote a quick script on the Celerra to grab the number of open files, write it to a text file, and publish to our internal web server.  The command to check how many open files are on the data mover is below.

This command provides all the detailed information:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Timestamp   Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
11:15:36     3379      905     9584        11        9      272        30         1856     4915   

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Summary     Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
Minimum      3379      905     9584        11        9      272        30         1856     4915   
Average      3379      905     9584        11        9      272        30         1856     4915   
Maximum      3379      905     9584        11        9      272        30         1856     4915   

Adding a grep for Maximum and using awk to grab only the last column, this command will output only the number of open files, rather than the large output above:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | grep Maximum | awk '{print $10}'

The output of that command would simply be ‘4915’ based on the sample full output I used above.
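
The script mentioned earlier was essentially a wrapper around that one-liner. Here is a minimal sketch of what it could look like; the output path and web server path are hypothetical placeholders, and it would be scheduled from cron on the Control Station:

#!/bin/sh
# Grab the current CIFS open file count from server_2
OPEN_FILES=`/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | grep Maximum | awk '{print $10}'`
# Append a timestamped entry and publish the file to the intranet web server
echo "`date '+%Y-%m-%d %H:%M'` $OPEN_FILES" >> /home/nasadmin/open_files.txt
cp /home/nasadmin/open_files.txt /<intranet_web_server_path>/open_files.txt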

The solution number from Riverbed’s knowledgebase is S16257.  Your DART code needs to be at least 6.0.60.2 or 7.0.52.  You will also see in your steelhead logs a message similar to the one below indicating that the close request has failed for a particular file:

Sep 1 18:19:52 steelhead port[9444]: [smb2cfe.WARN] 58556726 {10.0.0.72:59207 10.0.0.72:445} Close failed for fid: 888819cd-z496-7fa2-2735-0000ffffffff with ntstatus: NT_STATUS_INVALID_PARAMETER

Frequent 0x622 and 0x606 errors in the SP Event Logs

During some routine checking of the SP event logs on our NS-40 I noticed a large number of alerts. Every few seconds these three alerts would pop in:

0x60a Internal Information Only. A logical unit has been enabled
0x622 Background Verify Aborted
0x606 Unit Shutdown for trespass
 

After a bit of investigation, I narrowed down the cause to several large LUNs that had just been added to a new ESX host.  It turns out that the LUNs were still running the background zeroing process, and that’s what was causing all the alerts in the SP Log. When you create a new LUN and the disks have been previously used for other LUNs, the new LUN needs to be “zeroed” (filled with all zeros to clear data). This takes place in the background and it is part of the LUN initialization.  Once this background zeroing process completed on my new LUNs the alert messages stopped.  I was unaware of that process, so I did a bit of research on it.

LUNs are immediately available for use after a bind (using “Fastbind”); however, all the operations associated with a bind can take a long time to finish.  The duration of a LUN bind depends on these things:

  • LUN’s bind time background verify priority (rate)
  • Size of the LUN being bound
  • Type of drives in the LUN’s RAID Group
  • Potential disabling of initial verify on bind
  • State of the Storage System (Idle or Load)
  • Position of the LUN on the hard disks of the RAID Group

From that list, the priority, LUN size, drive type, and verification selection have the greatest effect on duration.  You can calculate the approximate duration of the bind process with this formula:

Time = Bound LUN Capacity * (1 / Bind Rate)

Here are the Average Bind Rates for FC and SATA disks:

Disk Type   ASAP Bind Rate   High Bind Rate   Medium (default) Bind Rate   Low Bind Rate
FC          83 MB/s          7.54 MB/s        5.02 MB/s                    4.02 MB/s
SATA        61.7 MB/s        7.47 MB/s        5.09 MB/s                    3.78 MB/s

If we were to calculate how many hours it would take to bind a 2000GB LUN on a five disk RAID5 group composed of SATA drives set to a medium (default) bind rate, here’s what the formula would look like:

Time = 2000 GB * 1024 MB/GB * (1 / 5.09 MB/s) * (1 hr / 3600 s) ≈ 111.76 hours
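
As a quick sanity check, the same arithmetic can be run from any shell that has bc available:

echo "2000 * 1024 / 5.09 / 3600" | bc -l

That returns roughly 111.8, in line with the estimate above.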

There is a detailed white paper that covers this topic from EMC called “The Effect of Priorities on LUN Management Operations” that you can view here:  http://www.emc.com/collateral/hardware/white-papers/h4153-influence-priorities-emc-clariion-lun-wp.pdf.  That’s where I gathered the information above.

Can’t join CIFS Server to domain – sasl protocol violation

I was running a live disaster recovery test of our Celerra CIFS Server environment last week and I was not able to get the CIFS servers to join the replica of the domain controller on the DR network.  I would get the error ‘Sasl protocol violation’ on every attempt to join the domain.

We have two interfaces configured on the data mover, one connects to production and one connects to the DR private network.  The default route on the Celerra points to the DR network and we have static routes configured for each of our remote sites in production to allow replication traffic to pass through.  Everything on the network side checked out, I could ping DC’s and DNS servers, and NTP was configured to a DR network time server and was working.

I was able to ping the DNS Server and the domain controller:

[nasadmin@datamover1 ~]$ server_ping server_2 10.12.0.5
server_2 : 10.12.0.5 is alive, time= 0 ms
 
[nasadmin@datamover1 ~]$ server_ping server_2 10.12.18.5
server_2 : 10.12.18.5 is alive, time= 3 ms
 

When I tried to join the CIFS Server to the domain I would get this error:

[nasadmin@datamover1 ~]$ server_cifs prod_vdm_01 -Join compname=fileserver01,domain=company.net,admin=myadminaccount -option reuse
prod_vdm_01 : Enter Password:*********
Error 13157007706: prod_vdm_01 : DomainJoin::connect:: Unable to connect to the LDAP service on Domain Controller ‘domaincontroller.company.net’ (@10.12.0.5) for compname ‘fileserver01’. Result code is ‘Sasl protocol violation’. Error message is Sasl protocol violation.
 

I also saw this error messages during earlier tests:

Error 13157007708: prod_vdm_01 : DomainJoin::setAccountPassword:: Unable to set account password on Domain Controller ‘domaincontroller.company.net’ for compname ‘fileserver01’. Kerberos gssError is ‘Miscellaneous failure. Cannot contact any KDC for requested realm. ‘. Error message is d0000,-1765328228.
 

I noticed these error messages in the server log:

2012-06-21 07:03:00: KERBEROS: 3: acquire_accept_cred: Failed to get keytab entry for principal host/fileserver01.company.net@COMPANY.NET – error No principal in keytab matches desired name (39756033)
2012-06-21 07:03:00: SMB: 3: SSXAK=LOGON_FAILURE Client=x.x.x.x origin=510 stat=0x0,39756033
2012-06-21 07:03:42: KERBEROS: 5: Warning: send_as_request: Realm COMPANY.NET – KDC X.X.X.X returned error: Clients credentials have been revoked (18)
 

The final resolution to the problem was to reboot the data mover. EMC determined that the issue occurred because the Kerberos keytab entry for the CIFS server was no longer valid. It could be caused by corruption or because the machine account password expired. A reboot of the data mover causes the Kerberos keytab and SPN credentials to be resubmitted, which resolved the problem.
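
For reference, a data mover reboot can be issued from the Control Station with server_cpu; this is shown from memory, so confirm the syntax for your DART release before running it:

[nasadmin@datamover1 ~]$ server_cpu server_2 -reboot now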

Errors when creating new replication jobs

I was attempting to create a new replication job on one of our VNX5500s and was receiving several errors when selecting our DR NS-960 as the ‘destination Celerra network server’.

It was displaying the following errors at the top of the window:

– “Query VDMs All.  Cannot access any Data Mover on the remote system, <celerra_name>”. The error details directed me to check that all the Data Movers are accessible, that the time difference between the source and destination doesn’t exceed 10 minutes, and that the passphrase matches.  I confirmed that all of those were fine.

– “Query Storage Pools All.  Remote command failed:\nremote celerra – <celerra_name>\nremote exit status =0\nremote error = 0\nremote message = HTTP Error 500: Internal Server Error”.  The error details on this message say to search powerlink, not a very useful description.

– “There are no destination pools available”.  The details on this error say to check available space on the destination storage pool.  There is 3.5TB available in the pool I want to use on the destination side, so that wasn’t the issue either.

All existing replication jobs were still running fine so I knew there was not a network connectivity problem.  I reviewed the following items as well:

– I was able to validate all of the interconnects successfully, that wasn’t the issue.

– I ran nas_cel -update on the interconnects on both sides and received no errors, but it made no difference.

– I checked the server logs and didn’t see any errors relating to replication.

Not knowing where to look next, I opened an SR with EMC.  As it turns out, it was a security issue.

About a month ago an EMC CE accidentally deleted our global security accounts during a service call.  I had recreated all of the deleted accounts and didn’t think there would be any further issues.  Logging in with the re-created nasadmin account after the accidental deletion was the root cause of the problem.  Here’s why:

The Clariion global user account is tied to a local user account on the Control Station in /etc/passwd. When nasadmin was recreated in the domain, it attempted to create the nasadmin account on the Control Station as well.  Because the account already existed as a local account on the Control Station, it created a local account named ‘nasadmin1‘ instead, which is what caused the problem.  The two nasadmin accounts were no longer synchronized between the Celerra and the Clariion domain, so when logging in with the global nasadmin account you were no longer tied to the local nasadmin account on the Control Station.  Deleting all nasadmin accounts from the global domain and from the local /etc/passwd on the Celerra, and then recreating nasadmin in the domain, solves the problem.  Because the issue was related only to the nasadmin account in this case, I could have also solved it by simply creating a new global account (with administrator privileges) and using that to create the replication job.  I tested that as well and it worked fine.
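
A quick way to spot this condition is to look for the stray local account on the Control Station; anything beyond the single expected nasadmin entry points to the same mismatch:

grep '^nasadmin' /etc/passwd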

Long Running FAST VP relocation job

I’ve noticed that our auto-tier data relocation job that runs every evening consistently shows 2+ days for the estimated time of completion. We have it set to run only 8 hours per day, so with our current configuration it’s likely the job will never reach a completed state. Based on that observation, I started investigating what options I had to reduce the amount of time the relocation job runs.

Running this command will tell you the current amount of time estimated to complete the relocation job data migrations and how much data is queued up to move:

naviseccli -h <clariion_ip> autotiering -info -opStatus

Auto-Tiering State: Enabled
Relocation Rate: Medium
Schedule Name: Default Schedule
Schedule State: Enabled
Default Schedule: Yes
Schedule Days: Sun Mon Tue Wed Thu Fri Sat
Schedule Start Time: 22:00
Schedule Stop Time: 6:00
Schedule Duration: 8 hours
Storage Pools: Clariion1_SPB, Clariion2_SPA
Storage Pool Name: Clariion2_SPA
Storage Pool ID: 0
Relocation Start Time: 12/05/11 22:00
Relocation Stop Time: 12/06/11 6:00
Relocation Status: Inactive
Relocation Type: Scheduled
Relocation Rate: Medium
Data to Move Up (GBs): 2854.11
Data to Move Down (GBs): 1909.06
Data Movement Completed (GBs): 2316.00
Estimated Time to Complete: 2 days, 9 hours, 12 minutes
Schedule Duration Remaining: None
 

I’ll review some possibilities based on research I’ve done in the past few days.  I’m still in the evaluation process and have not made any changes yet; I’ll update this blog post once I’ve implemented a change myself.  If you are having issues with your data relocation job not finishing, I would recommend opening an SR with EMC support for a detailed analysis before implementing any of these options.

1. Reduce the number of LUNs that use auto-tiering by disabling it on a LUN-by-LUN basis.

I would recommend monitoring which LUNs have the highest rate of change when the relocation job runs and then evaluate if any can be removed from auto-tiering altogether.  The goal of this would be to reduce the amount of data that needs to be moved.  The one caveat with this process is that when a LUN has auto-tiering disabled, the tier distribution of the LUN will remain exactly the same from the moment it is disabled.  If you disable it on a LUN that is using a large amount of EFD it will not change unless you force it to a different tier or re-enable auto-tiering later.

This would be an effective way to reduce the amount of data being relocated, but the process of determining which LUNs should have auto-tiering disabled is subjective and would require careful analysis.

2. Reset all the counters on the relocation job.

Any incorrectly labeled “hot” data will be removed from the counters and all LUNs will be re-evaluated for data movement.  One of the potential problems with auto-tiering is with servers that have IO-intensive batch jobs that run infrequently.  That data would be incorrectly labeled as “hot” and scheduled to move up even though the server is not normally busy.  This information is detailed in emc268245.

To reset the counters, use the command to stop and start autotiering:

naviseccli -h <clariion_ip> autotiering -relocation -<stop | start>

If you need to temporarily stop relocation and do not want to reset the counters, use the pause/resume command instead:

naviseccli -h <clariion_ip> autotiering -relocation -<pause | resume>

I wanted to point out that changing a specific LUN from “auto-tier” to “No Movement” also does not reset the counters; the LUN will maintain its tiering schedule. It is the same as pausing auto-tiering just for that LUN.

3. Increase free space available on the storage pools.

If your storage pools are nearly 100% utilized there may not be enough space to effectively migrate the data between the tiers.  Add additional disks to the pool, or migrate LUNs to other RAID groups or storage pools.

4. Increase the relocation rate.

This of course could have dramatic effects on IO performance if it’s increased and it should only be changed during periods of measured low IO activity.

Run this command to change the data relocation rate:

naviseccli -h <clariion_ip> autotiering -setRate -rate <high | medium | low>

5. Use a batch or shell script to pause and restart the job with the goal of running it more frequently during periods of low IO activity.

There is no way to set the relocation schedule to run at different times on different days of the week, so a script is necessary to accomplish that.  I currently run the job only in the middle of the night during off-peak (non-business) hours, but I would be able to run it all weekend as well; I have done that manually in the past.

You would need to use an external Windows or Unix server to schedule the scripts.  Set the relocation schedule to run 24×7, then use the pause/resume commands to stop the job during the times you don’t want it to run.  To have it run on weekends and overnight, set up two separate scripts (one for pause and one for resume), then schedule each with Task Scheduler or cron to run throughout the week.

The cron schedule below pauses the job at 6 AM and resumes it at 10 PM on weeknights, and it skips the weekend pause so relocation runs continuously from Friday at 10 PM until Monday at 6 AM.

pause.sh:    naviseccli -h <clariion_ip> autotiering -relocation -pause

resume.sh:   naviseccli -h <clariion_ip> autotiering -relocation -resume

0 6 * * 1  /scripts/pause.sh      # 6 AM Monday: pause
0 22 * * 1 /scripts/resume.sh     # 10 PM Monday: resume
0 6 * * 2  /scripts/pause.sh      # 6 AM Tuesday: pause
0 22 * * 2 /scripts/resume.sh     # 10 PM Tuesday: resume
0 6 * * 3  /scripts/pause.sh      # 6 AM Wednesday: pause
0 22 * * 3 /scripts/resume.sh     # 10 PM Wednesday: resume
0 6 * * 4  /scripts/pause.sh      # 6 AM Thursday: pause
0 22 * * 4 /scripts/resume.sh     # 10 PM Thursday: resume
0 6 * * 5  /scripts/pause.sh      # 6 AM Friday: pause
0 22 * * 5 /scripts/resume.sh     # 10 PM Friday: resume
# No pause entry on the weekend, so the job runs from Friday 10 PM straight through to the Monday 6 AM pause
 

Problem with soft media errors on SSD drives and FastCache

4/25/2012 Update:  EMC has released a fix for this issue.  Call your account service representative and say you need to upgrade your NS-960 DART code to 6.0.55.300 and FLARE to 4.30.000.5.524, plus a drive firmware upgrade on all SSD drives to TC3Q.

Do you have FastCache enabled on your array?  Keep a close eye on your SP event logs for soft media errors on your SSD drives.  I just noticed over 2000 soft media errors on one of my FastCache enabled arrays, and found a technical advisory from EMC (emc282741) that describes this as a potentially critical problem.  I just opened a case with EMC for my array to be reviewed for a possible disk replacement.  In the event a second disk drive in the same FastCache RAID group encounters soft media errors before the system automatically retires the first drive, a dual-faulted RAID group could occur.  This can result in storage pools going offline and becoming completely inaccessible to the attached hosts.  That’s basically a total SAN outage, not good.

Look for errors like the following in your SP event logs:

“Date Stamp”  “Time Stamp” Bus1 Enc1 Dsk0  820 Soft Media Error [Bad block]

EMC states in emc282741 that enhancements are targeted for Q1 2012 to address SSD media errors and dual hardware faults, but in the meantime, make sure you review the SP logs if you have CLARiiON or VNX arrays that are configured with SSD disk drives or are using FAST Cache.  If any instance of the “Soft Media Error” listed above is associated with any one of the solid state disk drives in your arrays, the array should be upgraded to at least FLARE Release 04.30.000.5.522 (for CX4 Series arrays) or Release 05.31.000.5.509 (for VNX Series arrays) and then start a Proactive Copy (PACO) to a hot spare and replace the drive as soon as possible.

In order to quickly review this on each of my arrays, I wrote the following script to update my intranet site with a report every morning:

naviseccli -h clariion1a getlog >clariion1a.txt
naviseccli -h clariion1b getlog >clariion1b.txt
cat clariion1a.txt | grep -i 'soft media' >clariion1_softmedia_errors.csv
cat clariion1b.txt | grep -i 'soft media' >>clariion1_softmedia_errors.csv
./csv2htm.pl -e -T -i /home/scripts/clariion1_softmedia_errors.csv -o /<intranet_web_server>/clariion1_softmedia_errors.html
 

The script dumps the entire SP log from each SP into a text file, greps for only soft media errors in each file, then converts the output to HTML and writes it to my intranet web server.

 

Powerpath commands in AIX causing unexpected errors / initialization errors.

We recently had a problem with one of our AIX VIO servers not being able to run any PowerPath commands.  Any attempt to run a command would result in an unexpected error or initialization error.   After speaking to EMC about it, the root cause is usually either running out of space on the root filesystem or having the data and stack ulimit parameters set too low after adding a large number of new LUNs.   We are running AIX 6.1 on an IBM pSeries 550 with PowerPath 5.3 HF1.

Here are the errors that were popping up:

root@vioserver1:/script # powermt config
Unexpected error occured.

root@vioserver1:/script # powermt display dev=all
Initialization error.

root@vioserver1:/script # naviseccli -h <san_dns_name> lun -list -all
evp_enc.c(282): OpenSSL internal error, assertion failed: inl > 0
ksh: 503926 IOT/Abort trap(coredump)

Having too many LUNs caused the issue; we had recently added an additional 35 for a total of 70.  Increasing the data and stack parameters to ‘unlimited’ resolved the problem.  The ulimit output after the change looked like this:

root@vioserver1:/script # ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        unlimited
memory(kbytes)       unlimited
coredump(blocks)     2097151
nofiles(descriptors) 2000
threads(per process) unlimited
processes(per user)  unlimited
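
For reference, this is one way the change could be made persistent on AIX. It is a sketch only; the chuser attributes and the -1 (unlimited) value are standard AIX limits settings, but verify them for your environment, and note that chuser changes take effect at the next login:

chuser data=-1 stack=-1 root   # persist unlimited data and stack limits for root (adjust the user as needed)
ulimit -d unlimited            # raise the limits in the current shell as well
ulimit -s unlimited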

VMWare/ESX can’t write to a Celerra Read/Write NFS mounted datastore

I had just created several new Celerra NFS mounted datastores for our ESX administrator.  When he tried to create new VM hosts using the new datastores, he would get this error:   Call “FileManager.MakeDirectory” for object “FileManager” on vCenter Server “servername.company.com” failed.

Searching for that error message on powerlink, the VMWare forums, and general google searches didn’t bring back any easy answers or solutions.  It looked like ESX was unable to write to the NFS mount for some reason, even though it was mounted as Read/Write.  I also had the ESX hosts added to the R/W access permissions for the NFS export.

After much digging and experimentation, I did resolve the problem.  Here’s what you have to check:

1. The VMKernel IP must be in the root hosts permissions on the NFS export.   I put in the IP of the ESX server along with the VMKernel IP.

2. The NFS export must be mounted with the no_root_squash option.  By default, the root user with UID 0 is not given access to an NFS volume; mounting the export with no_root_squash allows the root user access.  The VMkernel must be able to access the NFS volume with UID 0.

I first set up the exports and permissions settings in the GUI, then went to the CLI to add the mount options.
command:  server_mount server_2 -option rw,uncached,sync,no_root_squash <sharename> /<sharename>

3. From within the ESX Console/Virtual Center, the Firewall settings should be updated to add the NFS Client.   Go to ‘Configuration’ | ‘Security Profile’ | ‘Properties’ | Click the NFS Client checkbox.

4. One other important item to note when adding NFS mounted datastores is the default limit of 8 in ESX.  You can increase the limit by going to ‘Configuration’ | ‘Advanced Settings’ | ‘NFS’ in the left column | Scroll to ‘NFS.MaxVolumes’ on the left, increase the number up to 64.  If you try to add a new datastore above the NFS.MaxVolumes limit, you will get the same error in red at the top of this post.

That’s it.  Adding the VMKernel IP to the root permissions, mounting with no_root_squash, and adding the NFS Client to ESX resolved the problem.

Unable to provision Celerra storage?

This one really made no sense to me at first.  I was attempting to create a new file storage pool on our NS960 Celerra.  Upon launching the disk provisioning wizard, it would pause for a minute and then give the following error:

“ERROR: Unable to continue provisioning.  click for details”

It was strange because I have plenty of disks that could be used for provisioning.  Why wasn’t it working?

Here is the detailed error message:

Message Details:

Message: Unable to continue provisioning

Full Description:  Not able to fetch disk information. n Command Failed, error code: 1, output: errormessage:string=”Timeout (60 seconds) waiting for state SS_DISKS_LOADED”

Recommended Action: No recommended action is available. Go to http://powerlink.EMC.com for more information.

Event Code: 15301214354

As a workaround and to test the issue, I used 8 spare SATA drives to create a new RAID group with one large LUN using all of the space.  I added it to the Celerra storage group and rescanned the SAN for storage.

The following error popped up:

Brief Description:  Invalid credentials for the storage array APM01034413494. 
Full Description:  The FLARE version on this storage array requires secure communication. Saved credentials are found, but authentication failed. The Control Station does not have valid credentials. 
Recommended Action:  Set the credentials by running the “nas_storage -modify <backend_name> -security” command. 
Message ID:  13422231564 

Well, how about that.  An error message that actually gives you the command to resolve the problem. 🙂  As it turns out, one of our other SAN administrators had changed the password for the system account.  Running nas_storage -modify id=<xx> -security resolved the problem.

(Note: You can get the ID number by running nas_storage -list)

DM Interconnect failure with Celerra Replicator

We just installed a new VNX 5500 a few weeks ago in the UK, and I initially set up a VDM replication job between it and its replication partner, an NS-960 in Canada.  The setup went fine with no errors, and replication of the VDM completed successfully every day up until yesterday, when I noticed that the status on the main replications screen said “network communication has been lost”.   I am able to use the server_ping command to ping the data mover/replication interface from the UK to Canada, so network connectivity appears to be OK.

I was attempting to set up new replication jobs for the filesystems on this VDM, and the background tasks to create the replication jobs are stuck at “Establishing communication with secondary side for Create task” with a status of “Incomplete”.

I went to the DM interconnect next to validate that it was working, and the validation test failed with the following message: “Validate Data Mover Interconnect server_2:<SAN_name>. The following interfaces cannot connect: source interface=10.x.x.x destination interface=10.x.x.x, Message_ID=13160415446: Authentication failed for DIC communication.”

So why was the DM Interconnect failing?   It had been working fine for several weeks!

My next stop was the server log (server_log server_2), where I spotted another issue: hundreds of entries that looked just like these:

2011-07-07 16:32:07: CMD: 6: CmdReplicatev2ReversePri::startSecondary dicSt 16 cmdSt 214
2011-07-07 16:32:10: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:10: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16
2011-07-07 16:32:12: CIC: 3: <DicXmlSyncMsgService> Sending Cmd to 10.x.x.x failed (16=Bad authentication)
2011-07-07 16:32:12: CMD: 3: DicXmlSyncRequest::sendMessage sendCmd failed:16

Bad Authentication? Hmmm.  There is something amiss with the trusted relationship between the VNX and the NS960.  I did a quick read of EMC’s VNX replication manual (yep, rtfm!) and found the command to update the interconnect, nas_cel.

First, run nas_cel -list to view all of your interconnects, noting the ID number of the one you’re having difficulty with.

[nasadmin@<name> ~]$ nas_cel -list
id    name          owner mount_dev  channel    net_path                                      CMU
0     <name_1>  0                               10.x.x.x                                   APM007039002350000
2     <name_2>      0                           10.x.x.x                                   APM001052420000000
4     <name_3>      0                           10.x.x.x                                   APM009015016510000
5     <name_4>       0                           10.x.x.x                                  APM000827205690000

In this case, I was having trouble with <name_3>, which is ID 4.

Run this command next:  nas_cel -update id=4.   After that command completed, my interconnect immediately started working and I was able to create new replication jobs.